Published 2026-05-31 · 21 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If your video product turns speech into text while a person is still speaking, you are running streaming ASR, whether you built it or bought it. A live-captioning overlay on a webinar needs words on screen within a second or two of being spoken. A voice agent that books appointments needs to know the moment the caller stopped talking so it can answer without cutting them off. A meeting note-taker needs a running transcript it can summarise the instant the call ends. A telemedicine consultation needs an accurate, low-latency record of a doctor's spoken notes. All four are the same problem — converting a live audio stream into incremental text — and all four break in the same ways: words appear too late, the early guesses keep rewriting themselves, the system talks over the user, or the bill is ten times what the prototype suggested. This article gives a product manager, founder, or engineering lead enough understanding to choose between the three dominant options in 2026, to know what "300 milliseconds of latency" actually buys, and to avoid the cost and accuracy traps that sink most first deployments.
What "Streaming ASR" Actually Means
Start with the two words. Automatic speech recognition, abbreviated ASR, is the technology that takes recorded sound containing speech and produces the text of what was said. It is the engine inside dictation, captions, and voice assistants. The word streaming describes when the work happens: the audio is processed as it arrives, live, rather than after a complete recording has been saved. The opposite of streaming is batch — you hand the system a finished audio file and wait for the full transcript to come back.
The difference matters more than it sounds. In batch mode, the system has the luxury of seeing the whole recording before it commits to any word. It can use what comes after a tricky word to decide what that word was, the way you can re-read a confusing sentence to the end before you understand its first half. In streaming mode, the system has to emit text while the speaker is mid-sentence, with no knowledge of what comes next. It is the difference between translating a speech you have a transcript of and interpreting one live in a booth. Streaming is harder, and it is harder in ways that show up directly in the user experience.
The single most important concept in streaming ASR is the split between a partial result and a final result. A partial is the system's best current guess, and it is allowed to change as more audio arrives. A final is a word the system has committed to and will not revise. Picture live captions where the text flickers and rewrites itself for a moment before settling — the flickering text is a stream of partials, and the moment it stops changing is the final. Every streaming system makes a trade between showing partials early (fast, but jittery and sometimes wrong) and waiting for finals (stable, but slower). Understanding that trade is most of understanding streaming ASR.
Figure 1. Batch ASR sees the whole recording before committing to a word; streaming ASR must emit text mid-sentence, trading early partials against stable finals.
How A Speech Model Hears: The 30-Second Window
To understand why streaming is hard, you need one fact about how modern speech models work. A model like Whisper does not listen to raw sound. The audio is first turned into a mel spectrogram — a picture of how the energy in the sound is spread across pitches over time, scaled to roughly match how the human ear perceives loudness. If you want the underlying audio-engineering detail of what a mel spectrogram is and why it is shaped that way, that belongs to a different topic and we cover it in the audio fundamentals for video lesson; here, treat it as "the image of the sound that the model actually reads".
Whisper reads that image in fixed blocks of 30 seconds. That number is baked into the model: it was trained on 30-second windows, and it expects 30 seconds of context to do its best work (Radford et al., 2022). For batch transcription this is fine — you slice the recording into 30-second pieces and the model handles each one. For streaming it is a problem, because you do not have 30 seconds of future audio when the speaker has only said three words. You either wait 30 seconds (unacceptable latency) or you feed the model a short, incomplete window and accept that its guess will be worse and may change.
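To make the window constraint concrete, here is a minimal sketch of what happens when a short streaming buffer is forced into the fixed 30-second input. It assumes the 16 kHz sample rate and 10 ms frame stride described in the Whisper paper; `fit_to_window` and `padding_fraction` are hypothetical helper names used only for illustration, not part of any Whisper library.

```python
SAMPLE_RATE_HZ = 16_000   # Whisper resamples all audio to 16 kHz
WINDOW_SECONDS = 30       # fixed input length the model was trained on
HOP_MS = 10               # stride between spectrogram frames

WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_SECONDS      # 480,000 samples
WINDOW_FRAMES = WINDOW_SECONDS * 1000 // HOP_MS       # 3,000 spectrogram frames


def fit_to_window(samples: list[float]) -> list[float]:
    """Force a buffer to exactly one 30-second window (hypothetical helper).

    Longer buffers are cut; shorter ones are padded with silence (zeros).
    """
    if len(samples) >= WINDOW_SAMPLES:
        return samples[:WINDOW_SAMPLES]
    return samples + [0.0] * (WINDOW_SAMPLES - len(samples))


def padding_fraction(seconds_of_speech: float) -> float:
    """Share of the window that is padding rather than real audio."""
    real = min(seconds_of_speech, WINDOW_SECONDS)
    return 1.0 - real / WINDOW_SECONDS
```

With only 1.5 seconds of audio in the buffer, `padding_fraction(1.5)` is about 0.95: the model is asked to guess from a window that is roughly 95% padding.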
This is the root cause of nearly every streaming-ASR difficulty. The model wants context; streaming denies it context. Everything below is a strategy for giving the model as much context as possible while still emitting text quickly.
Option 1 — Whisper, Made To Stream
Whisper is OpenAI's open-source speech model, released in 2022 and published under the permissive MIT license, which means you can run it on your own servers for free and ship it inside a commercial product (Radford et al., 2022). It is a Transformer encoder-decoder — the encoder reads the mel spectrogram of the audio, and the decoder writes out the text one token at a time, the same architecture family that powers large language models. It was trained on 680,000 hours of audio collected from the web, labelled by what the paper calls weak supervision: the training pairs were not hand-checked by experts but harvested at scale, which is why Whisper holds up unusually well across accents, background noise, and 99 languages, and also why it occasionally hallucinates — emitting fluent text for audio that contained no speech, such as during a long silence.
The catch is the one from the previous section: Whisper is batch-only by design. The decoder was never built to emit text incrementally. To make it stream, you wrap it in a loop that repeatedly runs the batch model on a growing buffer of audio and decides which words are stable enough to commit. The open-source reference for that loop is Whisper-Streaming, published by researchers at Charles University in 2023 (Macháček et al., 2023).
The LocalAgreement Trick
The heart of Whisper-Streaming is a rule for deciding when a word is safe to commit, called LocalAgreement-2. The idea is simple enough to state in one sentence: only commit a word once two consecutive passes of the model agree on it. The system keeps a buffer of recent audio. Every time a new chunk of sound arrives, it runs Whisper on the whole buffer and gets a fresh transcript. It then compares the new transcript with the previous one and finds the longest common prefix — the longest run of words from the start that both transcripts share. Those agreed words become finals; everything after them stays a partial, waiting for the next pass to confirm it.
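The rule above can be sketched in a few lines. This is a simplified illustration, not the Whisper-Streaming code: it assumes every pass transcribes the same buffer from the same starting point and returns a plain list of words, whereas the real implementation also tracks timestamps and buffer trimming. The class and method names are hypothetical.

```python
def longest_common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest run of words, counted from the start, that both lists share."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Commit a word only once two consecutive passes agree on it."""

    def __init__(self) -> None:
        self.committed: list[str] = []   # finals: never revised
        self.previous: list[str] = []    # unconfirmed tail from the last pass

    def update(self, hypothesis: list[str]) -> tuple[list[str], list[str]]:
        """Feed the transcript from the latest pass over the buffer.

        Returns (newly_final_words, current_partial_words).
        Assumption: the hypothesis starts with the already committed words.
        """
        tail = hypothesis[len(self.committed):]
        agreed = longest_common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        self.previous = tail[len(agreed):]
        return agreed, self.previous
```

Feeding three successive passes, `["the", "cat"]`, then `["the", "cat", "sat", "on"]`, then `["the", "cat", "sat", "in", "the"]`, commits nothing on the first pass, `the cat` on the second, and `sat` on the third; the abandoned guess `on` never becomes final.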
This works because a word that two consecutive passes both produce, each from a slightly different amount of audio, is very likely correct. A word that appears in one pass and vanishes in the next was a guess the model abandoned once it heard more. The LocalAgreement policy was not invented for Whisper: it came from simultaneous-translation research (Liu et al., 2020), and a system built on it won the IWSLT 2022 shared task. The Whisper-Streaming authors found that agreeing over two passes (the "2" in the name) was the sweet spot between speed and stability (Macháček et al., 2023).
A few supporting tricks make the loop practical. The buffer is trimmed whenever a sentence ends, so it never grows past Whisper's 30-second limit. The last 200 confirmed words are fed back into the model as a prompt, giving it the inter-sentence context it would otherwise lose. And an optional voice activity detector, abbreviated VAD — a small, cheap model that decides whether a slice of audio contains speech at all — can be switched on to skip silence, which both improves accuracy and cuts cost.
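To show where a VAD sits in the loop, here is a deliberately crude stand-in: an energy gate that flags a frame as speech when its loudness crosses a threshold. Production VADs are small trained models, not energy thresholds, so treat this only as an illustration of the interface. The function names and the 0.02 threshold are assumptions chosen for the example.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square loudness of one frame of samples in the range -1..1."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(frames: list[list[float]], threshold: float = 0.02) -> list[bool]:
    """True for each frame loud enough to be treated as speech (assumed threshold)."""
    return [rms(frame) >= threshold for frame in frames]


def keep_speech(frames: list[list[float]], threshold: float = 0.02) -> list[list[float]]:
    """Drop quiet frames so the recogniser is never run on silence."""
    return [frame for frame in frames if rms(frame) >= threshold]
```

Skipping the quiet frames before they reach the recogniser is what saves compute, and it also removes the long silences on which Whisper is prone to hallucinate.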
Figure 2. Whisper-Streaming confirms a word only when two consecutive passes agree on it — the longest common prefix becomes final; the tail stays a partial.
What It Costs In Latency And Hardware
The numbers from the Whisper-Streaming paper are the honest baseline. On unsegmented long-form English speech, running the large model on an NVIDIA A40 GPU, the system reached an average latency of about 3.3 seconds with a minimum chunk size of one second, and word error rate roughly 2 percentage points worse than the same model in batch mode (Macháček et al., 2023). Latency is sensitive to hardware and to the chunk size: a smaller chunk lowers latency but gives the model less context, raising error rate.
A practical note on speed. The reference implementation does not run the original OpenAI Whisper code; it uses faster-whisper, a reimplementation built on the CTranslate2 inference engine that the authors report runs roughly four times faster than the standard implementation at 16-bit precision (Macháček et al., 2023). If you build on Whisper for streaming, you almost certainly run faster-whisper or a comparable optimised runtime, not the vanilla model — the vanilla model is too slow to keep up with live audio on affordable hardware.
The reason to choose Whisper is control and economics at scale. It is free to license, runs entirely inside your own infrastructure (which matters for medical or regulated audio that cannot leave your servers), and supports 99 languages out of one model. The reason not to choose it is that you now own a GPU fleet, a streaming loop, and a latency budget that a hosted API would have handed you for a few cents an hour.
Option 2 — Deepgram, Built For Real Time
Deepgram is a hosted speech-to-text API designed for streaming from the start, rather than a batch model bent into a streaming shape. You open a websocket — a two-way network connection that stays open so audio can flow up and text can flow down continuously — push audio chunks into it, and receive partial and final transcripts back. Its 2026 flagship model is Nova-3.
The figures Deepgram publishes for Nova-3 put it near the front of the field: about 6.8% word error rate on streaming real-world audio across domains like medical, finance, and call-center recordings, with latency under 300 milliseconds (Deepgram, 2026). It supports more than 45 languages and ships the features production transcription needs: speaker diarization (labelling who spoke each segment), smart formatting (turning "two thirty" into "2:30"), and keyterm prompting (telling the model in advance about unusual words like product names so it gets them right).
On price, Nova-3 streaming runs at about $0.0077 per minute of audio, which works out to roughly $0.46 per hour (Deepgram, 2026). Deepgram also offers $200 in free credits to new accounts, which is enough audio to build and test a real prototype before paying anything.
The Voice-Agent Problem: Knowing When To Speak
In 2026 Deepgram shipped a separate model, Flux, aimed squarely at voice agents — automated systems that hold a spoken conversation with a person. The hardest part of a voice agent is not transcribing the words; it is knowing the exact moment the human has finished their turn so the agent can respond without interrupting, and without leaving an awkward pause. Traditional systems guess this from silence alone — wait for half a second of quiet and assume the person is done — which fails when someone pauses mid-thought or trails off.
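The silence-only approach is easy to sketch, which also makes its failure mode easy to see. The function below is a hypothetical illustration that takes per-frame speech/no-speech flags (for example from a VAD) and fires once it has seen half a second of quiet; the 20 ms frame size is an assumption.

```python
from typing import Optional


def silence_end_of_turn(
    speech_flags: list[bool],
    frame_ms: int = 20,
    silence_ms: int = 500,
) -> Optional[int]:
    """Time in ms at which a silence-only detector declares the turn over.

    Returns None if the required run of quiet frames never occurs.
    """
    frames_needed = silence_ms // frame_ms
    heard_speech = False
    quiet_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:           # ignore silence before the user starts talking
            quiet_run += 1
            if quiet_run >= frames_needed:
                return (i + 1) * frame_ms
    return None


# One second of speech, a 600 ms pause to think, then one more second of speech.
mid_thought_pause = [True] * 50 + [False] * 30 + [True] * 50
fired_at_ms = silence_end_of_turn(mid_thought_pause)
```

Here `fired_at_ms` is 1500: the detector declares the turn finished in the middle of the pause, a full second before the speaker actually stops, and the agent would start talking over them.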
Flux folds end-of-turn detection directly into the speech model, using the meaning and rhythm of the speech rather than silence alone, and reports end-of-turn decisions in under 400 milliseconds (Deepgram, 2026). In April 2026 Deepgram extended Flux to ten languages with the ability to switch language mid-conversation (Deepgram, 2026). If you are building a voice agent rather than a captioning or note-taking feature, this turn-detection capability is often the deciding factor, and it is one the open-source Whisper loop does not give you out of the box.
Option 3 — AssemblyAI, Built For Voice Agents
AssemblyAI is the third dominant hosted option, and its streaming product, Universal-Streaming, launched in mid-2025 with voice agents as the explicit target (AssemblyAI, 2025). It makes one design choice that sets it apart and is worth understanding even if you never use it: its transcripts are immutable from the moment they arrive.
Recall the partial-versus-final trade from earlier. Most streaming systems emit mutable partials that keep rewriting themselves until a final lands, which forces the application developer to choose between acting on fast-but-unstable partials or waiting for slow-but-stable finals. AssemblyAI flips this: every word it emits is final the instant you receive it and will never be revised. It emits those immutable words within about 300 milliseconds, and the company reports it does so 41% faster at the median than Deepgram Nova-3 — 307 milliseconds versus 516 milliseconds for word emission — and nearly twice as fast at the 99th percentile (AssemblyAI, 2025). The benefit for a voice agent is concrete: the agent can start "thinking" about its response while the user is still talking, because the words it has already received are guaranteed not to change underneath it.
Universal-Streaming is priced at a flat $0.15 per hour of session time, with unlimited concurrent streams and no per-stream surcharge (AssemblyAI, 2025). Like Deepgram's Flux, it builds end-of-turn detection in natively, combining acoustic and semantic cues with traditional silence detection rather than relying on silence alone. The newer Universal-3 Pro Streaming tier adds real-time speaker diarization (for an extra $0.12 per hour), keyterm prompting, and 99-plus-language support, and the company reports about 6.3% mean word error rate across English domains for that tier (AssemblyAI, 2026). AssemblyAI also publishes drop-in integrations for the voice-orchestration platforms most teams use — LiveKit and Pipecat among them — so wiring it into a WebRTC call is a configuration task, not a research project. For how those real-time AI pieces fit inside a live video call, see the real-time AI in the WebRTC pipeline lesson; for fanning one transcript out to many viewers, see the live captions SFU fan-out lesson.
The Three Side By Side
The table below collects the published 2026 figures so you can compare on the axes that actually decide a deployment. Treat the latency and accuracy numbers as vendor-reported on each provider's own test sets — independent results vary by audio domain, and the only number that matters in the end is the one you measure on your own audio.
| Criterion | Whisper (self-hosted) | Deepgram Nova-3 | AssemblyAI Universal-Streaming |
|---|---|---|---|
| Type | Open-source model, you run it | Hosted streaming API | Hosted streaming API |
| License / hosting | MIT, your own GPUs | Vendor cloud | Vendor cloud |
| Streaming WER (vendor / paper) | ~2 pts above its batch WER | ~6.8% | ~6.3% (Universal-3 Pro) |
| Latency (word emission) | ~3.3 s avg (A40, 1 s chunk) | <300 ms | ~300 ms, immutable |
| Partials revised? | Yes (until LocalAgreement confirms) | Yes (mutable partials) | No (immutable from the start) |
| End-of-turn detection | Build it yourself | Native in Flux | Native |
| Languages | 99 | 45+ | 99+ (Pro) |
| Price (streaming) | GPU cost only | ~$0.46 / hr | $0.15 / hr (+$0.12 diarization) |
| Best when | Data must stay on-prem; huge scale | Lowest latency; broad features | Voice agents; immutable transcripts |
Sources: Macháček et al. (2023); Deepgram (2026); AssemblyAI (2025, 2026).
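Measuring word error rate on your own audio takes little code once you have a human-checked reference transcript. The sketch below uses the standard definition (substitutions plus deletions plus insertions, divided by the number of reference words), computed as a word-level edit distance. It only lowercases and splits on whitespace; real evaluations usually also normalise punctuation and numbers, which this example does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "book the table for two" against the reference "book a table for two" gives one substitution in five words, a WER of 0.2.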
Worked Example — What A Live-Captioning Feature Costs
Numbers in the abstract do not help a budget decision, so put real ones through. Suppose you are adding live captions to a webinar product, and you expect 2,000 hours of streamed audio per month: say, 1,000 hour-long webinars with two speakers each transcribed as a separate stream, or any equivalent mix.
On AssemblyAI Universal-Streaming at $0.15 per hour, the monthly transcription bill is:
2,000 hours × $0.15/hour = $300 per month.
On Deepgram Nova-3 streaming at about $0.46 per hour, the same volume is:
2,000 hours × $0.46/hour = $920 per month.
Now the self-hosted Whisper path. The model is free, so the cost is the GPU. A single mid-range cloud GPU instance capable of running faster-whisper in real time costs on the order of $0.50 to $1.50 per hour to rent, and one such instance can handle several concurrent streams. But a webinar product does not stream evenly across the month — 2,000 hours might be concentrated into business-hours peaks that need many instances at once and none overnight. If you provision for a peak of, say, ten concurrent GPU instances at $1.00 per hour and they run an average of 200 hours each per month, that is:
10 instances × 200 hours × $1.00/hour = $2,000 per month — in raw GPU cost alone, before the salary of the engineer who keeps the fleet running.
This is the trap that catches most teams. At 2,000 hours a month, the hosted API is dramatically cheaper than self-hosting, because you only pay for audio actually processed and the provider absorbs the cost of idle capacity. Self-hosting wins on per-hour cost only at very high, very steady volume — independent analyses put the break-even somewhere around hundreds of thousands of minutes per month, and only once you count the engineering time to operate it. Below that line, "free model" is the most expensive option on the menu.
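The three monthly figures above can be reproduced, and re-run with your own volumes, using a small calculator. The prices are the ones quoted in this article; the instance count, hours per instance, and GPU rate are the illustrative assumptions from the worked example, not measurements.

```python
AUDIO_HOURS_PER_MONTH = 2_000

ASSEMBLYAI_PER_HOUR = 0.15   # Universal-Streaming flat rate
DEEPGRAM_PER_HOUR = 0.46     # Nova-3 streaming, rounded from $0.0077 per minute


def hosted_monthly_cost(audio_hours: float, price_per_audio_hour: float) -> float:
    """Hosted APIs bill only for audio actually processed."""
    return round(audio_hours * price_per_audio_hour, 2)


def self_hosted_monthly_cost(
    instances: int, hours_each: float, price_per_gpu_hour: float
) -> float:
    """Self-hosting bills for every hour an instance is up, busy or idle."""
    return round(instances * hours_each * price_per_gpu_hour, 2)


def cost_per_audio_hour(monthly_cost: float, audio_hours: float) -> float:
    """Effective price of one hour of transcribed audio."""
    return round(monthly_cost / audio_hours, 2)


assemblyai = hosted_monthly_cost(AUDIO_HOURS_PER_MONTH, ASSEMBLYAI_PER_HOUR)  # 300.0
deepgram = hosted_monthly_cost(AUDIO_HOURS_PER_MONTH, DEEPGRAM_PER_HOUR)      # 920.0
whisper = self_hosted_monthly_cost(10, 200, 1.00)                             # 2000.0
```

Dividing the self-hosted bill by the audio volume gives an effective $1.00 per audio hour in this scenario, against $0.15 and $0.46 for the hosted options. The gap comes from provisioning for peak, so the variable worth experimenting with is how many GPU hours you pay for per hour of audio served.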
Common Mistake — Acting On Partials As If They Were Final
The most damaging mistake we see, and the one that is invisible in a demo, is treating a partial transcript as if it were a committed final. In a quick test, partials look great — words appear fast and usually correct. In production, a partial that gets revised after you have already acted on it produces a class of bug that is maddening to trace.
A voice agent that fires an action the moment a partial says "cancel my order" will cancel the order even when the user actually said "cancel my order — no wait, change the address instead", because the early partial was committed to before the correction arrived. A live-translation feature that translates each partial produces a jittering, self-correcting mess on screen. A medical-notes tool that saves a partial as the record can store a word the model later abandoned.
The fix is to know exactly which results your provider guarantees are stable and to act only on those. With AssemblyAI, every emitted word is immutable, so this bug cannot occur — that is the point of the design. With Deepgram and with a Whisper LocalAgreement loop, you must wait for the explicit final (Deepgram marks it; LocalAgreement confirms it after two agreeing passes) before taking any irreversible action, and use partials only to update a display that the user understands may still change. Build the partial-versus-final distinction into your data model on day one, not after the first confusing incident.
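One way to build that distinction into the data model is to keep committed text and the current partial in separate fields, so that no code path can act on a partial by accident. This is a minimal sketch with hypothetical names; the `is_final` flag stands for whatever stability signal your provider or your LocalAgreement loop supplies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TranscriptEvent:
    text: str
    is_final: bool   # assumed to come from the provider's stability guarantee


@dataclass
class Transcript:
    finals: list[str] = field(default_factory=list)  # committed segments
    partial: str = ""                                # latest unstable guess

    def apply(self, event: TranscriptEvent) -> None:
        if event.is_final:
            self.finals.append(event.text)
            self.partial = ""            # the final supersedes the partial
        else:
            self.partial = event.text    # replace, never accumulate

    def display_text(self) -> str:
        """For the screen only: may still change."""
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)

    def committed_text(self) -> str:
        """The only text that irreversible actions may read."""
        return " ".join(self.finals)
```

In the "cancel my order" example, the display shows the partial as soon as it arrives, while `committed_text()` stays empty until the final lands with the user's correction included. An action handler that reads only `committed_text()` never sees the abandoned guess.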
How To Choose — Four Questions
The decision comes down to four questions, in order.
First, must the audio stay on your own servers? If you are handling medical consultations, legal recordings, or anything under strict data-residency rules that forbid sending audio to a third-party cloud, self-hosted Whisper is often the only option that satisfies compliance, and the operational cost is the price of that compliance. If the audio can leave your servers, a hosted API will almost always be cheaper and faster to ship.
Second, are you building a voice agent or a transcript? A voice agent — anything that talks back to the user — lives or dies on end-of-turn detection, so a model that builds it in (Deepgram Flux or AssemblyAI Universal-Streaming) saves you the hardest engineering problem in the project. A captioning or note-taking feature that never has to "know when to speak" does not need turn detection and can use the simplest, cheapest option that hits its latency target.
Third, what is your latency budget? If words must appear within a few hundred milliseconds (for a responsive voice agent, or live captions that feel instant), the hosted APIs at around 300 milliseconds are the realistic choice; a self-hosted Whisper loop at multi-second latency will feel sluggish. If a delay of roughly three seconds is acceptable, as it is for many captioning and post-call summarisation use cases, Whisper is back on the table.
Fourth, what volume, and how steady? At low or spiky volume, hosted APIs win on cost because you pay only for what you use. At very high, steady volume, with an ML engineering team to run it, self-hosting can win — but only above the break-even line, and only if you have already answered the first three questions in its favour.
Figure 3. Four questions, asked in order, route most streaming-ASR projects to the right option.
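The four questions can be written down as a short routing function, which is a handy way to make the order of the checks explicit in a design review. This is a sketch of the article's reasoning, not a rule: the function name, the return labels, and the 2,000 ms cut-off between "needs a hosted API" and "can tolerate a Whisper loop" are assumptions made for the example.

```python
def choose_option(
    must_stay_on_prem: bool,
    is_voice_agent: bool,
    latency_budget_ms: int,
    high_steady_volume: bool,
) -> str:
    """Route a project to an option by asking the four questions in order."""
    # 1. Compliance overrides everything else.
    if must_stay_on_prem:
        return "self-hosted Whisper"
    # 2. Voice agents need end-of-turn detection built into the model.
    if is_voice_agent:
        return "hosted API with native end-of-turn detection"
    # 3. Tight latency rules out a multi-second Whisper loop (assumed cut-off).
    if latency_budget_ms < 2000:
        return "hosted API"
    # 4. Only very high, steady volume justifies running your own GPUs.
    if high_steady_volume:
        return "self-hosted Whisper, if above break-even"
    return "hosted API"
```

A webinar captioning feature with no residency constraint, a one-second latency budget, and spiky volume resolves at the third question to a hosted API; the fourth question is never reached.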
Where Fora Soft Fits In
We build streaming-ASR features into the video products our clients ship — live captions for webinars and e-learning, real-time transcription for telemedicine consultations, voice-agent and meeting-assistant features for conferencing platforms, and searchable transcripts for OTT and surveillance archives. The recurring lesson across those projects is that the model choice is the easy part; the hard part is the integration: getting clean audio out of a WebRTC call, handling the partial-versus-final distinction without races, keeping latency inside the budget the product promised, and choosing the deployment that the client's compliance rules actually allow. We have shipped each of the three options described here, picked by the four questions above rather than by which model had the best benchmark that quarter.
What To Read Next
- WhisperX deep-dive — diarization and word-level timestamps
- Speaker diarization with Pyannote in production
- Real-time multilingual speech translation in calls
Talk To Us / See Our Work / Download
- Talk to a video engineer about adding live captions, transcription, or a voice agent to your product → /services/ai-software-development
- See our case studies in conferencing, telemedicine, e-learning, and OTT → /cases
- Download the Streaming ASR Provider Selection Worksheet (one page, printable)
References
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356, 2022. https://arxiv.org/abs/2212.04356. Primary source for Whisper's architecture (Transformer encoder-decoder), 680,000-hour weak-supervision training set, 30-second mel-spectrogram input window, 99-language coverage, MIT license, and the hallucination failure mode.
- Macháček, D., Dabre, R., Bojar, O. Turning Whisper into Real-Time Transcription System (Whisper-Streaming). arXiv:2307.14743, 2023. https://arxiv.org/abs/2307.14743. Primary source for the LocalAgreement-2 streaming policy, the ~3.3 s average latency on English ESIC (NVIDIA A40, 1 s chunk), the faster-whisper / CTranslate2 backend (~4× speed-up), buffer trimming, the 200-word prompt, and VAD trade-offs.
- Liu, D., Spanakis, G., Niehues, J. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. Interspeech 2020. The origin of the LocalAgreement policy that Whisper-Streaming adopts.
- Deepgram. Introducing Nova-3: Setting a New Standard for AI-Driven Speech-to-Text. deepgram.com/learn, 2025–2026, accessed 2026-05-31. Vendor source for Nova-3 streaming WER (~6.8%), sub-300 ms latency, 45+ languages, diarization, smart formatting, and keyterm prompting.
- Deepgram. Pricing — Scalable Speech-to-Text, Text-to-Speech & Voice Agent APIs. deepgram.com/pricing, accessed 2026-05-31. Vendor source for Nova-3 streaming price (~$0.0077/min, ~$0.46/hr) and the $200 free-credit offer.
- Deepgram. Introducing Flux: Conversational Speech Recognition and Introducing Flux Multilingual. deepgram.com/learn, 2025–2026, accessed 2026-05-31. Vendor source for model-integrated end-of-turn detection (<400 ms) and the April 2026 expansion to ten languages with mid-call switching.
- AssemblyAI. Introducing Universal-Streaming: Ultra-Fast, Ultra-Accurate Speech-to-Text for Voice Agents. assemblyai.com/blog, June 2, 2025, accessed 2026-05-31. Vendor source for immutable transcripts, ~300 ms word emission, the 41% faster-than-Nova-3 median (307 vs 516 ms), the $0.15/hr flat price with unlimited concurrency, and native intelligent endpointing.
- AssemblyAI. Introducing Universal-3 Pro Streaming and Pricing. assemblyai.com, 2026, accessed 2026-05-31. Vendor source for the Universal-3 Pro tier — ~6.3% mean English WER, real-time diarization (+$0.12/hr), keyterm prompting, and 99+ language support — plus LiveKit / Pipecat drop-in integrations.
- AssemblyAI. Top APIs and Models for Real-Time Speech Recognition and Transcription in 2026. assemblyai.com/blog, April 29, 2026, accessed 2026-05-31. Cross-vendor latency context used to frame the comparison table; where vendor self-reports conflict, the article presents them as vendor-reported rather than as settled fact.


