Published 2026-06-02 · 20 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If your product puts people in a live call — a webinar platform, a virtual classroom, a telemedicine consult, a town-hall broadcast — captions have moved from a nice extra to something users expect and, increasingly, something the law requires. The engineering question is not whether to add them but where the speech recognition runs, because that one choice decides your cloud bill, whether captions look the same for everyone in the room, and whether the feature even works on a cheap phone. This article is for the product manager, founder, or engineering lead who has to specify a captions feature, size its cost, and talk to engineers about it without getting lost in protocol acronyms. By the end you will understand the single architectural pattern that production conferencing systems converge on, why it is so much cheaper than the obvious alternative, and the accessibility deadline that turns "we should add captions" into "we need captions by a date." For the speech-recognition engines this pattern feeds, see our deep-dive on streaming ASR with Deepgram, Whisper, and AssemblyAI; this article is about the conferencing architecture around them.
What A Live Caption Actually Is
Start with the thing itself, because the word "captions" hides a useful distinction. A caption is a line of text, time-aligned to speech, that says what is being said right now and who is saying it. It differs from a subtitle in intent: subtitles assume you can hear and translate the words for you, while captions assume you cannot hear and so also note the speaker and meaningful sounds. For a live call the practical form is simple — a strip of text at the bottom of the video that updates as people talk, usually labelled with each speaker's name.
There is a standard format underneath most caption text on the web, and it is worth naming because it determines how caption data is structured. It is called WebVTT (the Web Video Text Tracks format), a W3C specification whose basic unit is a "cue": a chunk of text with a start time, an end time, and the words to show. A cue looks about as plain as it sounds:
00:00:12.000 --> 00:00:15.500
<v Maria>Let's start with the budget.
That <v Maria> tag is the speaker label — knowing who spoke is part of a caption, not decoration. WebVTT was designed for pre-recorded files where every cue's timing is known in advance, so a live call does not ship a finished WebVTT file; it produces these cues on the fly and streams them. But the mental model holds: a live caption track is a sequence of timed, speaker-labelled cues, generated as the meeting happens.
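As a minimal sketch of that mental model, the cue above can be produced from four fields. The `Cue` class and the helper names are illustrative, not part of any library, and the sketch does not escape special characters in the caption text, which a real WebVTT writer would need to do.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float   # seconds from the start of the track
    end: float     # seconds from the start of the track
    speaker: str
    text: str


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as the hh:mm:ss.ttt timestamp WebVTT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(cue: Cue) -> str:
    """Render one cue: a timing line, then the voice-tagged text."""
    timing = f"{vtt_timestamp(cue.start)} --> {vtt_timestamp(cue.end)}"
    return f"{timing}\n<v {cue.speaker}>{cue.text}"


# Reproduces the two-line cue shown above.
print(to_vtt(Cue(12.0, 15.5, "Maria", "Let's start with the budget.")))
```

In a live call the same four fields are what gets streamed to clients; the text rendering matters mainly if you later export the meeting transcript as a WebVTT file.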
The engine that turns speech into those cues is an Automatic Speech Recognition system, abbreviated ASR — software that listens to audio and writes down the words. Everything in this article is about where you put that engine in a multi-person call and how its output reaches everyone's screen.
The Accessibility Clock You Are Already On
Before the architecture, the reason there is urgency. Live captions are not only a courtesy to people who are deaf or hard of hearing; for many products they are a legal requirement with a date attached.
The reference standard is the Web Content Accessibility Guidelines, known as WCAG, maintained by the W3C. One of its success criteria, numbered 1.2.4 and titled "Captions (Live)," states plainly that captions must be provided for all live audio content in synchronized media. It sits at conformance Level AA — the middle of WCAG's three levels and the one most laws and contracts point to. If your product hosts live audio with video, Level AA means live captions, full stop.
What turned that guideline into a deadline is a 2024 rule from the US Department of Justice under the Americans with Disabilities Act. It requires state and local government bodies to meet WCAG 2.1 Level AA for their web content and mobile apps, and it carries hard dates: entities serving populations of fifty thousand or more must comply by April 24, 2026, and smaller ones by April 26, 2027. Any platform that sells to public universities, courts, city governments, or public schools inherits that deadline through its customers. For US television and certain internet video the Federal Communications Commission has required captioning for years; the DOJ rule extends the pressure squarely into the web-conferencing and webinar world.
The takeaway is not legal advice — get your own counsel — but a planning fact: if your buyers include public institutions or large enterprises, "captions" likely has a compliance date in 2026, and the architecture you pick has to be in place before it.
The SFU You Already Have
Now the architecture, starting with a piece you almost certainly already run. In a call with more than two or three people, the audio and video do not fly directly between everyone; that would force each person's device to send a separate copy to every other person, which collapses on phones and weak connections. Instead the streams pass through a server in the middle called a Selective Forwarding Unit, or SFU. Each participant sends one copy of their audio and video up to the SFU, and the SFU forwards — "fans out" — the right streams down to everyone else.
Picture a 30-person all-hands. Without the SFU, the speaker's phone would upload 29 copies of its audio. With the SFU, the phone uploads one copy, and the server sends that one copy out to the 29 listeners. That fan-out — one stream in, many streams out — is the SFU's whole job, and it is the reason group calls work at all. We cover the broader role of the SFU and the browser hooks around it in the WebRTC AI integration article.
Here is the insight the rest of this article rests on: the SFU is the one place in the system that already sees every participant's audio, separated by speaker, in real time. That makes it the natural place to attach speech recognition. You are not adding a new pipe; you are tapping a pipe that already carries exactly what the ASR engine needs.
Two Places To Put The Ear: Client-Side Versus Server-Side
There are exactly two places you can run the speech recognition, and choosing between them is the real decision. The difference is easiest to see by counting.
The first option is client-side: every participant's own device transcribes the audio it hears. This feels natural — the audio is already arriving at each device to be played — and it keeps voices off your servers, which sounds private. But count the cost. In a 30-person meeting, if every device transcribes the call, you are running thirty separate speech-recognition jobs for one conversation. Each device burns battery and processor time; a cheap phone may not manage it at all and will drop frames or overheat. Worse, the thirty devices produce thirty slightly different transcripts, because each ran its own engine on its own slightly different received audio — so two people in the same meeting see different captions, and your recording, if you keep one, matches none of them.
The second option is server-side: the SFU taps each speaker's audio once, runs the recognition centrally, and sends the resulting text to everyone. Now count again. The same 30-person meeting needs recognition only for the people actually talking — usually one, sometimes two. You run one or two recognition jobs instead of thirty. Every participant sees the exact same caption because there is one source of truth. The text is tiny to distribute. And it works identically on a flagship laptop and a three-year-old phone, because the phone only has to display text, not generate it.
Stated that way the choice looks lopsided, and for group calls it is. Client-side recognition has a real niche — a one-to-one call, an offline or privacy-locked product where audio must never leave the device, or as a fallback when the server path is down — and we cover that path in its own article on client-side ASR with faster-whisper in the browser. But for any meeting with a crowd, server-side recognition is the pattern that scales, and the SFU is where it lives.
Figure 1. The core decision: every device transcribing on its own versus one transcription at the SFU, fanned out to all. For group calls the right panel wins on cost, consistency, and device compatibility.
The Fan-Out Pattern, Step By Step
Put the pieces together and you get the pattern this article is named for. It reuses the SFU's existing job — fanning out media — and adds a second, much smaller fan-out: text.
Walk the path of a sentence. Maria speaks. Her device sends one audio stream up to the SFU, exactly as it already does so the others can hear her. A server-side process subscribes to that stream — it is just another consumer of the audio the SFU is already forwarding — and pipes Maria's audio into a streaming ASR engine. The engine returns text. That text is packaged as caption cues and handed back to the SFU, which fans it out to all 30 participants over a side channel meant for data rather than media. Each participant's app receives the cue and draws it on screen under Maria's video.
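The last hop of that path, one cue in and the same cue out to everyone, can be sketched in a few lines. `CaptionRoom` is a hypothetical in-memory stand-in for the SFU's text side channel, not a real SFU API; it only shows that a single published cue yields identical text on every screen.

```python
from typing import Callable

CaptionHandler = Callable[[dict], None]


class CaptionRoom:
    """Hypothetical stand-in for the SFU's text side channel."""

    def __init__(self) -> None:
        self._handlers: dict[str, CaptionHandler] = {}

    def join(self, participant: str, handler: CaptionHandler) -> None:
        self._handlers[participant] = handler

    def publish(self, cue: dict) -> int:
        """Send one cue to every participant; return how many received it."""
        for handler in self._handlers.values():
            handler(cue)
        return len(self._handlers)


room = CaptionRoom()
screens: dict[str, list[str]] = {}
for i in range(30):
    name = f"participant-{i}"
    screens[name] = []
    # Each participant's app just draws the text it receives.
    room.join(name, lambda cue, lines=screens[name]:
              lines.append(f"{cue['speaker']}: {cue['text']}"))

# One recognition result, published once, reaches all 30 screens.
delivered = room.publish({"speaker": "Maria", "text": "Let's start with the budget."})
distinct_screens = len({tuple(lines) for lines in screens.values()})
print(delivered, distinct_screens)
```

The count of distinct screen contents is the consistency property in miniature: thirty receivers, one version of the caption.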
Two fan-outs, then, riding the same server: audio fans out so people can hear, and caption text fans out so people can read. The text fan-out is almost free compared with the audio one: a line of caption is a few dozen bytes, whereas a second of audio is thousands. Adding captions to a call that already runs through an SFU is therefore mostly a matter of wiring, not new infrastructure.
One refinement makes the pattern efficient instead of wasteful, and it is the hinge of the whole design: you do not transcribe every audio stream all the time. You transcribe only streams that currently contain speech. A small, cheap detector called Voice Activity Detection — VAD, software that answers the yes/no question "is anyone speaking in this audio right now?" — gates the expensive ASR. When Maria talks, her stream gets a recognition job; when she goes quiet, the job stops. Because in a real meeting only one or two people talk at once, VAD-gating means you pay for one or two recognition jobs no matter how many people are in the room.
Figure 2. The fan-out pattern: the SFU forwards audio as usual, a VAD gate sends only active speech to one ASR engine, and the resulting caption text is fanned back out to every participant.
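The VAD gate can be sketched as a small state machine in front of the ASR engine. This is an illustration under loud assumptions: a simple loudness threshold stands in for a real voice-activity detector (production systems typically use a trained model), and the `threshold` and `hangover` values are made up for the example.

```python
import math
from typing import Iterable, Iterator, Optional


def rms(frame: list[int]) -> float:
    """Root-mean-square level of one frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def gate(frames: Iterable[list[int]], threshold: float = 500.0,
         hangover: int = 2) -> Iterator[tuple[str, Optional[list[int]]]]:
    """Yield ("start", None), ("audio", frame) and ("stop", None) events.

    Only frames between a start and a stop would be sent to, and billed by,
    the ASR engine. `hangover` keeps the job open for a few quiet frames so
    short pauses between words do not restart it.
    """
    active = False
    quiet = 0
    for frame in frames:
        if rms(frame) >= threshold:
            if not active:
                active = True
                yield ("start", None)
            quiet = 0
            yield ("audio", frame)
        elif active:
            quiet += 1
            if quiet > hangover:
                active = False
                yield ("stop", None)
            else:
                yield ("audio", frame)  # trailing silence still goes to ASR
    if active:
        yield ("stop", None)


# Synthetic stream: 5 silent frames, 3 loud frames, 10 silent frames.
loud, silent = [2000, -2000] * 80, [0] * 160
stream = [silent] * 5 + [loud] * 3 + [silent] * 10
events = [kind for kind, _ in gate(stream)]
billed = events.count("audio")
print(events[0], events[-1], billed, "of", len(stream), "frames sent to ASR")
```

Running one such gate per participant track is what makes the number of recognition jobs follow the number of people talking rather than the number of people in the room.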
The Cost Math, Out Loud
The reason the pattern matters commercially is a number you can compute on one line, so let us compute it. Streaming speech recognition is billed by the minute of audio processed. A representative 2026 rate is Deepgram's Nova-3 streaming tier at about $0.0077 per minute, which is roughly $0.46 per hour of audio.
Take a one-hour, 30-person meeting. The naive approach — transcribe every participant's track the whole time — runs 30 streams for 60 minutes:
naive cost = 30 streams × 60 min × $0.0077/min
naive cost = 1,800 stream-minutes × $0.0077
naive cost = $13.86 for one meeting
Now the fan-out pattern with VAD gating. In an hour-long meeting, suppose there is on average one active speaker at a time, occasionally two — call it 1.5 streams of actual speech:
fan-out cost = 1.5 streams × 60 min × $0.0077/min
fan-out cost = 90 stream-minutes × $0.0077
fan-out cost = $0.69 for one meeting
That is the whole argument in two sums: $13.86 against $0.69, a twentyfold difference, for captions that are also more consistent and work on weaker devices. Scale it to a product running ten thousand such meetings a month and the gap is roughly $138,600 versus $6,900 — the difference between a feature that quietly bleeds money and one that barely registers on the bill. The lever is VAD gating, and forgetting it is the most expensive mistake in this whole topic.
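The same arithmetic as a small function, so you can plug in your own rate, meeting size, and speaking pattern. The rate is the article's example figure, and the 1.5 average active streams is the same assumption used in the sums above.

```python
RATE_PER_MIN = 0.0077  # USD per stream-minute, the example rate used above


def caption_cost(streams: float, minutes: float,
                 rate: float = RATE_PER_MIN) -> float:
    """Cost of transcribing `streams` concurrent streams for `minutes` minutes."""
    return streams * minutes * rate


naive = caption_cost(30, 60)      # every participant, the whole hour
fan_out = caption_cost(1.5, 60)   # VAD-gated: about 1.5 active speakers on average
ratio = naive / fan_out

print(f"naive ${naive:.2f}, fan-out ${fan_out:.2f}, ratio {ratio:.0f}x")
print(f"10,000 meetings: ${naive * 10_000:,.0f} vs ${fan_out * 10_000:,.0f}")
```

Note that the ratio does not depend on the vendor's rate at all: it is simply participants divided by average active speakers, which is why switching to a cheaper engine cannot close the gap that VAD gating does.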
Partial Versus Final: The Caption That Rewrites Itself
There is a behaviour of live captions that confuses people the first time they see it, and understanding it prevents a class of bugs. Watch a live caption closely and you will see the words appear, then change, then settle. "I think we should" becomes "I think we should ship" becomes "I think we should ship it Friday." The caption is rewriting itself in place. This is not a glitch; it is how streaming recognition works, and your design has to expect it.
Streaming ASR engines emit two kinds of results. The first is a partial — also called an interim result — a fast, rough guess at what is being said, sent while the person is still talking. It is marked as not final, and it will very likely change as more audio arrives and the engine reconsiders. The second is a final result, sent when the engine is confident a phrase is complete and the text will not change again. The engine decides a phrase is complete using endpointing — detecting the small silence or drop in speech that marks the end of an utterance, the same VAD idea applied to find boundaries rather than just presence.
The trade-off is responsiveness against stability. Show only finals and captions feel laggy, appearing a second or two after each phrase ends. Show partials and captions feel instant but flicker as they correct themselves. Production systems show partials so the caption feels live, then lock each line when its final arrives: the words you just read stop moving and the next rough guess begins below. Your client code must therefore treat each incoming partial as "replace the current in-progress line", not "append a new line", until the final arrives. Teams that append every partial end up with captions that stutter and repeat; teams that replace-until-final get the clean, settling behaviour users expect. We go deeper on interim-versus-final mechanics in the streaming ASR article.
Figure 3. A caption's life: fast interim guesses that update in place, then a locked final line once endpointing detects the end of the phrase.
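The replace-until-final rule is short enough to show whole. This is a sketch of the client-side state, assuming each incoming result carries the text plus a boolean final flag (named `is_final` here for illustration); `CaptionView` is a made-up class, not a framework type.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionView:
    """What one participant's caption strip shows."""
    locked: list[str] = field(default_factory=list)  # final lines, never change
    in_progress: str = ""                            # current partial, may change

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            self.locked.append(text)   # lock the line
            self.in_progress = ""      # the next partial starts a fresh line
        else:
            self.in_progress = text    # replace, do not append

    def render(self) -> list[str]:
        return self.locked + ([self.in_progress] if self.in_progress else [])


view = CaptionView()
for text, is_final in [
    ("I think we should", False),
    ("I think we should ship", False),
    ("I think we should ship it Friday.", True),
    ("Any", False),
]:
    view.on_result(text, is_final)

# One locked line plus the new in-progress guess, with no repeated fragments.
print(view.render())
```

With several people talking at once, a real client would keep one in-progress line per speaker rather than a single shared one, so that overlapping partials do not overwrite each other.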
Delivering The Text: The Data Channel
The captions are generated; now they have to reach 30 screens. They do not travel in the audio or video — those are continuous media streams with no room for arbitrary text. They travel on a separate path built into WebRTC for exactly this kind of payload: the data channel.
The data channel is the standardized way for browsers in a WebRTC session to send arbitrary data, text or bytes, to each other, defined in the IETF specification RFC 8831, "WebRTC Data Channels." It is worth knowing one design choice it offers, because it maps neatly onto captions. The channel can be configured as reliable and ordered, meaning every message is guaranteed to arrive and to arrive in the sequence it was sent, or as partially reliable and unordered, meaning the system may drop or reorder messages to favour speed. The underlying transport, called SCTP, makes reliability and ordering a property of each message rather than the whole connection, so a product can even mix the two. In the browser API these options are chosen when a channel is created, so in practice mixing means opening one channel per delivery mode.
For captions the usual choice is reliable and ordered: a missed or out-of-order caption line is jarring, and the text volume is so small that guaranteeing delivery costs almost nothing. The one case for relaxing it is partial results — since each partial will be replaced moments later by the next partial or the final, dropping one mid-stream is harmless, so some systems send partials best-effort and finals reliably. Either way, the caption cue — text, speaker label, and timing, the same fields a WebVTT cue carries — is serialized, sent once from the SFU into the data channel, and fanned out to every participant, who deserializes it and renders the line.
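A sketch of what serializing a cue and picking a delivery mode might look like. JSON is an assumption (any compact encoding works), `CaptionMessage` and its field names are illustrative, and the option names in `channel_options` mirror the browser's `RTCDataChannelInit` fields (`ordered`, `maxRetransmits`); your SFU or SDK may expose the same choice under different names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CaptionMessage:
    speaker: str
    text: str
    start: float      # seconds since the meeting started
    end: float
    is_final: bool


def encode(msg: CaptionMessage) -> bytes:
    """Serialize one caption cue for the data channel."""
    return json.dumps(asdict(msg), ensure_ascii=False).encode("utf-8")


def decode(payload: bytes) -> CaptionMessage:
    """Rebuild the cue on the receiving client."""
    return CaptionMessage(**json.loads(payload.decode("utf-8")))


def channel_options(is_final: bool) -> dict:
    """Delivery settings, named after the browser's RTCDataChannelInit fields."""
    if is_final:
        return {"ordered": True}                    # reliable and ordered
    return {"ordered": False, "maxRetransmits": 0}  # best-effort partials


cue = CaptionMessage("Maria", "Let's start with the budget.", 12.0, 15.5, True)
wire = encode(cue)
print(len(wire), "bytes on the wire; round-trips:", decode(wire) == cue)
print(channel_options(cue.is_final), channel_options(False))
```

Even with JSON field names included, the payload stays far smaller than a second of audio, which is why sending finals reliably costs almost nothing.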
This is also where conferencing frameworks hide the plumbing. LiveKit, for instance, exposes transcriptions as a first-class stream that its server publishes to clients and its client SDKs surface as ready-to-render text events, so application code subscribes to captions rather than parsing the data channel by hand. Open-source stacks do the same in their own idiom — Jitsi's server-side transcriber, Jigasi, sends speech-to-text results back into the meeting as JSON that the Jitsi Meet client turns into on-screen subtitles. The transport underneath is the data channel; the framework just gives it a friendlier name.
The Latency Budget For A Caption
Users forgive captions that trail speech by a moment; they do not forgive captions that lag by many seconds. So it helps to add up where a caption's delay comes from, because the total is a budget you design against — the same discipline we apply to the whole call in the sub-100-millisecond latency budget article, here applied to text.
A caption's journey has four legs. First, the speaker's audio travels up to the SFU — tens of milliseconds on a decent network. Second, the ASR engine listens and emits a partial; this is the big one, typically a few hundred milliseconds for a fast streaming engine to produce its first guess. Third, the caption text travels from the SFU down to the participants over the data channel — again tens of milliseconds, because text is tiny. Fourth, the client renders it. Add the legs:
caption delay ≈ 80 ms (audio up) + 300 ms (ASR partial) + 60 ms (text down) + render
caption delay ≈ under half a second for the first partial guess
So a well-built captions feature shows a rough line within roughly half a second of someone speaking, then settles it into a final a second or two later as the phrase completes. The dominant term is the ASR engine — which is why the streaming engine's first-partial latency, not the network, is the number to compare when choosing a vendor. Anything that adds a second on top, such as routing audio through an extra mixing step before recognition or waiting for finals before showing anything, is felt immediately by users and should be challenged in design review.
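The budget is easy to keep as a small table in code and re-check whenever the design changes. The first three figures are the ones used above; the render time is an assumption (about one frame at 60 Hz), as is the one-second mixing step used to show how quickly the budget breaks.

```python
BUDGET_MS = 500  # target for the first partial, from the text above

legs_ms = {
    "audio up to SFU": 80,
    "ASR first partial": 300,
    "text down to clients": 60,
    "client render": 16,   # assumption: about one frame at 60 Hz
}

total = sum(legs_ms.values())
dominant = max(legs_ms, key=legs_ms.get)
print(f"first partial after ~{total} ms (budget {BUDGET_MS} ms), "
      f"dominated by {dominant}")

# Hypothetical design change: an extra mixing step before recognition.
with_mixer = total + 1000
print(f"with a 1 s mixing step: ~{with_mixer} ms, "
      f"over budget: {with_mixer > BUDGET_MS}")
```

Replacing the ASR figure with each candidate vendor's measured first-partial latency turns this into a quick comparison tool.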
Client-Side Versus Server-Side: The Decision Table
The two approaches line up cleanly against the criteria that matter when you specify the feature.
| Criterion | Client-side ASR (each device) | Server-side ASR fan-out (at the SFU) |
|---|---|---|
| Recognition jobs per 30-person call | Up to 30 (one per device) | 1–2 (only active speakers, VAD-gated) |
| Caption consistency | Each device differs | One source of truth, identical for all |
| Works on a cheap phone | Often not — drains battery, may fail | Yes — phone only displays text |
| Cloud cost | None (runs on devices) | Low and bounded by speakers, not listeners |
| Audio leaves the device | No — privacy-preserving | Yes — audio reaches your server |
| Recording / search / translation | Hard — no central transcript | Easy — central transcript already exists |
| Best fit | 1:1 calls, offline / privacy-locked, fallback | Group calls, webinars, anything recorded |
Read it by your product. A group webinar or classroom that you also record and want to translate is squarely server-side: you get one transcript that captions, archives, and feeds translation all at once. A privacy-locked one-to-one product where audio must never touch a server is client-side, accepting the higher device cost. Most conferencing products are the former, which is why the fan-out pattern is the default.
A Common Mistake: Transcribing Every Track, Always
The single most expensive error in this area is the one the cost math already hinted at: running speech recognition on every participant's audio stream continuously, instead of only on streams that currently carry speech. It is an easy mistake because it is the simplest thing to build (subscribe to all tracks, pipe them all to the ASR engine, done), and it looks fine in a two-person test, where both participants are talking most of the time anyway.
Then it ships to a 30-person all-hands and the bill is twentyfold what it should be, as the math above showed, for no benefit — the 28 silent participants generate empty transcripts that cost full price. The fix is not a faster engine or a cheaper vendor; it is the VAD gate. Detect speech first, transcribe second. A second, related slip is leaving recognition running through long silences on a single stream; the same gate solves it. Treat "is anyone actually talking on this stream right now" as a question you must answer cheaply before you spend money answering "what are they saying," and the cost problem disappears.
Build With A Framework, Buy An API, Or Both
Once the pattern is settled, the practical choice is what you assemble it from. Three layers each have a build-or-buy decision, and they are independent.
The media server is the first. You can run an open-source SFU — mediasoup, Janus, Jitsi — and own the audio tap yourself, or use a hosted real-time platform such as LiveKit that gives you the SFU plus a built-in transcription stream so captions are closer to a configuration than a project. Self-hosting trades engineering time for control and lower per-minute cost; the hosted route trades a fee for speed to launch.
The recognition engine is the second, and here you are almost always buying an API. Streaming ASR vendors differ on the axes that matter for live captions — first-partial latency, accuracy on noisy real-world audio, language coverage, and price. Deepgram leads on low latency and commodity pricing, AssemblyAI and Speechmatics are strong on accuracy over messy audio, and self-hosted Whisper or NVIDIA's speech stack is the route when audio cannot leave your infrastructure or per-minute fees must go to zero at scale. We compare them in detail in the streaming ASR article; the conferencing pattern in this article is the same whichever engine you pick.
The third is everything around the raw transcript: labelling who spoke, which is speaker diarization when speakers are not already separated by track; cleaning the audio first with noise suppression so the engine hears words instead of keyboard clatter; and, if you serve multiple languages, feeding the same transcript into real-time translation. The fan-out pattern is the spine; these are the muscles you add to it.
Where Fora Soft Fits In
We build the live-video products where captions are now table stakes — video conferencing platforms, virtual classrooms and e-learning, telemedicine consultations, and live broadcast and webinar tools — and we build them on the SFU-side fan-out pattern this article describes because it is the version that survives contact with a real customer's bill. The discipline we apply is the one argued here: transcribe at the media server you already run, gate recognition with voice-activity detection so cost tracks speakers rather than seats, treat partials as in-progress lines that lock on final, and deliver the text on a reliable data channel so every participant reads the same caption. Because we work in education and healthcare, we also plan for the accessibility deadline from the start rather than bolting captions on later, and we keep the central transcript ready to feed recording, search, and translation, since a customer who asked for captions this quarter usually asks for those next.
What To Read Next
- Streaming ASR in production — Deepgram, Whisper, AssemblyAI
- WebRTC + AI: Insertable Streams, Encoded Transform, and the SFU
- Real-time multilingual speech translation in calls
Talk To Us / See Our Work / Download
- Talk to a video engineer about adding live captions to your WebRTC product → /services/webrtc-development
- See our case studies in conferencing, e-learning, telemedicine, and live broadcast → /cases
- Download the Live Captions Engineering Cheat Sheet (one page, printable) → Download the cheat sheet
References
- W3C. WebVTT: The Web Video Text Tracks Format, W3C Candidate Recommendation, 4 April 2019, accessed 2026-06-02. https://www.w3.org/TR/webvtt1/. Primary standards source for the caption-text format: the cue model (start time, end time, payload), the <v> voice/speaker tag, and WebVTT's design for marking up time-aligned text tracks. Cited as the de-facto shape of caption data even though a live call streams cues rather than shipping a finished file.
- W3C. Web Content Accessibility Guidelines (WCAG) 2.2, Success Criterion 1.2.4 Captions (Live), Level AA, W3C Recommendation, 5 October 2023, accessed 2026-06-02. https://www.w3.org/TR/WCAG22/#captions-live. Primary standards source for the requirement that captions be provided for all live audio content in synchronized media at conformance Level AA. SC 1.2.4 is unchanged from WCAG 2.0/2.1, so the requirement predates and underpins the DOJ rule.
- IETF. RFC 8831: WebRTC Data Channels, Standards Track, January 2021, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc8831. Primary standards source for the caption-delivery transport: data channels over SCTP, ordered vs unordered and reliable vs partially-reliable delivery as per-message properties, and the WebRTC String/Binary payload protocol identifiers. Establishes why captions are usually sent reliable-and-ordered and why partials may be sent best-effort.
- IETF. RFC 8832: WebRTC Data Channel Establishment Protocol (DCEP), Standards Track, January 2021, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc8832. Companion standards source defining how a data channel is opened and negotiated between peers, the channel the caption fan-out runs over.
- US Department of Justice, Civil Rights Division. Fact Sheet: Final Rule on the Accessibility of Web Content and Mobile Apps Provided by State and Local Governments (ADA Title II), 28 CFR Part 35, 2024, accessed 2026-06-02. https://www.ada.gov/resources/2024-03-08-web-rule/. Primary legal source for the WCAG 2.1 Level AA compliance requirement and the compliance dates: 24 April 2026 for public entities serving populations of 50,000+ and 26 April 2027 for smaller entities. Defines why "captions" carries a 2026 deadline for products selling to US public institutions.
- Deepgram. Pricing and Streaming Speech-to-Text documentation, accessed 2026-06-02. https://deepgram.com/pricing. Vendor source for the representative 2026 streaming rate (Nova-3 streaming ≈ $0.0077/min ≈ $0.46/hr) used in the cost example, and for the streaming API's interim/final result model and sub-300 ms latency claims. Rate used as an illustrative commodity benchmark, attributed to Deepgram, not asserted as a universal price.
- Deepgram. Interim Results and Endpointing and Interim Results developer documentation, accessed 2026-06-02. https://developers.deepgram.com/docs/interim-results. Vendor engineering source for the partial-vs-final mechanics: interim results marked is_final: false, finalized segments marked is_final: true, and endpointing/VAD as the means of detecting utterance boundaries. Used for the "caption that rewrites itself" section.
- LiveKit. Transcriptions and Text and transcriptions documentation, accessed 2026-06-02. https://docs.livekit.io/agents/voice-agent/transcriptions/. First-party platform source for a production fan-out implementation: a server-side agent subscribes to a participant's audio track, runs STT, and the framework forwards transcription segments (associated with a Participant and Track) to clients in real time over the session, with the data-channel plumbing wrapped as a transcription stream.
- Jitsi (8x8). Jigasi: server-side transcription gateway (README and transcription configuration), accessed 2026-06-02. https://github.com/jitsi/jigasi. First-party open-source source for the server-side transcriber pattern: Jigasi sends each participant's audio to an external speech-to-text service (Google Cloud, Vosk, a Whisper flavour, Oracle Cloud) and returns results to Jitsi Meet as JSON that the client renders as on-screen subtitles, or as plain text posted to chat.
- AssemblyAI. Best APIs and models for real-time speech recognition (2026) and Benchmarks, accessed 2026-06-02. https://www.assemblyai.com/blog/best-api-models-for-real-time-speech-recognition-and-transcription. Vendor source for the streaming-ASR comparison axes (first-partial latency, real-world WER, language coverage) cited in the build-vs-buy section; vendor claims attributed, cross-checked against the FutureAGI and CodeSOTA 2026 comparisons below.
- FutureAGI. Speech-to-Text APIs in 2026: Benchmarks, Pricing, and a Developer's Decision Guide, 2026, accessed 2026-06-02. https://futureagi.com/blog/speech-to-text-apis-in-2026-benchmarks-pricing-developer-s-decision-guide/. Independent 2026 comparison source for streaming latency and WER figures across Deepgram, AssemblyAI, Whisper, and ElevenLabs, used to keep the vendor characterization current and balanced rather than relying on any single vendor's self-report.
- Fishjam (Software Mansion). Real-Time Audio Transcription API: How to Turn Speech to Text During Live Conferencing, accessed 2026-06-02. https://fishjam.swmansion.com/blog/real-time-audio-transcription-api-how-to-turn-speech-to-text-during-live-conferencing-f77e2ff3f4de. Engineering source for the server-side tap pattern (SFU as the central audio router, server-side consumer piping per-participant audio into a streaming ASR) and the PCM 16 kHz audio-format expectation of most ASR engines.


