Published 2026-06-02 · 20 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
Language is the last hard barrier in live video. If your product puts people in a call — a global webinar, a tele-medicine consult with a patient who speaks another language, a virtual classroom, a customer-support line — translation decides whether half your potential users can take part at all. The engineering questions are concrete and they all have wrong answers that are expensive: where does the translation run, how much delay will users tolerate, do listeners hear a synthesized voice or read captions, and what is the bill per minute per language. This article is for the product manager, founder, or engineering lead who has to scope a translation feature, set a realistic latency target, and talk to engineers without drowning in acronyms. It uses consumer translation earbuds as the on-ramp — because everyone has now seen one work — and then shows the exact same machinery inside a WebRTC call. For a vendor-by-vendor procurement view of the engines this pipeline feeds, see our companion comparison of real-time speech translation vendors; this article is about the architecture around them.
What "Real-Time Speech Translation" Actually Means
Start with the thing itself, because "translation" hides four separate jobs that happen in sequence, and almost every design decision later is about how you arrange them.
Picture someone speaking Spanish into a microphone and you, an English speaker, needing to follow along live. Four things have to happen. First, the system has to hear the speech as a clean audio signal. Second, it has to write down the words that were spoken — turning the sound of Spanish into Spanish text. The software that does this is called Automatic Speech Recognition, abbreviated ASR; it is the same "speech-to-text" engine that powers dictation. Third, it has to translate that text from Spanish into English, the job of Machine Translation, abbreviated MT — the same kind of engine behind a web translator, but tuned for speed. Fourth, it has to deliver the English back to you, either as text on a screen or as a spoken voice produced by Text-to-Speech, abbreviated TTS — software that turns written words into natural-sounding audio.
So the classic shape is a chain: audio → ASR → MT → TTS → audio. Engineers call this a cascade, because each stage pours its output into the next like water down a series of steps. Hold that picture; it is the backbone of the whole article.
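To make the chain concrete, here is a minimal sketch of a cascade as three functions called in order. The engine functions (`fake_asr`, `fake_mt`, `fake_tts`) and the `CascadeResult` type are hypothetical stand-ins invented for illustration, not any vendor's API; a real system would stream audio in chunks rather than handle a whole utterance at once.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a function. These type aliases are illustrative only.
Asr = Callable[[bytes], str]   # source-language audio -> source-language text
Mt = Callable[[str], str]      # source-language text  -> target-language text
Tts = Callable[[str], bytes]   # target-language text  -> synthesized audio


@dataclass
class CascadeResult:
    source_text: str     # what ASR heard (can be logged or captioned)
    target_text: str     # what MT produced (the translated caption)
    target_audio: bytes  # what TTS would play to the listener


def run_cascade(audio: bytes, asr: Asr, mt: Mt, tts: Tts) -> CascadeResult:
    """Pour each stage's output into the next: audio -> ASR -> MT -> TTS."""
    source_text = asr(audio)
    target_text = mt(source_text)
    target_audio = tts(target_text)
    return CascadeResult(source_text, target_text, target_audio)


# Toy engines so the sketch runs without any external service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the bytes are the transcript


def fake_mt(text: str) -> str:
    return {"hola": "hello"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the text is the audio


result = run_cascade(b"hola", fake_asr, fake_mt, fake_tts)
```

Because every intermediate value is kept on the result, the text between stages stays available for logging and captions, which is the property the rest of the article leans on.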
There is a second, newer shape worth naming now because it changes the trade-offs. Instead of three separate engines, a single model can take in speech in one language and emit speech in another directly, without ever writing the words down in between. This is called direct speech-to-speech translation, or S2ST, and Meta's Seamless models, OpenAI's Realtime API, and Google's Gemini Live are the 2026 examples. We will compare the two shapes head to head once the pieces are on the table. For now, keep both in mind: the cascade is three engines in a row; the direct model is one engine that does the whole hop.
Your Earbuds Are A WebRTC Call You Can't See
The phrase people search for is "real time translation earbuds," so let us start there — and then pull the curtain back, because the earbud is the easiest way to understand the call.
When you wear translation earbuds and the person across the table speaks, here is what actually happens. The earbud's microphone captures their voice and sends it to a more powerful computer — your phone, or a server the phone talks to. That computer runs the exact four-step pipeline above: ASR writes down what was said, MT translates it, TTS speaks the result, and the translated voice plays back into your ear. The earbud did none of the translating. It is a microphone and a speaker — the ears and mouth — wired to a brain that lives somewhere else. The "magic earbud" is a marketing frame; the engineering reality is a tiny audio endpoint in front of a translation pipeline.
That single fact is why earbuds and video calls are the same problem. A WebRTC call — WebRTC being the standard technology that lets browsers and apps send live audio and video to each other — is also just audio endpoints in front of pipelines. Swap the earbud's microphone for a participant's laptop microphone, swap the earbud's speaker for the other participants' headphones, and the translation machinery in the middle does not change at all.
Where 2026's earbuds genuinely differ from each other is where the brain sits, and that is a real engineering choice you will face too. Apple's Live Translation, introduced with iOS 26 for AirPods Pro 2 and later, downloads the language models to your iPhone and runs the whole pipeline on the device, so the conversation never leaves your phone — private, but limited to the languages and compute the phone can hold. Google's Pixel and Samsung's Galaxy systems lean on a phone-plus-cloud split, with Google's running on Gemini models to reach 70-plus languages while preserving the speaker's tone. The trade is the oldest one in this field: on-device is private and works offline but is capped by the hardware; cloud is more capable and multilingual but sends audio off the device and needs a network. You meet the very same fork when you decide where the pipeline runs in your own product — on the client, or on the server.
Figure 1. The translation earbud is an audio endpoint, not a translator. The same pipeline sits behind every WebRTC participant; the only real choice is where the "brain" runs.
Cascade Versus Direct: Two Ways To Build The Brain
Now open the brain and look at the two shapes, because choosing between them is the first architecture decision and it ripples through everything else.
The cascade wires three specialist engines in a row: ASR writes the words, MT translates them, TTS speaks them. Its strengths come from that separation. You can pick the best speech-recognizer for your audio, the best translator for your language pair, and the best voice for your brand, and swap any one of them without touching the others. You can read the text between the stages, which means you can log it, show captions for free, run it past a glossary of your product's proper nouns, and see exactly which stage made a mistake when something goes wrong. That debuggability is why, as our vendor comparison puts it, cascades win procurement even when end-to-end wins demos.
The cascade's weaknesses also come from the separation. Errors compound: if ASR mishears a word, MT faithfully translates the wrong word, and TTS confidently speaks it. Each stage adds its own delay, so the total lag is the sum of three engines plus the network between them. And the original speaker's emotion — their pace, their emphasis, the rise of a question — is thrown away the moment speech becomes plain text, so the synthesized voice at the end sounds flat unless you work to restore it.
The direct model, speech-to-speech in one pass, attacks exactly those weaknesses. With no text stage in the middle, there is one engine's worth of delay instead of three, fewer places for errors to compound, and — in the expressive variants like Meta's SeamlessExpressive — the ability to carry the speaker's prosody and even voice through to the output. The cost is the loss of the seam you could inspect. There is no clean intermediate text to log, caption, or correct with a glossary, which makes the direct model harder to audit and to comply with in regulated settings like healthcare. The two approaches line up like this:
| Criterion | Cascade (ASR → MT → TTS) | Direct speech-to-speech (S2ST) |
|---|---|---|
| Latency | Three engines in series, more total delay | One pass, lower delay |
| Error behaviour | Errors compound stage to stage | Fewer hand-offs, fewer compounding errors |
| Captions for free | Yes — text already exists mid-pipeline | No native text; must add a parallel ASR |
| Glossary / proper nouns | Easy — correct the text stage | Hard — no text to correct |
| Voice & emotion | Lost at the text stage unless restored | Preserved in expressive models |
| Debug & audit | Easy — inspect each stage | Hard — one opaque hop |
| Best fit | Regulated, multi-language, caption-heavy | Latency-critical, natural-voice, 1:1 |
Read it by your product. A tele-medicine platform that must keep a written record and pass an audit wants the cascade. A two-person consumer call that should feel like the other person is speaking your language wants the direct model. Many production systems hedge: a cascade as the workhorse for its captions and control, with a direct model on the pairs where natural voice matters most. We go deeper on the engines themselves — Seamless, the OpenAI Realtime API, Gemini Live — in the speech-to-speech models article.
Figure 2. The cascade exposes a text seam you can caption and correct; the direct model hides it to cut delay and keep the voice. Most products use both.
The Translation That Rewrites Itself
There is a behaviour of live translation that surprises people the first time they watch it, and understanding it prevents a whole class of bugs. Watch a live translated caption and you will see it appear, then change, then change again, then settle. This is not a glitch. It is the unavoidable consequence of translating speech before the sentence is finished, and your design has to expect it.
The root cause is that languages put words in different orders. German, for instance, often holds the main verb until the very end of the sentence. If your system translates the first half of a German sentence into English the instant it hears it, it is guessing what the verb will be — and when the real verb finally arrives, the early English may have to be rewritten. A human simultaneous interpreter feels the same pressure and solves it by waiting: they stay a few seconds behind the speaker, holding the start of a sentence in mind until enough has arrived to translate it safely. Interpreters call that gap the ear-voice span, and research puts the workable range at roughly two to six seconds, with about three seconds the sweet spot. Wait too little and you translate the wrong thing; wait too much and the conversation stalls.
Software faces the identical trade-off, and it has a name: the read-write policy. At every moment the system decides whether to read (wait for more incoming speech) or write (commit to translating what it has so far). The simplest rule, called wait-k, is to stay a fixed number of words, k, behind the speaker: the word-count counterpart of the interpreter's roughly three-second lag. Smarter, adaptive policies wait longer when the sentence is ambiguous and commit sooner when it is clear.
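The wait-k rule is small enough to write down. The sketch below only produces the sequence of READ and WRITE decisions for a sentence of known length; it assumes one target word is written per step and does no translating. The function name `wait_k_schedule` is ours, chosen for illustration.

```python
from typing import List


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a wait-k policy.

    The policy reads k source words before its first write, then alternates
    one WRITE per READ. Once the source is exhausted it writes the rest.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read = 0
    written = 0
    while written < num_target:
        # The next target word may be emitted only after this many source
        # words have been read (capped at the length of the source).
        needed = min(k + written, num_source)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


schedule = wait_k_schedule(num_source=5, num_target=5, k=2)
```

With k = 2 and five words on each side, the schedule is READ, READ, then WRITE and READ alternating, then a final WRITE once the source has ended. A larger k buys stability at the price of lag, which is the same dial the interpreter turns.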
This is why your client code must treat an incoming translation the way live-caption code treats an interim guess: as a line that replaces the current in-progress line, not one that appends below it, until the system marks the line final. Teams that append every revision get captions that stutter and repeat; teams that replace-until-final get the clean, settling behaviour users expect. If you are also building the speech path, the same instability has a sharper edge — once TTS has spoken a translated phrase aloud, you cannot un-speak it, so the speech path has to wait for more stable input than the caption path before it commits. The discipline is the same one we describe for interim-versus-final results in live captions; translation just raises the stakes because word order can rewrite more of the line.
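The replace-until-final rule fits in a few lines of client state. This is a sketch under the assumption that each incoming update carries the text of the current line and a flag saying whether it is final; the `CaptionView` class and its method names are hypothetical, not taken from any captioning SDK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptionView:
    """Holds settled lines plus a single in-progress line that gets replaced."""

    final_lines: List[str] = field(default_factory=list)
    in_progress: str = ""

    def on_update(self, text: str, is_final: bool) -> None:
        if is_final:
            # Lock the line and clear the provisional slot.
            self.final_lines.append(text)
            self.in_progress = ""
        else:
            # Replace the provisional line; never append a revision below it.
            self.in_progress = text

    def render(self) -> List[str]:
        provisional = [self.in_progress] if self.in_progress else []
        return self.final_lines + provisional


view = CaptionView()
view.on_update("I have", is_final=False)
view.on_update("I have the package", is_final=False)
mid_sentence = view.render()
view.on_update("I sent the package back", is_final=True)
settled = view.render()
```

However many revisions arrive, the viewer only ever sees one provisional line at a time, and only the final version survives in the transcript.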
Figure 3. Different word orders force early translations to be provisional. The system stays a few seconds behind — like a human interpreter — and locks the line only when enough has arrived.
The Latency Budget, Out Loud
Users forgive a translation that trails the speaker by a moment; they do not forgive one that lags by many seconds, because past a point they start talking over it and the conversation collapses. So latency is a budget you design against, and it pays to add up where the delay comes from. Let us do the arithmetic for a cascade, the slower of the two shapes.
A translated phrase passes through six legs. The speaker's audio travels to wherever the pipeline runs (tens of milliseconds on a decent network). ASR listens and emits its first words. MT translates them. TTS produces the first chunk of spoken audio. That audio travels back to the listener. Finally the listener's device plays it. Plugging in representative 2026 numbers for a streaming cascade (the sum below covers the first five legs and leaves out playback on the listener's device):
translation delay ≈ 80 ms (audio to server)
+ 300 ms (ASR first partial)
+ 150 ms (MT)
+ 400 ms (TTS first chunk)
+ 60 ms (audio back)
+ intentional read-write lag
translation delay ≈ 1.0 s of machinery, plus the ~2-3 s you deliberately wait
Two lessons fall out of that sum. First, the deliberate wait — the ear-voice span you build in so word order can settle — is usually larger than all the machine delay combined. The engineering goal is not zero latency; it is honest, stable latency that mirrors how a human interpreter works. Second, the biggest machine term is TTS, because speaking takes time; one published 2026 analysis attributes a majority of cascade compute to the TTS stage, and the highest-value fix is to stream the spoken output chunk by chunk instead of waiting for a whole sentence.
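The same sum is easy to keep in code so the budget can be re-checked whenever a vendor or a network path changes. The figures are the representative ones from the arithmetic above; the dictionary keys and function names are illustrative.

```python
from typing import Mapping

# Representative per-leg delays in milliseconds, copied from the sum above.
MACHINE_LEGS_MS: Mapping[str, int] = {
    "audio_to_server": 80,
    "asr_first_partial": 300,
    "mt": 150,
    "tts_first_chunk": 400,
    "audio_back": 60,
}


def machine_delay_ms(legs: Mapping[str, int] = MACHINE_LEGS_MS) -> int:
    """Total delay contributed by the machinery, excluding the deliberate wait."""
    return sum(legs.values())


def total_delay_s(read_write_lag_s: float,
                  legs: Mapping[str, int] = MACHINE_LEGS_MS) -> float:
    """Machine delay plus the intentional read-write lag, in seconds."""
    return machine_delay_ms(legs) / 1000 + read_write_lag_s


def largest_leg(legs: Mapping[str, int] = MACHINE_LEGS_MS) -> str:
    """Name of the leg that dominates the machine budget."""
    return max(legs, key=lambda name: legs[name])


print(machine_delay_ms())  # 990
print(largest_leg())       # tts_first_chunk
```

Calling `total_delay_s(2.0)` or `total_delay_s(3.0)` adds the deliberate wait on top, which shows at a glance that the wait, not the machinery, is the larger share.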
It helps to anchor those numbers against the thresholds the field already agreed on. The standard for ordinary voice quality, ITU-T Recommendation G.114, puts the comfortable one-way mouth-to-ear delay for untranslated conversation below 150 milliseconds. The standard for professional remote simultaneous interpreting, from the interpreters' association AIIC, treats three to five seconds end-to-end as the sustainable upper bound. Between those poles, the practical rule that vendors and our own measurements converge on is simple: under about 800 milliseconds to the first translated word feels live; between 800 milliseconds and 1.5 seconds works for a lecture or keynote; past two seconds, listeners begin to talk over the translation. We treat the whole-call version of this discipline in the sub-100-millisecond latency budget article; here it is applied to the translated word.
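If you monitor time-to-first-translated-word in production, the practical rule above can be turned into a small classifier for dashboards or alerts. This is a sketch: the band labels are ours, and the text gives no verdict for the stretch between 1.5 and 2 seconds, so the code reports that band as unspecified rather than guessing.

```python
def classify_first_word_latency(ms: float) -> str:
    """Map time-to-first-translated-word onto the bands described in the text."""
    if ms < 0:
        raise ValueError("latency cannot be negative")
    if ms < 800:
        return "feels live"
    if ms <= 1500:
        return "works for a lecture or keynote"
    if ms <= 2000:
        # The article states no verdict for this band, so none is invented here.
        return "unspecified band between 1.5 s and 2 s"
    return "listeners begin to talk over the translation"


bands = [classify_first_word_latency(ms) for ms in (600, 990, 1800, 2500)]
```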
Figure 4. The machine delay is about a second; the deliberate interpreter-style wait is usually larger. Design against the published thresholds, and stream the TTS output because it dominates the machine budget.
Putting The Pipeline In A WebRTC Call
Everything so far works for one listener. A group call adds the interesting part, and it reuses a piece you almost certainly already run. In a call with more than a couple of people, the audio does not fly directly between everyone; it passes through a media server in the middle called a Selective Forwarding Unit, or SFU, which receives one audio stream from each participant and forwards the right streams to everyone else. We explain that server and the browser hooks around it in the WebRTC AI integration article.
The SFU is the one place that already sees every speaker's audio, separated by person, in real time — which makes it the natural home for the translation pipeline, exactly as it is the natural home for live captions. You tap a speaker's audio once at the SFU, run it through the pipeline once, and then fan the result out. The crucial twist for translation is that a group is rarely all one language. So the fan-out is per listener language: from one speaker's words you produce English for the English listeners, Japanese for the Japanese listeners, Arabic for the Arabic listeners, each translated once and delivered to the group that needs it.
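The per-listener-language fan-out can be sketched as a grouping step followed by one translation per distinct language. The `fan_out` function, the listener map, and the `fake_translate` stub are all hypothetical; a real SFU integration would work on streaming segments rather than whole strings.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text


def fan_out(source_text: str, source_lang: str,
            listeners: Dict[str, str], translate: Translate) -> Dict[str, str]:
    """Translate once per distinct listener language, then deliver to each listener.

    `listeners` maps a listener id to the language that listener selected.
    """
    by_lang: Dict[str, List[str]] = defaultdict(list)
    for listener_id, lang in listeners.items():
        by_lang[lang].append(listener_id)

    delivered: Dict[str, str] = {}
    for lang, ids in by_lang.items():
        # Listeners who share the speaker's language get the original text.
        text = source_text if lang == source_lang else translate(source_text, source_lang, lang)
        for listener_id in ids:
            delivered[listener_id] = text
    return delivered


calls: List[str] = []


def fake_translate(text: str, src: str, dst: str) -> str:
    calls.append(dst)  # record how many times the engine is invoked
    return f"[{dst}] {text}"


captions = fan_out(
    "hola a todos", "es",
    {"ann": "en", "bob": "en", "chie": "ja", "dana": "es"},
    fake_translate,
)
```

Four listeners, three languages, two engine calls: the two English listeners share one translation and the Spanish listener needs none.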
That output reaches listeners in one of two forms, and you will often offer both. The first is translated captions — text, delivered over the WebRTC data channel, the standard side-path for sending arbitrary data between participants defined in IETF RFC 8831. Captions are cheap to send, easy to make consistent, and let each person silently read along in their language. The second is translated audio — a synthesized voice produced by TTS and injected back into the call as a new audio track, carried by the same Opus codec (IETF RFC 6716) that already moves the call's voice. Audio feels more natural but raises a problem captions never have: the listener can now hear two voices, the original speaker and the translation, talking at once. The fix is to duck the original — automatically lower its volume while the translated voice plays — so the translation sits on top and the original stays as a quiet presence underneath, the way film dubbing leaves a trace of the source actor.
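Ducking itself is a small piece of arithmetic on audio samples. The sketch below mixes one frame of the original voice with one frame of translated voice, assuming mono floating-point samples in the range -1 to 1 at the same sample rate. The `duck_gain` of 0.2 is an arbitrary illustrative value, and a production mixer would ramp the gain smoothly rather than switch it per frame, to avoid audible clicks.

```python
from typing import List, Optional, Sequence


def mix_frame(original: Sequence[float],
              translated: Optional[Sequence[float]] = None,
              duck_gain: float = 0.2) -> List[float]:
    """Mix one audio frame, lowering the original while a translation plays.

    Pass translated=None when no translated voice is active for this frame.
    """
    if translated is None:
        return list(original)  # nothing to duck under: pass the original through
    if len(translated) != len(original):
        raise ValueError("frames must have the same number of samples")
    mixed = []
    for o, t in zip(original, translated):
        sample = o * duck_gain + t                 # quiet original under the translation
        mixed.append(max(-1.0, min(1.0, sample)))  # clamp to the valid range
    return mixed


untouched = mix_frame([0.5, -0.5])
ducked = mix_frame([0.5, -0.5], [0.1, 0.1])
```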
One more refinement keeps the whole thing affordable, and it is the same lever that governs captions: you do not translate every audio stream all the time. A cheap detector called Voice Activity Detection — software that answers "is anyone speaking in this stream right now?" — gates the expensive pipeline so it runs only on people who are actually talking. Because in a real meeting only one or two people speak at once, the cost tracks active speakers, not the number of seats. We cover that gating in detail in the live-captions fan-out article; it matters even more here, because a translation pipeline costs more per minute than plain transcription.
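The gate can be sketched as a filter in front of the pipeline. The detector here is a deliberately crude energy threshold, used only as a stand-in: real deployments typically use a trained voice-activity model, and the 0.02 threshold is an assumption chosen for the example.

```python
import math
from typing import Dict, Sequence


def is_speech(frame: Sequence[float], threshold: float = 0.02) -> bool:
    """Crude stand-in for a VAD: is the frame's RMS energy above a threshold?"""
    if not frame:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= threshold


def frames_to_translate(frames: Dict[str, Sequence[float]]) -> Dict[str, Sequence[float]]:
    """Keep only the participants whose current frame contains speech."""
    return {pid: frame for pid, frame in frames.items() if is_speech(frame)}


current = {
    "presenter": [0.3, -0.3, 0.2],
    "muted_guest": [0.0, 0.0, 0.0],
    "quiet_room": [0.0, 0.001, -0.001],
}
active = frames_to_translate(current)
```

Only the frames that pass the gate reach ASR, so the expensive stages run for the presenter alone.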
Figure 5. In a group call the pipeline lives at the SFU. Each speaker is translated once and fanned out per listener language, as captions on the data channel and optional ducked audio on an injected Opus track.
The Cost Math, Out Loud
The reason the architecture matters commercially is a number you can compute on one line, so let us compute it. Translation pipelines are billed per minute of audio processed, and — this is the trap — usually per language pair, not per session. A streaming cascade in 2026 lands in the rough range of $0.08 to $0.20 per minute of source audio at moderate volume; a heavily optimized self-hosted stack can reach a few cents.
Take a one-hour all-hands with one active speaker presenting, streamed to listeners in three other languages. The work is one translation pipeline per target language, running only while someone speaks. A single presenter talks for roughly the full hour, so even with voice-activity gating the active-speech time is about 60 minutes. At a representative $0.12 per minute per target language:
cost = 3 target languages × 60 min × $0.12/min
cost = 180 language-minutes × $0.12
cost = $21.60 for the one-hour event
Now notice the two levers. Add a fourth and fifth language and the bill rises in a straight line: translation cost scales with the number of target languages, which is why products cap the languages on offer rather than translating into all of them speculatively. And forget the voice-activity gate in, say, a thirty-person meeting, translating all thirty participants' tracks continuously instead of the one or two actually speaking, and you multiply the bill by the number of seats for no benefit, since twenty-eight silent people produce empty translations at full price. The discipline that keeps the number sane is the same as for captions: detect speech first, translate second, and translate only into languages a real listener has selected.
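Both levers are visible in a one-line cost function. The sketch uses the representative $0.12 per language-minute from the worked example; the function names are ours, and the ungated case assumes every seat's track is translated for the whole meeting.

```python
def gated_cost(target_languages: int, active_speech_minutes: float,
               price_per_language_minute: float) -> float:
    """Cost when the pipeline runs only while someone is speaking."""
    return target_languages * active_speech_minutes * price_per_language_minute


def ungated_cost(seats: int, meeting_minutes: float, target_languages: int,
                 price_per_language_minute: float) -> float:
    """Cost when every participant's track is translated for the whole meeting."""
    return seats * meeting_minutes * target_languages * price_per_language_minute


three_langs = gated_cost(3, 60, 0.12)         # the all-hands from the example
five_langs = gated_cost(5, 60, 0.12)          # two more target languages
no_gate = ungated_cost(30, 60, 3, 0.12)       # thirty seats, no voice-activity gate

print(f"${three_langs:.2f}  ${five_langs:.2f}  ${no_gate:.2f}")
```

Going from three to five languages raises the bill in proportion, while dropping the gate in a thirty-seat meeting multiplies it by thirty.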
A Common Mistake: Committing The Translation Too Early
The single most damaging error in this area is not about cost; it is about trust, and it comes from ignoring the rewrite-itself behaviour. The tempting design is to translate and speak every interim guess the instant ASR produces it, because it feels maximally responsive. In a two-person English-to-Spanish test where the sentences are short and the word order is similar, it even looks fine.
Then it meets a real sentence in a language that reorders words, and it falls apart. The system speaks "I have the package" aloud, then the German verb arrives and the correct sentence turns out to be "I sent the package back" — but the TTS voice already committed to the wrong meaning, and you cannot un-speak it. The listener hears a confident, fluent, wrong translation, which is worse than a slightly slower correct one, because they have no way to know it was provisional. The fix is the read-write discipline from earlier: make the speech path wait for a stable, final segment before it speaks, even though that costs a second or two, and reserve the fast interim updates for the caption path, where replacing an in-progress line is invisible and harmless. Responsiveness is not the goal; trustworthy responsiveness is. A translation users cannot rely on is worse than no translation, because it fails silently.
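The split between a fast caption path and a cautious speech path can be sketched as a small router. It assumes the pipeline labels each translated segment as interim or final; `TranslationRouter` and its callbacks are hypothetical names, and the callbacks would be wired to your caption transport and TTS engine.

```python
from typing import Callable, List, Tuple


class TranslationRouter:
    """Send every update to captions, but only final segments to speech."""

    def __init__(self, show_caption: Callable[[str, bool], None],
                 speak: Callable[[str], None]) -> None:
        self._show_caption = show_caption
        self._speak = speak

    def on_segment(self, text: str, is_final: bool) -> None:
        # Captions can be revised, so they get interim guesses immediately.
        self._show_caption(text, is_final)
        # Speech cannot be un-spoken, so it waits for a stable segment.
        if is_final:
            self._speak(text)


caption_log: List[Tuple[str, bool]] = []
spoken: List[str] = []

router = TranslationRouter(
    show_caption=lambda text, final: caption_log.append((text, final)),
    speak=spoken.append,
)
router.on_segment("I have the package", is_final=False)
router.on_segment("I sent the package back", is_final=True)
```

The caption path saw both versions and could replace the first with the second; the listener only ever heard the correct one.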
Build With A Framework, Buy An API, Or Both
Once the pattern is settled, the practical question is what you assemble it from, and the layers are independent.
The media layer is the call itself. You can run an open-source SFU — mediasoup, Janus, LiveKit's server — and own the audio tap and the track injection yourself, or use a hosted real-time platform that exposes participants' audio and lets you publish new tracks back with less plumbing. LiveKit in particular treats a server-side translation agent as a first-class citizen, which is why it shows up repeatedly in production translation builds.
The pipeline layer is the translation engine, and here you are almost always buying or self-hosting a model rather than writing one. The cascade option means choosing a streaming ASR engine, a machine-translation engine, and a streaming TTS voice, each swappable. The direct option means a single speech-to-speech model: Meta's self-hostable Seamless for data residency and zero per-minute fees at scale, or the OpenAI Realtime and Gemini Live APIs for a managed path, with Gemini Live notably cheap per minute in 2026. The honest decision rule from our vendor comparison is volume-driven: below roughly sixty concurrent streams the cloud APIs win on total cost; far above that, self-hosting a model pays back, though the move to self-hosting is usually decided by compliance rather than dollars.
The support layer is everything around the raw pipeline that decides whether it works in the real world: cleaning the audio first with noise suppression so ASR hears words instead of café clatter; gating with voice activity so cost tracks speakers; a glossary of your product's proper nouns so "Fora Soft" is never translated as a soft cushion; and a consent and disclosure story for any voice cloning, which is increasingly a legal requirement rather than a nicety. The pipeline is the spine; these are the muscles that keep it upright.
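One common way to apply a glossary in a cascade is to mask protected terms before the text reaches MT and restore them afterwards. The sketch below shows that idea with a fake translator. It assumes the MT engine passes the placeholder tokens through unchanged, which is not guaranteed for every engine, and many commercial engines offer their own glossary feature that would replace this step. All names are illustrative.

```python
from typing import Callable, Dict, Tuple


def protect_terms(text: str, glossary: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Swap glossary terms for opaque tokens the translator should leave alone."""
    placeholders: Dict[str, str] = {}
    # Longest terms first so a short term never splits a longer one.
    terms = sorted(glossary, key=len, reverse=True)
    for index, term in enumerate(terms):
        if term in text:
            token = f"__TERM{index}__"
            text = text.replace(term, token)
            placeholders[token] = glossary[term]
    return text, placeholders


def restore_terms(text: str, placeholders: Dict[str, str]) -> str:
    """Put the approved target-language rendering back in place of each token."""
    for token, target in placeholders.items():
        text = text.replace(token, target)
    return text


def translate_with_glossary(text: str, translate: Callable[[str], str],
                            glossary: Dict[str, str]) -> str:
    masked, placeholders = protect_terms(text, glossary)
    return restore_terms(translate(masked), placeholders)


def fake_mt(text: str) -> str:
    # A toy "translator" that would mangle the brand name if it saw it.
    return text.replace("built by", "creado por").replace("Soft", "Suave")


unprotected = fake_mt("built by Fora Soft")
protected = translate_with_glossary("built by Fora Soft", fake_mt, {"Fora Soft": "Fora Soft"})
```

This only works because the cascade exposes text between stages; a direct speech-to-speech model offers no equivalent seam.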
Where Fora Soft Fits In
We build the live-video products where translation removes a hard barrier — global webinar and conferencing platforms, tele-medicine consultations across languages, e-learning for international cohorts, and remote simultaneous interpreting tools — and we build the translation on the SFU-side pipeline this article describes because it is the version that survives a real customer's bill and a real customer's auditor. The discipline we apply is the one argued here: translate each speaker once at the media server you already run, gate the pipeline with voice activity so cost tracks speakers rather than seats, fan the result out per listener language as captions on the data channel and optional ducked audio, and tune the read-write lag so the translation is honest rather than fast-and-wrong. Because we work in healthcare and education, we plan from the start for the parts that are easy to forget — a written record for audit, a glossary for domain terms, and a consent path for any synthesized voice — rather than bolting them on after launch.
What To Read Next
- Real-time multilingual speech translation in calls
- Speech-to-speech — Realtime API, Gemini Live, SeamlessM4T
- Live captions — the SFU-side ASR fan-out pattern
Talk To Us / See Our Work / Download
- Talk to a video engineer about adding real-time translation to your WebRTC product → /livekit-ai-agent-development-experts
- See our case studies in conferencing, e-learning, telemedicine, and remote interpreting → /cases
- Download the Real-Time Translation Engineering Cheat Sheet (one page, printable) → Download the cheat sheet
References
- ITU-T. Recommendation G.114 - One-way transmission time, International Telecommunication Union, Telecommunication Standardization Sector, May 2003, accessed 2026-06-02. Primary standards source for the conversational latency thresholds: a comfortable one-way mouth-to-ear delay below 150 ms for untranslated voice, with degradation as delay grows. Used as the lower anchor of the latency-budget thresholds; translation deliberately runs far above it, which is why the section contrasts the two. https://www.itu.int/rec/T-REC-G.114
- IETF. RFC 8831 - WebRTC Data Channels, Standards Track, January 2021, accessed 2026-06-02. Primary standards source for the caption-delivery transport: data channels over SCTP, with reliability and ordering as per-channel properties. Establishes how translated captions are fanned out to participants and why they are usually sent reliable-and-ordered. https://www.rfc-editor.org/rfc/rfc8831
- IETF. RFC 6716 - Definition of the Opus Audio Codec, Standards Track, September 2012, accessed 2026-06-02. Primary standards source for the audio codec that carries both the original call voice and the injected translated-audio track in a WebRTC session. Cited for the "translated audio as a new Opus track" delivery path. https://www.rfc-editor.org/rfc/rfc6716
- ISO. ISO 24019:2022 - Simultaneous interpreting delivery platforms - Requirements and recommendations, International Organization for Standardization, 2022, accessed 2026-06-02. Primary standards source for the requirements of platforms that deliver simultaneous interpretation (sound and image quality, channel handling, interpreter working conditions). Anchors the article's treatment of translated-audio delivery and per-listener-language channels in the controlling interpreting standard rather than a vendor blog. https://www.iso.org/standard/77590.html
- ISO. ISO 20108:2017 - Simultaneous interpreting - Quality and transmission of sound and image input - Requirements, International Organization for Standardization, 2017, accessed 2026-06-02. Primary standards source for the sound-input quality that interpreting (human or machine) depends on, directly motivating the noise-suppression-before-ASR step in the support layer. https://www.iso.org/standard/67175.html
- Meta AI (Seamless Communication team). Seamless: Multilingual Expressive and Streaming Speech Translation, arXiv:2312.05187, December 2023, accessed 2026-06-02. Tier-3 first-party research source for the direct speech-to-speech approach: SeamlessStreaming (simultaneous S2ST), SeamlessExpressive (prosody and voice preservation), and the unified Seamless model. Basis for the cascade-vs-direct comparison and the "keeps the voice" claim. https://arxiv.org/abs/2312.05187
- Apple. Use Live Translation with your AirPods and New Apple Intelligence features (Newsroom, September 2025), accessed 2026-06-02. Vendor source for the on-device earbud case: AirPods Pro 2 and later with iOS 26 run the language models on the iPhone so conversation data stays on device; beta language coverage and the noisy-environment iPhone-mic enhancement. Used for the "where the brain sits" on-device example. https://support.apple.com/en-us/123185
- Google. Pixel Live Translate / Translate with Google Pixel Buds and Gemini-powered live translation coverage, accessed 2026-06-02. Vendor source for the phone-plus-cloud earbud case: interpreter and conversation modes, Gemini-class models, 70+ languages, and tone/cadence preservation. Used for the cloud-side "where the brain sits" example. https://store.google.com/us/magazine/pixel-live-translate
- Fora Soft. Real-Time Speech Translation Vendors in 2026: 4 Tools Compared (DeepL Voice, KUDO, Interprefy, Meta SeamlessM4T), May 2026, accessed 2026-06-02. First-party companion analysis for the build-vs-buy economics, first-chunk-latency framing (<800 ms feels live; AIIC 3–5 s ceiling; ITU-T G.114), the cascade-wins-procurement point, and per-minute cost ranges cited in the cost section. https://www.forasoft.com/blog/article/real-time-speech-translation-vendor-benchmarks
- AIIC (International Association of Conference Interpreters). Guidelines / technical standards for distance and remote simultaneous interpreting, accessed 2026-06-02. Professional-body source for the sustainable end-to-end latency ceiling of 3–5 seconds for remote simultaneous interpreting, used as the upper anchor of the latency thresholds and the human benchmark the machine budget is measured against. https://aiic.org/
- Ko, Kim, et al. / interpreting-studies literature on ear-voice span (décalage) in simultaneous interpreting, peer-reviewed, accessed 2026-06-02. Academic source for the human ear-voice span range (~2–6 s, ~3 s typical) that motivates the read-write/wait-k analogy and the "deliberate wait is larger than machine delay" point. https://benjamins.com/catalog/intp.00116.jan
- arXiv. Direct Speech to Speech Translation: A Review, arXiv:2503.04799, 2025, accessed 2026-06-02. Academic survey source for the cascade-versus-direct trade-offs (error propagation, prosody loss, latency) and the state of S2ST models (Translatotron 2, UnitY, SeamlessM4T v2) underpinning the comparison table. https://arxiv.org/abs/2503.04799


