Real-Time Multilingual Speech Translation In Calls

Why This Matters

If your product puts people from different countries in the same call (a sales meeting, a telemedicine visit, a webinar, a classroom, a support line), language is the wall, and real-time translation is the only thing that takes it down without hiring a human interpreter for every conversation. This article is for the product manager, founder, or engineering lead deciding whether to build that feature, buy it, or wait. By the end you will understand three things: the one trade-off that governs every real-time translation system, speed against accuracy, well enough to judge a vendor's demo; the four kinds of system competing for this job in 2026 and what each is actually good at; and the legal and quality traps that stay invisible in a scripted demo and surface the moment real users with real accents join. The goal is that you walk into a build-or-buy conversation able to ask the questions that matter instead of being sold on a single impressive clip.

What "Real-Time Translation In A Call" Actually Means

Start with the plainest version. You are on a video call. You speak English; the other person speaks Spanish. A real-time translation feature listens to you, works out what you said, and delivers it to the other person in Spanish — while you are still in the conversation, not after the meeting ends. Then it does the same in reverse. The phrase that captures the goal is machine interpretation: software doing, live, the job a human interpreter does in a booth at the United Nations.

It helps to separate two shapes this can take, because products mix them up and buyers should not. The first is spoken translation, where the other person hears a voice saying your words in their language — the closest thing to a human interpreter. The second is caption translation, where the other person reads your words as translated subtitles on screen. Both are "real-time translation"; they feel completely different to use, and they have different engineering costs. Spoken translation is harder because it must also produce a natural-sounding voice fast enough to keep up. We will flag which systems do which.

Think of spoken translation as a phone call routed through a simultaneous interpreter who whispers in your ear, and caption translation as watching a foreign film with live subtitles. The interpreter feels more human but can talk over people; the subtitles never interrupt but make you read instead of listen. Most 2026 products offer both and let the user choose.

The Two Ways To Build It: Chain Three Models, Or Use One

Every real-time translation feature is built one of two ways, and the choice shapes its speed, its cost, and how natural it sounds. Picture the work as three steps. First, hear: turn the incoming sound into words. Second, translate: turn those words into the other language. Third, speak: turn the translated words back into sound.

The older, more common way gives each step its own specialist model. A speech-to-text model — the software that turns spoken audio into written words, abbreviated ASR for automatic speech recognition — does the hearing. A machine-translation model — the kind of model behind a translation app — does the translating, reading the words in one language and writing them in another. A text-to-speech model, abbreviated TTS, does the speaking. Engineers call this a cascade or a pipeline, because the output of one model flows into the next like water down a series of steps.
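The cascade described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the names run_cascade, CascadeResult, fake_asr, fake_mt, fake_tts, and PHRASEBOOK are all hypothetical, and the three "models" are toy stand-ins so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    transcript: str    # output of the hearing stage (ASR)
    translation: str   # output of the translating stage (MT)
    audio: bytes       # output of the speaking stage (TTS)


def run_cascade(
    audio_in: bytes,
    asr: Callable[[bytes], str],
    mt: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> CascadeResult:
    """Chain three stages; each stage's output is the next stage's input."""
    transcript = asr(audio_in)       # hear
    translation = mt(transcript)     # translate
    audio_out = tts(translation)     # speak
    # Keep the intermediate text so every hand-off can be inspected later.
    return CascadeResult(transcript, translation, audio_out)


# Toy stand-ins (assumptions for illustration only, not real models).
PHRASEBOOK = {"good morning": "buenos días"}


def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")          # pretend the bytes are speech


def fake_mt(text: str) -> str:
    return PHRASEBOOK.get(text.lower(), text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")           # pretend the bytes are audio


result = run_cascade(b"Good morning", fake_asr, fake_mt, fake_tts)
print(result.transcript, "->", result.translation)
```

Because each stage is passed in as a plain function, any one of them can be replaced without touching the other two, and the text held in the result is what a reviewer would read to find where a bad translation went wrong.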

The newer way collapses the job into a single model that takes audio in one language and produces audio or text in another, with no separate hand-off between stages. Researchers call this end-to-end speech-to-speech translation, often shortened to S2ST. Meta's research team describes its streaming version as a model that generates "low-latency target translations without waiting for complete source utterances" — in plain words, it starts translating before you finish your sentence (Seamless Communication, 2023).

Here is the analogy that makes the trade-off click. The cascade is a relay race: three runners, each excellent at their leg, but every hand-off costs time and a baton can drop. The end-to-end model is one runner doing the whole lap: fewer hand-offs, less delay, and the runner can carry across the finish line the tone and emotion that a baton hand-off would flatten. The cost is that you can no longer swap out a single weak runner; you get the whole runner or none of it.

Figure 1. The cascade chains three specialist models with readable text between each stage; the end-to-end model does the whole translation in one pass, trading inspectability for speed and a more natural voice.

Why the cascade still wins for many teams

It is tempting to assume the newer end-to-end model is simply better. In 2026 that is not the consensus, especially in business settings. The cascaded pipeline remains the enterprise default because it is transparent and debuggable: if the translation is wrong, you can read the text at each hand-off and see exactly where it broke, whether in the hearing, the translating, or the speaking (Coval, 2026). For an audit trail or a regulated industry, that readable middle is not a nice-to-have; it is a requirement.

The cascade also lets you pick the best model for each leg (the most accurate speech-to-text, the strongest translation engine, the most natural voice), each from a different vendor, and swap any one of them without touching the rest. With modern streaming components, a well-tuned cascade reaches roughly one to two seconds of total delay from speech in to translated speech out, which is fast enough for many meetings (industry benchmarks, 2026).

Why end-to-end wins when the feeling matters

The end-to-end model earns its place on two axes the cascade struggles with: speed and humanity. Because there are fewer hand-offs, the delay is lower — recent end-to-end speech-to-speech systems run roughly 400 to 700 milliseconds, where a millisecond is one-thousandth of a second, against the cascade's one to two seconds (industry benchmarks, 2026). And because the model never flattens your voice into plain text and back, it can keep the things text throws away. Meta's expressive variant specifically preserves "vocal styles and prosody" — the rhythm, the pauses, the speaking rate that make a voice sound like yours rather than a robot's (Seamless Communication, 2023). For a companion app, a premium support line, or anywhere the emotional texture of the voice is part of the product, that is worth the loss of a readable middle.

The Trade-Off That Governs Everything: Speed Against Accuracy

Here is the single idea that, once you hold it, makes every real-time translation decision legible. A translator — human or machine — faces a dilemma at every moment: translate now with what it has heard so far, or wait for more words to be sure. Translating early keeps the conversation flowing but risks getting it wrong, because the meaning of a sentence often is not clear until the end. Waiting buys accuracy but adds delay. There is no setting that wins both; you are always trading one for the other.

This is not a quirk of software — it is the same dilemma a human interpreter manages in real time, and languages make it harder or easier. Consider German, where the verb often lands at the very end of the sentence. An interpreter translating German into English literally cannot finish the English sentence until the German verb arrives, because English needs the verb early. So the interpreter either waits (more delay) or guesses (more risk). Google's own team found exactly this: languages with close structure, like Spanish, Italian, Portuguese, and French, were easier to translate live, while "structurally different languages such as German presented greater challenges" (Google, 2025).

Engineers have a name for the rule that decides when to commit. The simplest is the wait-k policy: the system reads a fixed number of words — call it k — before it starts translating, then keeps a steady distance behind the speaker (Ma et al., 2019). A small k means low delay and more mistakes; a large k means more delay and fewer mistakes. More advanced systems use an adaptive policy that decides moment to moment whether it has heard enough to safely translate — Meta's SeamlessStreaming uses a mechanism called Efficient Monotonic Multihead Attention to do exactly this, translating "without waiting for complete source utterances" while still waiting when the sentence demands it (Seamless Communication, 2023).
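The wait-k rule is simple enough to write down. The sketch below follows the schedule given in Ma et al. (2019): before producing target word t, the system has read min(k + t - 1, source length) source words. The function names wait_k_schedule and first_step_seeing are hypothetical, and counting whole words rather than model tokens is a simplification for readability.

```python
def wait_k_schedule(source_len: int, target_len: int, k: int) -> list[int]:
    """Source words read before each target word is produced (wait-k)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Target word t (1-indexed) is produced after reading k + t - 1 source
    # words, capped at the sentence length once the speaker has finished.
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


def first_step_seeing(position: int, k: int) -> int:
    """Target step at which the source word at `position` is first visible."""
    return max(1, position - k + 1)


# A six-word German sentence whose verb is the last word (position 6).
print(wait_k_schedule(6, 6, k=2))   # [2, 3, 4, 5, 6, 6]
print(wait_k_schedule(6, 6, k=4))   # [4, 5, 6, 6, 6, 6]
print(first_step_seeing(6, k=2))    # 5
print(first_step_seeing(6, k=4))    # 3
```

With k = 2 the system starts speaking after two words but must produce four target words before it ever sees the verb; with k = 4 it starts later and sees the verb by its third word. That is the speed-against-accuracy dial in its most literal form.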

Figure 2. Accuracy rises as the system waits for more words, then flattens. The usable window for live conversation sits around two to three seconds — fast enough to feel live, slow enough to be understood.

So where is the usable window? Google's Meet team, after two years of work with DeepMind, landed on a precise answer: "We discovered that two to three seconds was sort of a sweet spot. Faster was difficult to understand; slower didn't lend itself to natural conversation" (Google, 2025). That number is worth memorizing, because it sets expectations. For comparison, the old transcribe-translate-speak chains of a few years ago ran ten to twenty seconds, long enough that natural conversation was impossible (Google, 2025). The window is also tight at its upper end: aggregated benchmarks put the point at which participants in a back-and-forth call start talking over each other at about two seconds of delay, because the rhythm of human turn-taking breaks (industry benchmarks, 2026). In a fast exchange, aim for the low end of the window.

Common pitfall. The most common mistake in a first build is translating word-for-word. Google's team noted their early model "translates most expressions literally, which can lead to amusing misunderstandings" (Google, 2025) — an idiom like "break a leg" comes out as an instruction to injure yourself. A literal translator looks fine in a demo of simple sentences and fails the moment real people use idioms, sarcasm, or industry jargon. The fix is a translation model with enough language understanding to translate meaning, not words — which is exactly why the field is moving toward larger models that grasp context, tone, and even irony.

The Four Kinds Of System That Define 2026

With the architecture and the speed-accuracy trade-off in hand, the 2026 landscape sorts into four kinds of system. One is a feature built into a platform you may already use; one is a model you rent through an API; one is an open model you host yourself; one is a category of enterprise interpretation service.

Built into the platform — Google Meet

Google Meet shipped real-time spoken translation to general users, the first of the big conferencing platforms to do so at scale. It uses a Gemini-based model that, in Google's description, listens to a speaker and delivers the translation "in a voice like yours," running in the two-to-three-second window the team identified (Google, 2025). As of early 2026 it had reached general availability, focused on bidirectional translation between English and a handful of major languages — Spanish, French, German, Portuguese, and Italian — with more in development (Google Workspace, 2026). The shape to notice: it is a managed feature, not something you build, and it keeps a faint echo of the original voice underneath the translation so the conversation still feels human.

A model you rent — OpenAI's gpt-realtime-translate

In May 2026 OpenAI released a model built specifically for this job: gpt-realtime-translate, a streaming speech-to-speech translation model reachable through a dedicated translation endpoint (OpenAI, 2026). Three details make it notable for anyone building a product. First, reach: it accepts more than 70 input languages and produces translated audio in 13 (OpenAI, 2026). Second, it was trained on thousands of hours of professional interpreter audio, which teaches it to stay translation-only — to interpret rather than answer — and to wait for enough context before it speaks (OpenAI, 2026). Third, it uses dynamic voice adaptation: rather than a fixed output voice, the translated speech follows the source speaker's tone, pitch, and style, and in a multi-speaker session the voice shifts as each new speaker comes in (OpenAI, 2026). It is priced at about $0.034 per minute of translation, and the standard pattern is one translation session per output language — so a meeting translated into three languages runs three sessions (OpenAI, 2026).

An open model you host — Meta's SeamlessStreaming

Meta's Seamless family is the open-weights option, meaning the trained model files are published for you to download and run on your own hardware. SeamlessStreaming is the streaming member built for live use; it is, in Meta's words, "the first of its kind" to enable simultaneous speech-to-speech and speech-to-text translation across many languages at once, using the adaptive attention mechanism described above to translate before a sentence finishes (Seamless Communication, 2023). It sits on the foundation model SeamlessM4T v2, which understands roughly 100 languages as speech input (Hugging Face, 2026). One strength and one caveat matter. The strength: its streaming latency, about two seconds, is competitive with human interpreters. The caveat: the foundation model carries a non-commercial licence (CC-BY-NC), so shipping it inside a product you sell requires a separate arrangement with Meta (Hugging Face, 2026). Treat Seamless as the open reference design and confirm licensing before you build a business on it.

A service you contract — enterprise interpretation platforms

For high-stakes events — board meetings, multilingual conferences, earnings calls — a category of platforms sits between pure AI and human interpreters. KUDO, Interprefy, and Wordly each offer real-time AI translation and live captions, with the option to bring human interpreters into the loop for accuracy that AI alone cannot yet guarantee. Interprefy, a pioneer of remote simultaneous interpretation, integrates with more than 80 meeting platforms and advertises thousands of language combinations (Interprefy, 2026). KUDO reported that meetings using its AI speech translation and captions grew sharply year over year (KUDO, 2026). The pattern: when a mistranslation has real consequences, buyers pay for a platform that can fall back to a human, rather than trusting a model alone.

Figure 3. The four kinds of system, lined up on what they are, how you get them, how they handle the voice, language reach, and where each fits best.

How It Rides Inside A Video Call

Knowing which model to use is half the job; the other half is getting the audio to and from it without wrecking the call's own latency. A video call already has a delivery system, and the translation has to ride on top of it.

Most modern video calls route audio and video through a selective forwarding unit, abbreviated SFU — a server that receives each participant's stream and forwards it to the others without mixing or re-encoding, which keeps the call's own delay low (Fora Soft, 2026). The translation feature plugs into this server as what engineers in 2026 call a sidecar: a separate component that taps the audio stream, sends it to the translation model, and injects the translated audio or captions back as a new track (industry trends, 2026). The original speaker's audio keeps flowing untouched; the translation arrives as an extra channel the listener can choose.

Here is why this design matters for your budget and your build. Because the translation is a separate track, each listener can pick their own language, and you pay for one translation session per target language rather than per participant. A 50-person webinar translated into Spanish, French, and German is three translation sessions feeding three audio tracks, not 50 — the SFU fans each translated track out to everyone who wants it. That is the same fan-out logic that makes live captions on the SFU side efficient, and the broader transport choices behind it are covered in the speech-to-speech lesson.
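The fan-out rule can be checked with a short sketch. It assumes each listener declares one preferred language and that listeners who already understand the speaker's language need no translated track; the function name plan_translation_sessions and the language codes are hypothetical, and no real SFU or translation API is called.

```python
from collections import Counter


def plan_translation_sessions(
    listener_languages: list[str], source_language: str
) -> dict[str, int]:
    """Map each needed target language to the number of listeners on it.

    One translation session is opened per key; the call server then
    forwards that single translated track to every listener who chose it.
    """
    counts = Counter(
        lang for lang in listener_languages if lang != source_language
    )
    return dict(counts)


# A 50-person webinar with an English-speaking presenter.
listeners = ["es"] * 20 + ["fr"] * 15 + ["de"] * 10 + ["en"] * 5
plan = plan_translation_sessions(listeners, source_language="en")
sessions_needed = len(plan)

print(plan)              # {'es': 20, 'fr': 15, 'de': 10}
print(sessions_needed)   # 3
```

The number of sessions depends only on how many distinct languages are requested. Doubling the audience to 100 people who want the same three languages still yields three sessions.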

Figure 4. The translation sidecar taps the call server's audio, runs it through a translation model, and feeds translated tracks back. Listeners pick a language; you pay per language, not per person.

The Earbud Form Factor — Translation Off The Screen

Not every translated conversation happens in a conferencing app. The fastest-growing consumer version of this technology lives in earbuds, and it is worth understanding because buyers increasingly expect the same magic on a call. By 2026, live translation runs on flagship earbuds across the market — Google's Pixel Buds, Samsung's Galaxy Buds, and Apple's AirPods all offer some form of it (SoundGuys, 2026).

The notable shift in 2026 is that Google decoupled the feature from any specific hardware: live audio translation now works through any Bluetooth or wired headphones on Android, powered by a Gemini model running native audio with what Google describes as enhanced understanding of tone and cadence across more than 70 languages (TechRadar, 2026). For your product, the lesson is not that you should build earbuds — it is that your users have now experienced near-instant translation in their own ears, and a clunky, ten-second translation inside your app will feel broken by comparison. The bar has moved.

The Safety Layer Nobody Demos

A real-time translator that speaks in the user's own voice is, technically, a voice-cloning system running live — and that raises two real obligations that never appear in a demo.

The first is consent for the cloned voice. When the system reproduces a speaker's vocal style in another language, it is synthesizing that person's voice, and the same consent rules that govern any voice cloning apply. The legal and engineering details — including the NO FAKES Act consent framework — are covered in the ElevenLabs and voice-cloning lesson; the short version is that you must have a clear basis to clone a participant's voice, and "they joined the call" is not automatically it.

The second is provenance — proving a piece of audio was machine-generated. Meta built this into Seamless from the start with "an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes," along with toxicity detection that catches the model adding offensive content that was not in the original (Seamless Communication, 2023). For a translation feature, that toxicity check is not abstract: a model that mistranslates a neutral phrase into a slur creates a real incident, and detecting added toxicity at the output is a genuine safeguard, not paperwork.
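To make "added toxicity" concrete, here is a deliberately simple word-list sketch. It is not Meta's detector, and a production check would need far more than word matching. The function names and both lexicons are hypothetical placeholders. The idea it illustrates is only the comparison: flag the output when it contains a listed term and the source contained none.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased word set; a crude stand-in for real tokenization."""
    return set(re.findall(r"\w+", text.lower()))


def added_flagged_terms(
    source_text: str,
    translated_text: str,
    source_lexicon: set[str],
    target_lexicon: set[str],
) -> set[str]:
    """Return flagged target-language words that the source did not warrant."""
    source_hits = words(source_text) & source_lexicon
    output_hits = words(translated_text) & target_lexicon
    # If the speaker really used a flagged word, a flagged output is a
    # faithful translation, not something the model added.
    return output_hits if not source_hits else set()


# Placeholder lexicons (assumptions; real lists are curated per language).
SOURCE_LEXICON = {"rudeword"}
TARGET_LEXICON = {"palabrafea"}

flagged = added_flagged_terms(
    "Have a nice day",
    "Que tengas un día palabrafea",
    SOURCE_LEXICON,
    TARGET_LEXICON,
)
print(flagged)   # {'palabrafea'}
```

A non-empty result is the signal to hold or re-generate the translated segment before it reaches the listener.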

What A Minute Of Translation Actually Costs

Pricing is where the architecture choice meets the budget, so it is worth doing the arithmetic out loud once. Take OpenAI's gpt-realtime-translate at about $0.034 per minute of translation, and remember the rule: one session per target language (OpenAI, 2026). Now price a one-hour webinar translated into three languages:

one session per language        = 3 sessions (Spanish, French, German)
minutes per session             = 60 minutes
cost per minute                 = $0.034
────────────────────────────────────────────────
3 × 60 × $0.034                 = $6.12 for the hour

Six dollars and twelve cents to make a one-hour webinar understandable in three extra languages. Set that against the alternative: a human simultaneous interpreter typically bills by the half-day at a rate orders of magnitude higher, and you need at least one per language pair. The reason AI translation is spreading becomes obvious. The honest caveat: the human still wins on accuracy for high-stakes content, which is exactly why the enterprise platforms keep a human in the loop. The discipline is to model your own volume and your own accuracy needs, the same way the lesson on the broader cost model for AI in video products lays out. The one-page checklist below packages this decision so you can run it in a meeting.
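The same arithmetic, as a function you can rerun with your own volume. This is a sketch under the article's stated assumptions: a flat per-minute rate of about $0.034 and one session per target language. The function name translation_cost is hypothetical, and real invoices may differ.

```python
def translation_cost(
    minutes: float, target_languages: int, rate_per_minute: float = 0.034
) -> float:
    """Estimated cost in dollars: one session per target language."""
    if minutes < 0 or target_languages < 0:
        raise ValueError("minutes and target_languages must be non-negative")
    return round(target_languages * minutes * rate_per_minute, 2)


# The one-hour, three-language webinar from the worked example.
print(translation_cost(minutes=60, target_languages=3))   # 6.12
```

Note what is absent from the signature: the number of attendees. Under the per-language session model, audience size does not enter the estimate.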

Where Fora Soft Fits In

Real-time translation is not a single feature but a pattern that shows up across the products we build. In video conferencing it is a live interpreter between participants who never share a language, delivered as either a spoken track or translated captions. In telemedicine it lets a clinician and a patient consult across a language barrier, where a mistranslation has real clinical weight and a human fallback matters. In e-learning it opens a course recorded in one language to learners worldwide. In live broadcast and OTT it becomes live dubbing of an event as it happens. The engineering judgment in each case is the same one this article frames: cascade or end-to-end, how long to wait before committing a translation, which of the four kinds of system fits, and how the audio rides the call — and that judgment, made early, is what separates a feature that genuinely connects people from one that produces amusing nonsense the moment a real accent arrives.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your real-time translation plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Real-Time Translation Architecture Decision Checklist — One-page printable: cascade-vs-end-to-end, the 2-3 second latency window, the speed-vs-accuracy trade-off, the WebRTC SFU sidecar pattern, the consent and provenance checks, and a side-by-side of Google Meet, OpenAI….

References

Google. How AI made Meet's language translation possible, 11 September 2025, accessed 2026-05-31. https://blog.google/products-and-platforms/products/workspace/google-meet-langauge-translation-ai/. First-party source for: the old transcribe-translate-speak chain running 10–20 seconds; the "one-shot" large-model breakthrough where audio in produces audio out almost immediately; the "two to three seconds was sort of a sweet spot" finding (faster is hard to understand, slower breaks natural conversation); translation delivered "in a voice like yours"; closely related languages (Spanish, Italian, Portuguese, French) being easier than structurally different ones (German); and the early model translating expressions literally with the expectation that advanced LLMs will capture tone and irony. Tier 3 (first-party engineering blog).
Google Workspace. Control Speech Translation in Google Meet for your users and Meet Help: Learn about Speech Translation, January 2026, accessed 2026-05-31. https://workspaceupdates.googleblog.com/2026/01/control-speech-translation-in-meet.html and https://support.google.com/meet/answer/16221730. First-party source for general availability of Meet speech translation in early 2026 and the supported English-paired languages (Spanish, French, German, Portuguese, Italian). Tier 3.
OpenAI. Realtime translation guide, gpt-realtime-translate model page, and Advancing voice intelligence with new models in the API, May 2026, accessed 2026-05-31. https://developers.openai.com/api/docs/guides/realtime-translation, https://developers.openai.com/api/docs/models/gpt-realtime-translate, and https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/. First-party source for: gpt-realtime-translate as a streaming speech-to-speech translation model on the dedicated /v1/realtime/translations endpoint; 70+ input languages and 13 output languages; training on thousands of hours of professional interpreter audio to stay translation-only and wait for context; dynamic voice adaptation following the source speaker's tone and shifting per speaker; pricing ~$0.034/min; and one translation session per output language. Tier 3.
Seamless Communication (Meta AI). Seamless: Multilingual Expressive and Streaming Speech Translation, arXiv:2312.05187, 8 December 2023, accessed 2026-05-31. https://arxiv.org/abs/2312.05187. Primary academic source for: SeamlessM4T v2 with the UnitY2 framework as the foundation model; SeamlessStreaming using Efficient Monotonic Multihead Attention to generate low-latency translations without waiting for complete source utterances; SeamlessExpressive preserving vocal style and prosody (speech rate, pauses); and the safety layer — red-teaming, added-toxicity detection and mitigation, gender-bias evaluation, and inaudible localized watermarking to dampen deepfakes. Tier 5 (peer-style research publication; itself the primary source for the algorithm).
Hugging Face. SeamlessM4T-v2 model documentation and facebook/seamless-m4t-v2-large model card, accessed 2026-05-31. https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2 and https://huggingface.co/facebook/seamless-m4t-v2-large. Source for the ~100-language speech-input coverage, the ~2-second streaming latency comparable to human interpreters, and the CC-BY-NC (non-commercial) licence. Tier 4.
Ma, Mingbo et al. STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework, ACL 2019. https://arxiv.org/abs/1810.08398. Primary source for the wait-k policy — reading k source tokens before beginning to translate, then maintaining a fixed distance behind the speaker — and the fixed-versus-adaptive latency-policy distinction. Tier 5.
Coval. Speech-to-Speech vs Cascaded Voice AI: Which Architecture Should You Deploy?, 2026, accessed 2026-05-31. https://www.coval.ai/blog/speech-to-speech-vs-cascaded-voice-ai-which-architecture-should-you-deploy. Independent source for the cascade remaining the 2026 enterprise default for transparency, debuggability, and component-swapping, and for the audit-trail/compliance argument. Tier 4.
Interprefy and KUDO (vendor sources). Live translation options (Interprefy) and KUDO 2026 AI-translation usage reporting, accessed 2026-05-31. https://www.interprefy.com/resources/blog/how-to-enhance-new-live-translation-options-for-google-meets. Source for the enterprise remote-simultaneous-interpretation category: Interprefy's 80+ platform integrations and thousands of language combinations; KUDO's year-over-year growth in AI translation and captioned meetings; the human-in-the-loop fallback pattern. Tier 6 (vendor marketing; used for category framing only, not for any spec claim).
SoundGuys. The best translation earbuds in 2026 and Google adds real-time audio translation to any headphones, accessed 2026-05-31. https://www.soundguys.com/best-translation-earbuds-2-152438/. Source for the 2026 earbud-translation market spanning Google Pixel Buds, Samsung Galaxy Buds, and Apple AirPods. Tier 6.
TechRadar. Google smashes language barriers with live translation for any earbuds on Android, 2026, accessed 2026-05-31. https://www.techradar.com/computing/software/google-smashes-language-barriers-with-live-translation-for-any-earbuds-on-android-heres-how-it-works. Source for Google decoupling live audio translation from specific hardware (any Bluetooth/wired headphones on Android), powered by a Gemini native-audio model across 70+ languages. Tier 6.
Fora Soft. WebRTC Architecture for Production: SFU, MCU, MoQ Guide, accessed 2026-05-31. https://www.forasoft.com/learn/webrtc-architecture-production-systems. First-party reference for the selective-forwarding-unit (SFU) model — forwarding each participant's stream without mixing or re-encoding to keep latency low — that the translation sidecar plugs into. Tier 3.
Industry benchmarks and trends, 2026 (aggregated). Real-time translation latency bands — end-to-end speech-to-speech ~400–700 ms, cascaded ~1–2 s, and the >2-second point at which participants begin talking over each other — and the 2026 "AI sidecar" media-pipeline pattern. Used for latency-band orientation; where these aggregated figures touch a spec-level claim, the article defers to the first-party and academic sources above. Tier 6.