Speech-To-Speech — Realtime API, Gemini Live, SeamlessM4T

Why This Matters

If your product is going to hold a spoken conversation — a meeting assistant that talks back, a telemedicine intake line, a customer-support voice agent, a live interpreter inside a video call, or a language tutor — you face one early decision that sets your latency, your bill, and how much you can debug later. That decision is whether to use a single speech-to-speech model or to chain a speech-to-text model, a language model, and a text-to-speech model yourself. This article is for the product manager, founder, or engineering lead making that call. By the end you will understand why "under 800 milliseconds voice-to-voice" is the number that decides whether the conversation feels alive, how the three leading 2026 systems differ on speed, cost, and openness, and the specific engineering traps that look fine in a demo and break in production. The goal is that you can walk into a build-or-buy conversation and ask the right questions instead of being sold on a single headline.

What "Speech-To-Speech" Actually Means

Start with the plainest definition. A speech-to-speech system, often shortened to S2S, takes spoken sound in and gives spoken sound back. You talk; it talks. The phrase covers two quite different jobs that share that shape, and it is worth separating them now because the rest of the article keeps returning to the split.

The first job is conversation: you ask a question in English and the system answers your question in English, like a voice assistant. The second job is translation: you speak English and the system says the same meaning in Spanish, like a live interpreter. Both take voice in and give voice out, so both are "speech-to-speech", but the machinery and the leading products differ. OpenAI's Realtime API and Google's Gemini Live are built mainly for the conversation job; Meta's SeamlessM4T is built for the translation job. We will keep flagging which is which.

Think of the conversation job as a phone call with a clever colleague, and the translation job as a phone call routed through a simultaneous interpreter. In both cases you never see a transcript on screen — the point is that sound goes in and sound comes out. Whether a transcript exists inside the system, and whether anyone can read it, is exactly the architectural question we turn to next.

The Two Ways To Build It: Chain Three Models Or Use One

Every speech-to-speech product is built one of two ways, and the choice colors everything else. Picture the work as three steps. First, hear: turn the incoming sound into something the machine can reason about. Second, think: decide what to say. Third, speak: turn that decision back into sound.

The older, more common way gives each step its own specialist model. A speech-to-text model — the software that turns spoken audio into written words, abbreviated ASR for automatic speech recognition — does the hearing. A language model — the kind of model behind a chatbot — does the thinking, reading the text and writing a reply in text. A text-to-speech model, abbreviated TTS, does the speaking, turning that reply back into audio. Engineers call this a cascade or a pipeline, because the output of one model flows into the next like water down a series of steps.

The newer way collapses all three into a single model that takes audio in and produces audio out directly, with no readable text required in between. OpenAI calls the result a "speech-to-speech model" and notes that, unlike traditional pipelines that chain speech-to-text and text-to-speech, it "processes and generates audio directly through a single model and API", which "reduces latency, preserves nuance in speech, and produces more natural, expressive responses" (OpenAI, 2025). Researchers call this an end-to-end model, because one model spans the whole job end to end.

Here is the analogy that makes the trade-off click. The cascade is a relay race: three runners, each excellent at their leg, but every handoff costs time and a baton can drop. The end-to-end model is one runner doing the whole lap: fewer handoffs, faster, and the runner can carry tone and emotion across the finish line that a baton handoff would lose. The cost is that you can no longer swap out one weak runner — you get the whole runner or none of it.

Figure 1. The cascade chains three specialist models with readable text between each; the end-to-end model does the whole job in one pass, trading inspectability for speed and naturalness.
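The structural difference can be sketched in a few lines. This is an illustration only: `fake_asr`, `fake_llm`, `fake_tts`, and `fake_s2s` are hypothetical stand-ins rather than real models, and the function names are assumptions made for the example. What it shows is that the cascade leaves two readable text values you can log and inspect, while the end-to-end call returns audio only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    transcript: str   # readable output of the "hear" stage
    reply_text: str   # readable output of the "think" stage
    audio_out: bytes  # output of the "speak" stage


def run_cascade(audio_in: bytes,
                asr: Callable[[bytes], str],
                llm: Callable[[str], str],
                tts: Callable[[str], bytes]) -> CascadeResult:
    """Chain three specialist models; each handoff is plain text."""
    transcript = asr(audio_in)
    reply_text = llm(transcript)
    audio_out = tts(reply_text)
    return CascadeResult(transcript, reply_text, audio_out)


def run_end_to_end(audio_in: bytes, s2s: Callable[[bytes], bytes]) -> bytes:
    """One model, audio in and audio out, no text exposed in the middle."""
    return s2s(audio_in)


# Hypothetical stand-ins so the sketch runs without any real model.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_s2s(audio: bytes) -> bytes:
    return fake_tts(fake_llm(fake_asr(audio)))


result = run_cascade(b"hello", fake_asr, fake_llm, fake_tts)
print(result.transcript, "|", result.reply_text)
print(run_end_to_end(b"hello", fake_s2s))
```

Because each stage is passed in as a plain callable, swapping one vendor's speech-to-text for another is a one-argument change in the cascade, which is the component freedom the relay-race analogy describes.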

Why the cascade still wins for many teams

It would be easy to assume the newer end-to-end model is simply better. In 2026 that is not the consensus. For most production deployments the cascaded pipeline remains the default, because it gives you transparency, debuggability, and the freedom to swap any component without touching the rest (Coval, 2026). If the system says the wrong thing, you can read the transcript at each stage and see exactly where it went wrong — the hearing, the thinking, or the speaking. With an end-to-end model there is no readable transcript in the middle to inspect, so a mistake is harder to trace.

The cascade also lets you pick the best model for each leg: the most accurate speech-to-text, the smartest language model, the most natural voice, each from a different vendor. The end-to-end model is a single vendor's take on all three at once, which means you live with its weakest leg and you accept what engineers call vendor lock-in — the cost of being unable to leave one supplier easily.

Why end-to-end wins when the feeling matters

The end-to-end model earns its place on two axes the cascade struggles with: speed and humanity. Because there are no handoffs, the delay is lower. And because the model never flattens your voice into plain text and back, it keeps the things text throws away — a laugh, a hesitation, a rising question tone, an accent. End-to-end speech-to-speech models respond immediately and naturally handle turn-taking, backchanneling, and interruption, while cascaded pipelines deliver stronger responses at the cost of latency that grows with model size (Coval, 2026). For a premium support line, a companion app, or anywhere the emotional texture of the voice is the product, that naturalness is worth the loss of inspectability.

The Number That Decides Whether It Feels Alive

There is one metric that separates a conversation that feels real from one that feels like a phone menu, and it is easy to measure the wrong thing. The right metric is voice-to-voice latency: the gap between the moment you stop speaking and the moment you hear the first sound of the reply. It is the silence the other party actually experiences.

Why does this exact number matter so much? Because human conversation has a built-in rhythm. When one person stops talking, the other normally begins within about two-tenths of a second — roughly 200 milliseconds, where a millisecond is one-thousandth of a second. That gap is so consistent across languages and cultures that our ears treat anything much longer as a hesitation or a dropped line. Push the reply gap past a second and the conversation stops feeling like a conversation.

So what is achievable today? OpenAI reports that around 800 milliseconds voice-to-voice is achievable with its Realtime API, with the time for the first byte to come back from the model running about 500 milliseconds in US regions, which leaves roughly 300 milliseconds for capturing the microphone, detecting that you stopped, sending audio over the network, and playing the reply (OpenAI, 2026). Notice the shape of that budget. The model takes about five-eighths of it. The remaining three-eighths is everything around the model, and that is the part teams forget.

Let us walk the budget out loud, because seeing the arithmetic makes the point. Take the achievable total and subtract the model's share:

total voice-to-voice budget      = 800 ms   (target for "feels live")
   model time-to-first-byte      − 500 ms   (the API's own latency, US region)
   ────────────────────────────────────────
   everything else               = 300 ms   (capture + end-of-turn detection
                                              + network round trips + playback)

That 300-millisecond remainder has to cover the microphone buffer, the moment of deciding you finished speaking, the trip to the data center and back, and starting playback in the listener's ear. If your users are far from the model's region, the network alone can eat most of it. This is why "the model is fast" is never the whole answer — and why the connection you choose, which we cover next, moves the needle as much as the model does.

Figure 2. An 800 ms voice-to-voice budget. The model takes about 500 ms; capture, end-of-turn detection, network, and playback share the remaining 300 ms.
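The same budget can be written as a small calculation. The 800 ms target and 500 ms model share come from the figures quoted above; the capture, end-of-turn, and playback values in the usage line are invented placeholders, so substitute your own measurements.

```python
TARGET_MS = 800       # voice-to-voice target quoted above
MODEL_TTFB_MS = 500   # model time-to-first-byte quoted above (US region)


def remaining_budget_ms(target_ms: int = TARGET_MS,
                        model_ttfb_ms: int = MODEL_TTFB_MS) -> int:
    """Milliseconds left for everything that is not the model."""
    return target_ms - model_ttfb_ms


def network_allowance_ms(capture_ms: int,
                         end_of_turn_ms: int,
                         playback_ms: int) -> int:
    """What is left for network round trips once the other parts are paid for.

    A negative result means the budget is already blown before the network.
    """
    return remaining_budget_ms() - capture_ms - end_of_turn_ms - playback_ms


# Placeholder measurements (assumptions, not benchmarks).
print(remaining_budget_ms())              # 300
print(network_allowance_ms(20, 150, 30))  # 100
```

With these placeholder numbers only 100 ms is left for the network, which shows how quickly a distant region can push the total past the target.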

Three Ways To Connect: Browser, Server, Or Phone

A speech-to-speech model lives in a data center; your user's voice starts at a microphone somewhere else. The path between them — the transport — is a real engineering choice with real latency consequences, and the leading APIs give you three options. OpenAI's Realtime API supports all three: WebRTC, WebSocket, and SIP (OpenAI, 2025).

The first is WebRTC, a browser technology built specifically for live audio and video between two points, designed from the start to keep delay low. Use it when your audio is captured or played in a browser or mobile app — that is, when the user's own device is one end of the call. OpenAI's guidance is direct: for browser and mobile clients that capture or play audio directly, use WebRTC (OpenAI, 2026).

The second is WebSocket, a simpler persistent two-way connection, used here between your server and the model. Use it when your own server already has the audio (for example, a phone system or a broadcast feed flowing into your backend) and you want to forward it to the model. OpenAI's rule of thumb: for server media pipelines such as phone calls or broadcast ingest, use WebSockets (OpenAI, 2026).

The third is SIP, the signaling protocol that sets up calls in internet telephony, the same plumbing behind most business desk phones. OpenAI added direct SIP support so a voice agent can answer an actual phone call, connect to office phone systems, and reach other telephone endpoints without you building the bridge yourself (OpenAI, 2025). If your product is a phone line, this is the door.

Figure 3. Pick the transport by where the audio lives: the user's device (WebRTC), your server (WebSocket), or the phone network (SIP).
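The rule of thumb reduces to a lookup keyed on where the audio lives. The enum and function names below are hypothetical, chosen only for this sketch; the mapping itself restates the guidance above.

```python
from enum import Enum


class AudioSource(Enum):
    USER_DEVICE = "user_device"      # browser or mobile app captures/plays audio
    SERVER = "server"                # your backend already holds the audio stream
    PHONE_NETWORK = "phone_network"  # the call arrives over telephony


TRANSPORT_FOR = {
    AudioSource.USER_DEVICE: "WebRTC",
    AudioSource.SERVER: "WebSocket",
    AudioSource.PHONE_NETWORK: "SIP",
}


def choose_transport(source: AudioSource) -> str:
    """Return the transport the rule of thumb suggests for this audio source."""
    return TRANSPORT_FOR[source]


print(choose_transport(AudioSource.USER_DEVICE))
```

Treat this as a starting default rather than a constraint: a product with both a web client and a phone line will usually end up using more than one transport.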

The Hardest Part Nobody Demos: Knowing When To Talk

A demo recorded by one polite person taking neat turns hides the single hardest problem in voice agents: knowing when the human has finished, and what to do when the human cuts in. Get this wrong and the agent either talks over people or sits in awkward silence. Two pieces of machinery handle it.

The first is voice activity detection, abbreviated VAD — a small, fast component whose only job is to tell speech apart from silence and background noise. It is how the system notices you started talking and, crucially, notices you stopped. Detecting the stop is called end-of-turn detection, and it is the trigger that tells the agent "your move". Set the silence threshold too short and the agent interrupts you mid-thought; set it too long and every reply feels sluggish. Combining VAD with the speech-to-text model's confidence improves the accuracy of that judgment (Coval, 2026). Background noise makes this harder, which is why teams often pair a speech-to-speech feature with the kind of real-time noise suppression covered elsewhere in this section — cleaner audio gives the VAD a clearer signal.
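The silence-threshold trade-off is easier to see in code. The sketch below is a deliberately simplified stand-in: it uses a fixed RMS energy threshold, whereas production VADs are usually trained models, and every parameter value (20 ms frames, 0.02 energy threshold, the two silence thresholds) is an assumption chosen for illustration.

```python
import math
from typing import Optional, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in -1.0..1.0)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_end_of_turn(frames: Sequence[Sequence[float]],
                       frame_ms: int = 20,
                       energy_threshold: float = 0.02,
                       silence_ms: int = 500) -> Optional[int]:
    """Return the index of the frame where the turn is judged finished.

    The turn ends once speech has been heard and is followed by at least
    `silence_ms` of continuous silence. Returns None if that never happens.
    """
    heard_speech = False
    silent_run_ms = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= energy_threshold:
            heard_speech = True
            silent_run_ms = 0          # any speech resets the silence timer
        elif heard_speech:
            silent_run_ms += frame_ms
            if silent_run_ms >= silence_ms:
                return index
    return None


# Synthetic audio: speech, a 240 ms thinking pause, more speech, then silence.
speech = [0.5] * 160
silence = [0.0] * 160
frames = [speech] * 10 + [silence] * 12 + [speech] * 10 + [silence] * 30

print(detect_end_of_turn(frames, silence_ms=200))  # fires inside the pause
print(detect_end_of_turn(frames, silence_ms=500))  # waits for the real end
```

With the 200 ms threshold the detector fires during the mid-sentence pause and the agent would interrupt; with 500 ms it waits for the true end of the turn, at the price of an extra 300 ms of silence before every reply.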

The second is interruption handling, often called barge-in: what happens when the user starts talking while the agent is still talking. Humans interrupt constantly, and a system that ignores it feels broken. In a cascaded pipeline the handling is explicit work: when the VAD detects the user's speech while the agent is mid-sentence, it fires an interruption event that stops the agent's audio immediately, throws away the audio already queued to play, and restarts the listening pass (Coval, 2026). End-to-end models tend to handle turn-taking, backchanneling, and interruption more naturally, because the single model was trained on the rhythm of real dialogue rather than having interruption bolted on afterward (Coval, 2026). This native fluency is one of the strongest reasons to reach for an end-to-end model when the conversation needs to feel human.
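The three barge-in steps for a cascaded pipeline (stop the audio, discard the queue, restart listening) can be sketched as a small controller. The class and method names are hypothetical, and real audio playback and speech-to-text are replaced by a queue and a counter so the logic can run on its own.

```python
from collections import deque
from typing import Optional


class BargeInController:
    """Minimal sketch of explicit interruption handling in a cascade."""

    def __init__(self) -> None:
        self.playback_queue = deque()  # agent audio chunks waiting to play
        self.agent_speaking = False
        self.stt_restarts = 0          # stands in for restarting the listening pass
        self.dropped_chunks = 0

    def queue_agent_audio(self, chunk: bytes) -> None:
        self.playback_queue.append(chunk)
        self.agent_speaking = True

    def play_next_chunk(self) -> Optional[bytes]:
        """Hand the next chunk to the (imaginary) audio device."""
        if not self.playback_queue:
            self.agent_speaking = False
            return None
        return self.playback_queue.popleft()

    def on_user_speech_detected(self) -> bool:
        """Called by the VAD. Returns True if this counted as an interruption."""
        if not self.agent_speaking:
            return False                       # normal turn, nothing to interrupt
        self.agent_speaking = False            # 1. stop the agent's audio
        self.dropped_chunks += len(self.playback_queue)
        self.playback_queue.clear()            # 2. discard audio queued to play
        self.stt_restarts += 1                 # 3. restart the listening pass
        return True


controller = BargeInController()
for chunk in (b"a", b"b", b"c"):
    controller.queue_agent_audio(chunk)
first_chunk = controller.play_next_chunk()
interrupted = controller.on_user_speech_detected()
print(first_chunk, interrupted, controller.dropped_chunks)
```

A production version also has to tell the language model how much of its reply the user actually heard, which this sketch leaves out.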

Common pitfall. The most frequent reason a voice agent that "worked in the demo" fails in production is not the model — it is end-of-turn detection and barge-in. Teams tune the silence threshold on their own clean speech in a quiet room, then ship to users with accents, background noise, and the human habit of pausing mid-sentence to think. The agent starts cutting people off. Always test end-of-turn detection on real, messy audio from your actual users before you trust a threshold, and budget engineering time for interruption handling as a first-class feature, not a finishing touch.
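One way to act on this advice is to measure, for each candidate threshold, how many real mid-sentence pauses it would mistake for the end of a turn. The sketch below assumes you have already extracted pause lengths from recordings of your own users; the numbers shown are invented placeholders, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def premature_cutoff_rate(mid_sentence_pauses_ms: Sequence[int],
                          silence_threshold_ms: int) -> float:
    """Fraction of mid-sentence pauses long enough to trigger end-of-turn."""
    if not mid_sentence_pauses_ms:
        return 0.0
    cut_off = sum(1 for pause in mid_sentence_pauses_ms
                  if pause >= silence_threshold_ms)
    return cut_off / len(mid_sentence_pauses_ms)


def sweep_thresholds(mid_sentence_pauses_ms: Sequence[int],
                     thresholds_ms: Sequence[int]) -> Dict[int, float]:
    """Cut-off rate for each candidate silence threshold."""
    return {t: premature_cutoff_rate(mid_sentence_pauses_ms, t)
            for t in thresholds_ms}


# Placeholder pause lengths in milliseconds (not real measurements).
pauses = [150, 300, 450, 700, 900]
rates = sweep_thresholds(pauses, [200, 500, 800])
for threshold, rate in rates.items():
    print(f"{threshold} ms threshold -> {rate:.0%} of pauses cut off")
```

Read the result alongside the latency budget: every extra millisecond of threshold lowers the cut-off rate but is added to the silence before each reply.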

The Three Systems That Define 2026

With the architecture, the latency budget, the transports, and the turn-taking problem in hand, the three named systems become easy to place. Two are conversation engines you rent; one is a translation model you can host yourself.

OpenAI Realtime API — the production conversation engine

OpenAI's Realtime API is the most widely deployed speech-to-speech conversation system in 2026. It became generally available — that is, stable and supported for production rather than an experiment — alongside a model called gpt-realtime, OpenAI's first general-availability speech-to-speech model, which responds to audio and text over WebRTC, WebSocket, or SIP (OpenAI, 2025). Through 2026 the family expanded: the current documentation lists gpt-realtime-2 for voice agents, a dedicated gpt-realtime-translate for live translation, and gpt-realtime-whisper for streaming transcription, with the newest voice models adding the ability to reason, translate, and transcribe as they listen (OpenAI, 2026; 9to5Mac, 2026).

The model is measurably stronger than the December 2024 version it replaced. On the Big Bench Audio evaluation, which measures reasoning over spoken input, gpt-realtime scores 82.8 percent against the older model's 65.6 percent; on MultiChallenge, an instruction-following benchmark, it scores 30.5 percent against 20.6 percent; on ComplexFuncBench, a function-calling benchmark that tests the model's ability to trigger your app's tools at the right moment, it scores 66.5 percent against 49.7 percent (OpenAI, 2025). Those last two matter for real products: a support agent has to follow a disclaimer script word for word and look up an order at the right time, and both improved sharply.

It is also reachable through Microsoft's Azure as the "GPT Realtime API", so teams already on Azure can adopt it without leaving their cloud (Microsoft, 2026).

Gemini Live — native audio, priced to scale

Google's answer is Gemini Live, the live-conversation mode of its Gemini models. Its headline technical idea is native audio: a single model processes raw sound directly rather than converting it to text first, which is the move that cuts latency (Google Cloud, 2026). In the API the live models accept text, images, audio, and video and are built for low-latency back-and-forth dialogue, with the conversation carried over a WebSocket connection (Google, 2026).

The reason Gemini Live shows up in every 2026 build-or-buy comparison is price. Its audio is billed per token of sound, with input audio counted at 32 tokens per second and output audio at 25 tokens per second (Google, 2026). OpenAI also bills audio per token, but at high volume Gemini's token math comes out dramatically cheaper: independent comparisons in 2026 put Gemini's input-audio price roughly an order of magnitude below OpenAI's for high-volume voice work (Finout, 2026). If your product runs millions of minutes of conversation, that gap is the difference between a feature that pays for itself and one that does not. We put real numbers on this in the cost section below.

SeamlessM4T — the open translation model you can host

Meta's SeamlessM4T is the odd one out, and deliberately so: it is built for the translation job, not the conversation job, and it is open-weights, meaning the trained model file is published for you to download and run on your own hardware. Version 2, released by Meta's Seamless Communication team, is a single model that handles speech-to-speech translation, speech-to-text, text-to-speech, and plain transcription across a wide language span: roughly 101 languages understood as speech input, 96 as text, and about 35 as direct speech translation output (Meta AI, 2023; Hugging Face, 2026).

Under the hood it is, in fact, a cascade hidden inside one released package: a speech encoder built on a 24-layer wav2vec-BERT model pre-trained on 4.5 million hours of audio, a text decoder borrowed from Meta's NLLB translation model, and a non-autoregressive unit decoder called UnitY2 that generates the sound of the translated speech, finished by a vocoder that turns those units into an audible wave (Meta AI, 2023). The UnitY2 piece is the v2 advance: it predicts the speech units in a faster, more data-efficient way than v1, improving both quality and speed (Hugging Face, 2026).

The catch is the licence. SeamlessM4T is released under CC-BY-NC — the "NC" means non-commercial — so you can research and prototype with it freely, but using it inside a product you sell requires a separate commercial arrangement with Meta (Hugging Face, 2026). For a live interpreter feature you intend to ship, treat SeamlessM4T as the reference design and the open baseline, and confirm licensing before you build a business on it. For real-time translation inside a call, where you also care about how it rides the WebRTC pipeline, this article hands off to the dedicated lessons on real-time multilingual speech translation in calls and real-time speech translation in a WebRTC call rather than duplicating them here.

How The Three Compare

The table below lines the three systems up on the axes that decide a build: what job they do, whether you host or rent, how you connect, language reach, and the openness trade-off. The lightly tinted cell in each row is the option that usually wins that criterion — though "wins" depends entirely on which job you are doing.

Criterion	OpenAI Realtime API	Gemini Live	SeamlessM4T v2
Primary job	Conversation (voice agent)	Conversation (voice agent)	Translation (interpreter)
Architecture	End-to-end speech-to-speech	End-to-end, native audio	Modular cascade in one package
Host or rent	Rent (hosted API)	Rent (hosted API)	Host (open-weights)
Connection	WebRTC · WebSocket · SIP	WebSocket	You wire it up yourself
Audio-out languages	Multilingual, can switch mid-sentence	Multilingual	~35, direct speech translation
Cost shape	Per-token audio, ~$0.18–0.46/min	Per-token audio, ~10× cheaper input	GPU + electricity (self-host)
Inspectable middle	No readable transcript	No readable transcript	Yes — text stage is exposed
Licence	Commercial, pay-as-you-go	Commercial, pay-as-you-go	CC-BY-NC (non-commercial)
Best when	Production voice agent, phone support	High-volume conversation on a budget	Open translation baseline, research

The pattern to take away: for a conversational voice agent you are choosing between two rented end-to-end engines, where OpenAI leads on connection options and production polish and Gemini leads on price at scale. For a translation feature you are in a different market, where SeamlessM4T is the open reference but its non-commercial licence is a gate you must clear.

What A Minute Of Conversation Actually Costs

Pricing is where the architecture choice meets the budget, so it is worth doing the arithmetic out loud once. Hosted speech-to-speech APIs bill by the token, and for audio a token is a slice of time, not a slice of text. On OpenAI's Realtime API, user audio is counted at one token per 100 milliseconds and the assistant's spoken audio at one token per 50 milliseconds (CallSphere, 2026). Let us turn that into a per-minute figure.

Take one minute of the user talking. One minute is 60 seconds, which is 60,000 milliseconds. At one token per 100 milliseconds:

user audio tokens per minute = 60,000 ms ÷ 100 ms/token = 600 tokens

Now one minute of the assistant talking back, at one token per 50 milliseconds:

assistant audio tokens per minute = 60,000 ms ÷ 50 ms/token = 1,200 tokens

gpt-realtime is priced at $32 per million audio input tokens and $64 per million audio output tokens (OpenAI, 2025). So a minute of the user speaking and a minute of the assistant replying costs:

input cost  = 600 tokens   × $32 / 1,000,000 = $0.0192
output cost = 1,200 tokens × $64 / 1,000,000 = $0.0768
   ──────────────────────────────────────────────────
   per-minute (both directions, uncached)   ≈ $0.096
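The same arithmetic as a reusable calculation, so you can plug in your own talk-time split. The token durations and prices are the ones quoted above; the function names are assumptions made for this sketch, and the result ignores conversation history and caching.

```python
MS_PER_MINUTE = 60_000

# Figures quoted above for gpt-realtime.
USER_MS_PER_TOKEN = 100        # user audio: 1 token per 100 ms
ASSISTANT_MS_PER_TOKEN = 50    # assistant audio: 1 token per 50 ms
INPUT_USD_PER_MILLION = 32.0
OUTPUT_USD_PER_MILLION = 64.0


def audio_tokens(duration_ms: int, ms_per_token: int) -> int:
    """Number of audio tokens billed for a stretch of speech."""
    return duration_ms // ms_per_token


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Dollar cost of a token count at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


user_tokens = audio_tokens(MS_PER_MINUTE, USER_MS_PER_TOKEN)
assistant_tokens = audio_tokens(MS_PER_MINUTE, ASSISTANT_MS_PER_TOKEN)
input_cost = cost_usd(user_tokens, INPUT_USD_PER_MILLION)
output_cost = cost_usd(assistant_tokens, OUTPUT_USD_PER_MILLION)
total_cost = input_cost + output_cost

print(f"user tokens: {user_tokens}, assistant tokens: {assistant_tokens}")
print(f"input ${input_cost:.4f} + output ${output_cost:.4f} = ${total_cost:.4f}")
```

Note that the output side costs four times the input side per minute of speech: twice as many tokens at twice the price per token.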

In practice a real agent also re-sends conversation history each turn, which is why independent measurements put a typical uncached minute between roughly $0.18 and $0.46, falling to $0.05–$0.10 once prompt caching — reusing the unchanging part of the prompt at a steep discount — is switched on (CallSphere, 2026). The lever there is real: OpenAI prices cached audio input at $0.40 per million tokens against $32 for fresh input, an 80-fold discount on the repeated part (OpenAI, 2025).
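The effect of caching on the input price can be approximated as a blend of the two quoted rates. The cache-hit fraction is an assumption you would measure from your own traffic; the 0.5 used below is purely illustrative.

```python
FRESH_INPUT_USD_PER_MILLION = 32.0    # quoted fresh audio input price
CACHED_INPUT_USD_PER_MILLION = 0.40   # quoted cached audio input price


def blended_input_price(cached_fraction: float) -> float:
    """Effective price per million input tokens for a given cache-hit share."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return (cached_fraction * CACHED_INPUT_USD_PER_MILLION
            + (1.0 - cached_fraction) * FRESH_INPUT_USD_PER_MILLION)


discount_factor = FRESH_INPUT_USD_PER_MILLION / CACHED_INPUT_USD_PER_MILLION
half_cached = blended_input_price(0.5)

print(f"discount on cached tokens: {discount_factor:.0f}x")
print(f"effective price at 50% cached: ${half_cached:.2f} per million")
```

Because the cached rate is so small, the effective price is driven almost entirely by the share of input that is not cached, so keeping the stable part of the prompt identical across turns matters more than shaving the prompt length.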

This is exactly where Gemini Live changes the math. Because its audio input is billed at a far lower price, a product running millions of minutes a month can see its largest line item, input audio, drop by roughly an order of magnitude (Finout, 2026). At small scale the difference is noise; at large scale it can decide whether the feature ships. The discipline is to model your own volume against both price sheets before you commit, as the lesson on the broader cost model for AI in video products lays out. The one-page checklist below packages this comparison so you can run it in a meeting.
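When you model both price sheets, remember that the two vendors count tokens at different rates, so per-token prices are not directly comparable. The sketch below uses only the token rates and the OpenAI input price quoted in this article; it does not assume any Gemini price, and instead computes the per-token price at which a minute of input audio would cost the same on both. Function names are hypothetical.

```python
def tokens_per_minute(tokens_per_second: float) -> float:
    """Convert a per-second token rate to tokens per minute of audio."""
    return tokens_per_second * 60


def break_even_price(reference_price: float,
                     reference_tokens_per_min: float,
                     other_tokens_per_min: float) -> float:
    """Per-million-token price at which the other vendor costs the same per minute."""
    return reference_price * reference_tokens_per_min / other_tokens_per_min


OPENAI_INPUT_USD_PER_MILLION = 32.0
openai_input_tpm = tokens_per_minute(10)   # 1 token per 100 ms = 10 tokens/sec
gemini_input_tpm = tokens_per_minute(32)   # 32 tokens/sec
gemini_output_tpm = tokens_per_minute(25)  # 25 tokens/sec

token_ratio = gemini_input_tpm / openai_input_tpm
parity_price = break_even_price(OPENAI_INPUT_USD_PER_MILLION,
                                openai_input_tpm, gemini_input_tpm)

print(f"Gemini counts {token_ratio:.1f}x more input tokens per minute")
print(f"per-minute parity at ${parity_price:.2f} per million input tokens")
```

Compare the parity figure with the current published Gemini input-audio price: the per-minute saving is the ratio between the two, not the ratio between the headline per-token prices.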

Where Fora Soft Fits In

Speech-to-speech is not a single feature but a pattern that shows up across the products we build. In video conferencing it becomes a meeting assistant that listens and replies aloud, or a live interpreter between participants. In telemedicine it becomes a low-latency intake line that talks a patient through pre-visit questions. In e-learning it becomes a conversational tutor that a learner can interrupt and ask to slow down. In OTT and Internet TV it becomes spoken navigation and live dubbing. The engineering judgment in each case is the same one this article frames: cascade or end-to-end, which transport, whose price sheet, and how carefully the turn-taking is tuned — and that judgment, made early, is what keeps a voice feature from feeling robotic once real users arrive.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your OpenAI Realtime API plan.
See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Speech-to-Speech Architecture Decision Checklist: one-page printable covering cascade-vs-end-to-end, the 800 ms voice-to-voice budget, the WebRTC/WebSocket/SIP transport choice, the turn-taking and barge-in tests, and a side-by-side of OpenAI Realtime API, Gemini Live, and SeamlessM4T on job,….

References

OpenAI. Introducing gpt-realtime and Realtime API updates for production voice agents, August 2025, accessed 2026-05-31. https://openai.com/index/introducing-gpt-realtime/. First-party source for: the Realtime API reaching general availability; gpt-realtime as OpenAI's first GA speech-to-speech model processing and generating audio through a single model and API; support for WebRTC, WebSocket and SIP; the benchmark scores (Big Bench Audio 82.8% vs 65.6%, MultiChallenge 30.5% vs 20.6%, ComplexFuncBench 66.5% vs 49.7%); image input and remote MCP support; the new Cedar and Marin voices; and the pricing ($32 / 1M audio input tokens, $0.40 cached input, $64 / 1M audio output tokens — a 20% cut vs gpt-4o-realtime-preview).
OpenAI. Realtime and audio (Realtime API overview and architecture guide), accessed 2026-05-31. https://developers.openai.com/api/docs/guides/realtime. First-party source for: the May-2026 model family (gpt-realtime-2 for voice agents, gpt-realtime-translate for live translation, gpt-realtime-whisper for transcription); the transport rule (WebRTC for browser/mobile device audio, WebSocket for server media pipelines, SIP for telephony); the dedicated translation session on /v1/realtime/translations; and Realtime 2 adding reasoning to speech-to-speech workflows.
OpenAI. Delivering low-latency voice AI at scale, accessed 2026-05-31. https://openai.com/index/delivering-low-latency-voice-ai-at-scale/. First-party source for the ~800 ms voice-to-voice achievable figure, the ~500 ms API time-to-first-byte in US regions, and the ~300 ms left for capture, VAD, network and rendering.
Google. Gemini Developer API — Live API capabilities and pricing, accessed 2026-05-31. https://ai.google.dev/gemini-api/docs/live-api/capabilities and https://ai.google.dev/gemini-api/docs/pricing. First-party source for: native-audio Live models accepting text, image, audio and video; WebSocket transport; and the audio token rates (input audio 32 tokens/sec, output audio 25 tokens/sec).
Google Cloud. How to use Gemini Live API Native Audio in Vertex AI, accessed 2026-05-31. https://cloud.google.com/blog/topics/developers-practitioners/how-to-use-gemini-live-api-native-audio-in-vertex-ai. First-party source for native audio as the core latency-reducing innovation (single model processes raw audio directly) and the 500–800 ms audio-chunk latency guidance.
Meta AI / Seamless Communication. Seamless: Multilingual Expressive and Streaming Speech Translation and the SeamlessM4T research page, December 2023, accessed 2026-05-31. https://ai.meta.com/research/seamless-communication/ and arXiv:2312.05187. Primary source for SeamlessM4T v2's modular architecture: 24-layer wav2vec-BERT 2.0 speech encoder pre-trained on 4.5M hours of audio, NLLB-derived text decoder, the UnitY2 non-autoregressive unit decoder with hierarchical upsampling, and the HiFi-GAN-inspired vocoder; and the unified S2ST / S2TT / T2ST / T2TT / ASR task coverage.
Hugging Face. SeamlessM4T-v2 model documentation and facebook/seamless-m4t-v2-large model card, accessed 2026-05-31. https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2 and https://huggingface.co/facebook/seamless-m4t-v2-large. Source for the language coverage (~101 languages speech input, 96 text, ~35 speech output), the UnitY2 quality-and-speed improvement over v1, and the CC-BY-NC 4.0 (non-commercial) licence.
Coval. Speech-to-Speech vs Cascaded Voice AI: Which Architecture Should You Deploy?, 2026, accessed 2026-05-31. https://www.coval.ai/blog/speech-to-speech-vs-cascaded-voice-ai-which-architecture-should-you-deploy. Independent voice-AI evaluation source for: the cascade remaining the 2026 production default for transparency, debuggability and component-swapping; end-to-end models handling turn-taking, backchanneling and interruption more naturally; the VAD-plus-ASR-confidence accuracy point; and the explicit interruption-event / flush-queue / restart-STT barge-in mechanics. Where this independent source and a vendor's framing diverge, the article follows the independent source and labels vendor claims as such.
CallSphere. OpenAI Realtime API Cost Per Minute: The Real Math for 2026, accessed 2026-05-31. https://callsphere.ai/blog/vw2c-openai-realtime-cost-per-minute-math-2026. Source for the audio-token duration encoding (user audio 1 token / 100 ms, assistant audio 1 token / 50 ms), the resulting per-minute ranges ($0.18–0.46 uncached, $0.05–0.10 cached), and the worked token arithmetic; cross-checked against the first-party OpenAI pricing in reference 1.
Finout. Gemini Pricing in 2026 for Individuals, Orgs & Developers, accessed 2026-05-31. https://www.finout.io/blog/gemini-pricing-in-2026. Source for the roughly order-of-magnitude input-audio price advantage of Gemini Live over OpenAI Realtime for high-volume voice applications.
Microsoft. Use the GPT Realtime API for speech and audio with Azure OpenAI (and the WebRTC / WebSockets how-to pages), accessed 2026-05-31. https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/realtime-audio. First-party source confirming the GPT Realtime API is available through Azure OpenAI over WebRTC and WebSockets.
9to5Mac. OpenAI has new voice models that reason, translate, and transcribe as you speak, 7 May 2026, accessed 2026-05-31. https://9to5mac.com/2026/05/07/openai-has-new-voice-models-that-reason-translate-and-transcribe-as-you-speak/. Secondary press source corroborating the May-2026 split of the Realtime family into reasoning, translation and transcription variants; used only to date the release, with the capability claims traced to first-party reference 2.