How real-time speech translation works at production scale. The architectural fork every project hits: cascaded (ASR → MT → TTS, three vendors, three logs) versus end-to-end speech-to-speech (Meta SeamlessM4T v2, DeepL Voice, Google Translatotron). The six trade-offs that define every translation system. The latency budget that decides whether the experience feels natural or broken. Written from the platforms we have shipped: Translinguist (16+ language pairs across telehealth, legal, live events), VOLO.live (Black Hat USA 2025, 22,000 participants, six languages), Rafiky (conference interpretation).
Real-time speech translation is the process of converting spoken audio in one language into spoken audio or text in another language with low enough latency that two-way conversation can flow naturally. Two architectures dominate in 2026: cascaded ASR → MT → TTS and end-to-end speech-to-speech using a single multilingual model.
A live video translation system captures audio from a WebRTC session, runs streaming automatic speech recognition to produce a transcript, translates the transcript with a machine-translation model, generates target-language speech with text-to-speech, and publishes the translated audio back into the session as a parallel audio track. End-to-end glass-to-glass latency typically lands at 1.2 to 3.0 seconds.
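The stage order described above can be sketched as a chain of async generators. This is a minimal illustration, not a production pipeline: `fake_asr`, `fake_mt`, `fake_tts`, and the glossary-lookup "translation" are hypothetical stand-ins for real streaming vendor clients, and the WebRTC capture and publish steps are reduced to a plain list of byte chunks.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Dict, List


@dataclass
class Segment:
    """One committed piece of text moving through the pipeline."""
    text: str
    lang: str


async def mic(chunks: List[bytes]) -> AsyncIterator[bytes]:
    # Stand-in for audio frames arriving from a WebRTC session.
    for chunk in chunks:
        yield chunk


async def fake_asr(audio: AsyncIterator[bytes], lang: str) -> AsyncIterator[Segment]:
    # Hypothetical ASR: pretends every audio chunk decodes to one phrase.
    async for chunk in audio:
        yield Segment(text=chunk.decode("utf-8"), lang=lang)


async def fake_mt(segments: AsyncIterator[Segment], target: str,
                  glossary: Dict[str, str]) -> AsyncIterator[Segment]:
    # Hypothetical MT: word-by-word glossary lookup, unknown words pass through.
    async for seg in segments:
        words = [glossary.get(word, word) for word in seg.text.split()]
        yield Segment(text=" ".join(words), lang=target)


async def fake_tts(segments: AsyncIterator[Segment]) -> AsyncIterator[bytes]:
    # Hypothetical TTS: emits tagged bytes in place of synthesized audio.
    async for seg in segments:
        yield f"[{seg.lang}] {seg.text}".encode("utf-8")


async def translate_session(chunks: List[bytes], source: str, target: str,
                            glossary: Dict[str, str]) -> List[bytes]:
    # Each stage consumes the previous one as a stream, so the first frame
    # can leave TTS before the last audio chunk has arrived.
    asr = fake_asr(mic(chunks), source)
    mt = fake_mt(asr, target, glossary)
    return [frame async for frame in fake_tts(mt)]


if __name__ == "__main__":
    frames = asyncio.run(
        translate_session([b"hello world"], "en", "es",
                          {"hello": "hola", "world": "mundo"})
    )
    print(frames)
```

Chaining generators keeps every stage streaming, which is the property the latency figures in this piece depend on. A real system would publish each output frame to the target-language audio track instead of collecting frames into a list.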
Cascaded chains three vendors and three models. Each stage emits inspectable text or audio, which makes per-stage observability and PII redaction simple. End-to-end speech-to-speech takes audio in and emits audio out from a single model. End-to-end wins on latency and prosody preservation. Cascaded wins on accuracy for technical vocabulary, vendor flexibility, and audit-trail clarity. Most production systems still ship cascaded in 2026. Not sure yet whether you need a translator, an interpreter, or AI at all? Read our 2026 decision tree first.
Four shapes of real-time speech translation dominate the 2026 landscape. Each one fits a different architecture, vendor stack, and latency ceiling.
Multi-party live translation for conferences, summits, online events. Cascaded ASR → MT → TTS, optional human-interpreter fallback. Scale 100 to 22,000 participants across 4 to 8 simultaneous languages. Reference: VOLO.live at Black Hat USA 2025, 22,000 participants, six languages, sub-three-second end-to-end latency.
Two-way translated conversations for medical, legal, customer-support workflows. HIPAA BAA chain across ASR / MT / TTS vendors. Voice cloning preserves practitioner identity. Reference: Translinguist for telehealth interpretation across 16+ language pairs.
Display-only translation. Captions over the original audio. Lower latency target (sub-1.5 seconds end-to-end, no TTS stage). Used for broadcast, accessibility compliance, hearing-impaired audiences. KUDO, Wordly, Interprefy lead this segment.
Developer-focused SDK / API. Deepgram, AssemblyAI, AWS Transcribe + Translate, Azure Speech Translation, Google Cloud Speech-to-Speech. Used inside chat apps, meeting platforms, customer-service tools.
Telehealth interpretation across 16+ language pairs. Live multilingual translation at Black Hat USA 2025. Conference interpretation platform mixing AI and human interpreters. Three production builds running today across very different shapes.
Translinguist architecture. Cascaded ASR → MT → TTS with voice cloning at the TTS layer.
Outcome. 16+ language pairs in production. Sub-two-second cascaded end-to-end latency. Voice cloning preserves speaker identity. The HIPAA-compliant BAA chain runs from the cloud provider through every model vendor. PHI redaction between ASR and MT stages prevents identifiers from being persisted in translation logs.
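A redaction step between ASR and MT can be sketched as a text filter applied to each committed transcript before it reaches the translation stage or any log. The patterns below are illustrative assumptions only (US-style SSN, phone, email, and numeric date formats). A production PHI redactor would normally combine patterns like these with a named-entity model and a reviewed identifier list.

```python
import re

# Hypothetical pattern set. Order matters: SSNs are masked before phone
# numbers so the two digit formats never compete for the same text.
PHI_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def redact_phi(transcript: str) -> str:
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PHI_PATTERNS:
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript


if __name__ == "__main__":
    raw = "Call me at 555-123-4567 or jane@example.com, SSN 123-45-6789, DOB 04/12/1980."
    print(redact_phi(raw))
```

Placeholders such as `[PHONE]` pass through most MT systems unchanged, so the translated text stays readable while the identifier itself never reaches the translation logs.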
VOLO.live architecture. Hybrid cloud plus self-hosted. Cascaded translation with conference-scale autoscaling on the public broadcast tier. Self-hosted speaker tracks for control over voice cloning.
Outcome. 22,000+ participants at Black Hat USA 2025. Six-language live translation. Sub-three-second end-to-end latency at peak load. Scale shape: 22K listeners × 6 languages = 132K simultaneous translation streams at keynote peaks. Per-language audio tracks generated server-side and distributed via CDN edge cache.
Rafiky architecture. Cascaded ASR → MT → TTS with a human-interpreter fallback option for high-stakes sessions.
Outcome. Production conference interpretation platform serving multi-language events at scale. Architecture mirrors KUDO and Interprefy in shape: cascaded AI for routine multilingual broadcast, human interpreters bookable on demand for high-stakes sessions. The mixed AI / human model is the dominant 2026 pattern for premium conference work.
Three architectural paths for shipping a real-time speech translation product. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether translation is the product or a feature of another product. Still deciding whether you need AI at all, or a human translator, or a human interpreter? Read the 2026 decision tree first. The framework below picks up after that decision.
Custom build wins when: Translation is the product. Custom voice cloning required. Regulated industry (HIPAA, EU AI Act high-risk). Branded experience embedded in your own product. Multi-tenant SaaS plays. Low-resource language pairs not covered by major platforms. Custom domain glossaries at the heart of the offering.
Cost shape: $8K-$40K build over 1-3 months. $500-$2K monthly operations. Per-minute cost $0.03-$0.17 depending on stack tier.
Archetypes: Translinguist, VOLO.live, Rafiky.
Off-the-shelf platform wins when: Conference interpretation use case. Standard language pairs (top 30). No custom voice cloning. No custom domain glossary. No in-house engineering capacity. Willing to live within the vendor's UX and feature set.
Cost shape: $5K-$50K per major event for conference platforms. Per-meeting and subscription pricing varies. Operationally simpler than building. See the four-vendor public-data comparison.
Hybrid AI and human routing wins when: Conference and event work spans routine sessions (AI) and high-stakes keynotes (human interpreters). The dominant 2026 pattern for premium events. KUDO, Interprefy, and Translinguist all support this routing model.
Pattern: AI cascaded translation handles routine sessions and high-volume broadcast tracks. Certified human interpreters cover keynotes, regulated proceedings, and multi-language Q&A panels. Event organizers route each session according to its stakes.
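Per-session routing of this kind can be expressed as a small lookup. The sketch below assumes each session carries a `kind` label assigned by the organizer. The label names and the `HIGH_STAKES` set are hypothetical, chosen to mirror the three session types named above.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AI = "ai_cascaded"
    HUMAN = "human_interpreter"


# Hypothetical labels for the session types that go to human interpreters.
HIGH_STAKES = {"keynote", "regulated_proceeding", "multilingual_qa_panel"}


@dataclass(frozen=True)
class Session:
    name: str
    kind: str


def route_session(session: Session) -> Route:
    """Send high-stakes session kinds to humans, everything else to AI."""
    return Route.HUMAN if session.kind in HIGH_STAKES else Route.AI


if __name__ == "__main__":
    agenda = [
        Session("Opening keynote", "keynote"),
        Session("Track B breakout", "breakout"),
    ]
    for item in agenda:
        print(item.name, "->", route_session(item).value)
```

Keeping the rule as data (a set of kinds) rather than branching logic lets an organizer change what counts as high-stakes without touching the router.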
Cost ranges are 2026-indicative. Implementation specifics — concurrency target, language pair count, compliance scope, voice cloning vs synthetic voices, custom glossary depth — dominate the spread within each tier.
A custom real-time speech translation system costs $20K–$150K to build over 1–4 months. KUDO and Interprefy ship event-ready in days. Hybrid (AI for scale, human for stakes) is the dominant 2026 pattern for premium conference and event work.
Real-time speech translation converts spoken audio in one language into spoken audio or text in another with low enough latency for natural two-way conversation. Two architectures dominate in 2026: cascaded (ASR → MT → TTS, three sequential models, 1.2-2.5 s end-to-end) and end-to-end speech-to-speech (Meta SeamlessM4T v2, DeepL Voice, 200-700 ms end-to-end). Cascaded is the production default for 85% of deployments. End-to-end is rising for premium consumer and conference use cases.
A live video translation system captures audio from a WebRTC video session, runs streaming ASR to produce a transcript, translates the transcript with a machine-translation model, generates target-language speech with text-to-speech, and publishes the translated audio back as a parallel audio track. Listeners subscribe to their preferred language. Total round-trip typically lands at 1.2 to 3.0 seconds.
Cascaded uses three separate models: ASR turns speech into text, MT translates the text, TTS turns translated text back into speech. End-to-end uses one model that takes source audio in and emits target audio out directly. Cascaded offers vendor flexibility, per-stage observability, and PII redaction between stages, at the cost of higher latency. End-to-end offers sub-second latency and prosody preservation, at the cost of vendor lock-in, weaker observability, and lower accuracy on technical vocabulary.
Accuracy depends on language pair and architecture. For top pairs (English ↔ Spanish, French, German, Mandarin), cascaded systems achieve BLEU scores of 38 to 45, near-human on conversational content. For medium-resource pairs (English ↔ Vietnamese, Turkish, Hindi), BLEU drops to 28 to 37. For low-resource pairs, BLEU sits below 28 without domain fine-tuning. Streaming ASR adds 5 to 8 percentage points of word error rate (WER), which degrades end-to-end accuracy. Plan for 10 to 20 percent perceived translation quality degradation versus an oracle text-in text-out baseline.
Glass-to-glass latency targets in 2026: the cascaded production median is 1.4 to 1.7 seconds, and end-to-end speech-to-speech translation (S2ST) runs 400 to 700 ms. Below 700 ms feels real-time. 700 ms to 1.5 s is conversationally acceptable. 1.5 to 3 s is visibly lagged. Above 3 s feels broken. The biggest latency lever is streaming at every stage. The second biggest is co-locating the SFU, ASR, MT, and TTS workers in the same region.
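The perception bands above translate directly into a monitoring helper. The sketch below is one possible encoding. The text does not say which band the exact boundary values (1.5 s and 3 s) belong to, so assigning them to the better band is an assumption.

```python
def perceived_quality(latency_ms: float) -> str:
    """Map a glass-to-glass latency measurement to a perception band."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 700:
        return "real-time"
    if latency_ms <= 1500:  # assumption: exactly 1.5 s still counts as acceptable
        return "conversationally acceptable"
    if latency_ms <= 3000:  # assumption: exactly 3 s still counts as lagged
        return "visibly lagged"
    return "broken"


if __name__ == "__main__":
    for sample in (450, 1400, 1700, 3200):
        print(sample, "ms ->", perceived_quality(sample))
```

Feeding percentile measurements (for example p50 and p95) through a helper like this gives a quick view of how often sessions drift from one band into the next.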
ASR: Deepgram (Nova-3, Flux), AssemblyAI Universal-Streaming, OpenAI Realtime STT, Whisper Large-v3, Azure, Google Cloud Chirp 2, Speechmatics, NVIDIA Canary. MT: DeepL API, Meta NLLB-3.3B, Microsoft Translator, Google Cloud Translation, GPT-4o-mini and Claude Haiku (prompted), Amazon Translate. TTS: ElevenLabs Multilingual v2, Cartesia Sonic 3, Azure Neural, OpenAI TTS, Google Cloud TTS, Deepgram Aura, PlayHT. End-to-end: Meta SeamlessM4T v2, DeepL Voice, Google Translatotron 3, OpenAI Realtime API multilingual, Gemini Live.
Cascaded per-minute cost ranges from $0.03 (budget self-host stack) to $0.17 (premium with voice cloning), single language pair. Each additional target language adds $0.03 to $0.15 per minute. End-to-end speech-to-speech self-hosted runs $0.015 to $0.040 per minute. Conference platform pricing (KUDO, Interprefy) runs $5K to $50K per major event. Custom build costs $20K to $150K depending on scope. Ongoing operations $2K to $10K per month.
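As a worked example of the per-minute figures above, the sketch below turns them into a low and high estimate for a session. It assumes the first target language is covered by the base rate and that each additional language adds linearly, which is a simplification. The function name is hypothetical.

```python
from typing import Tuple

# Per-minute figures quoted in the text (USD, cascaded, single language pair).
BASE_LOW, BASE_HIGH = 0.03, 0.17
# Added per minute for each additional target language.
EXTRA_LOW, EXTRA_HIGH = 0.03, 0.15


def cascaded_cost_range(minutes: float, target_languages: int) -> Tuple[float, float]:
    """Return (low, high) USD estimates for one cascaded session."""
    if target_languages < 1:
        raise ValueError("at least one target language is required")
    extra = target_languages - 1
    low = minutes * (BASE_LOW + extra * EXTRA_LOW)
    high = minutes * (BASE_HIGH + extra * EXTRA_HIGH)
    return round(low, 2), round(high, 2)


if __name__ == "__main__":
    # One hour translated into three target languages.
    print(cascaded_cost_range(60, 3))
```

For a 60-minute session in three target languages this yields roughly $5.40 to $28.20, which shows how quickly language count, rather than duration alone, drives the spread.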
Voice cloning preserves the source speaker's voice characteristics in the target language. Listeners hear their own doctor or speaker speak their language directly rather than a generic synthetic voice. ElevenLabs Instant Voice Cloning (zero-shot, 30 seconds of audio), ElevenLabs Professional Voice Cloning (trained, 30+ minutes of audio), and Meta SeamlessM4T v2 Expressive all support voice preservation. Use it for premium 1:1 telehealth and consumer products. Skip it for evidentiary recordings, regulated workflows where synthetic voices are mandated, or when source consent is unclear.
HIPAA compliance for a translation pipeline rests on four layers. (1) Self-host on BAA-able infrastructure. (2) Use BAA-compliant ASR (Deepgram Enterprise, AssemblyAI Enterprise, Whisper self-host), BAA-compliant MT (Azure Translator, AWS Translate, NLLB-3.3B self-host), and BAA-compliant TTS (ElevenLabs Enterprise, Cartesia Enterprise). (3) Redact PHI between the ASR and MT stages. (4) Add audit logging, encrypted storage with six-year retention, RBAC, and automatic session termination. Fora Soft has shipped HIPAA-compliant translation deployments (Translinguist).
From 2 August 2026, the EU AI Act requires translation systems serving EU users to: (1) disclose AI translation at the start of every session, (2) log the interaction with timestamps, source and target language, and model versions, (3) document foundation-model provider compliance, and (4) report on bias and accuracy for high-risk uses. Engineering effort: 4 to 8 engineer-weeks for a standard-risk use case.
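The logging item, (2) in the list above, maps naturally onto a structured record written once per session. The sketch below shows one possible shape. The class name and field names are hypothetical, and the model version strings are placeholders rather than real vendor identifiers.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class TranslationAuditRecord:
    """One log entry per translated session (hypothetical schema)."""
    session_id: str
    source_lang: str
    target_lang: str
    model_versions: Dict[str, str]  # one entry per stage, e.g. asr / mt / tts
    ai_disclosure_shown: bool       # whether the AI notice was shown at session start
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = TranslationAuditRecord(
        session_id="session-001",
        source_lang="en",
        target_lang="de",
        model_versions={"asr": "asr-v1", "mt": "mt-v1", "tts": "tts-v1"},
        ai_disclosure_shown=True,
    )
    print(record.to_json())
```

Recording the disclosure flag alongside the model versions lets a single log line evidence both the disclosure and the logging items.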
AI will not replace human interpreters everywhere, but it already does for routine work. AI undercuts human interpreters on cost from the first hour and matches them on quality for routine multilingual broadcast in top language pairs. Human interpreters remain mandatory in 2026 for certified court-of-record proceedings, high-stakes medical procedures, top-tier conference keynotes, regulated immigration and asylum hearings, and most jurisdictions' legal requirements. The dominant 2026 pattern is hybrid: AI for scale, human for stakes, routed per session by KUDO, Interprefy, Translinguist, and Rafiky-style platforms.
Vendor latency claims usually measure one stage in isolation. End-to-end production latency stacks all stages plus network: audio transport, VAD, ASR, ASR commit, MT, TTS, outbound transport, playout. A cascaded production pipeline lands at 1.2 to 2.5 seconds end-to-end even when each stage hits its individual sub-second mark. Architecture choice and re-translation discipline matter more than any vendor's marketing-quoted latency.
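The stacking effect can be made concrete with a simple budget table. The per-stage numbers below are illustrative placeholders, not measurements from any vendor or from the platforms named in this piece. They are chosen only so that each stage is individually sub-second while the total lands inside the 1.2 to 2.5 second range.

```python
from typing import Dict

# Illustrative per-stage budget in milliseconds (assumed values, not measured).
ILLUSTRATIVE_BUDGET_MS: Dict[str, int] = {
    "audio_transport": 50,
    "vad": 100,
    "asr": 250,
    "asr_commit": 350,
    "mt": 150,
    "tts": 250,
    "outbound_transport": 50,
    "playout": 100,
}


def end_to_end_ms(budget: Dict[str, int]) -> int:
    """Total latency is the sum of every stage, not the slowest one."""
    return sum(budget.values())


def slowest_stage(budget: Dict[str, int]) -> str:
    """Name the stage contributing the most latency."""
    return max(budget, key=budget.get)


if __name__ == "__main__":
    total = end_to_end_ms(ILLUSTRATIVE_BUDGET_MS)
    print(f"end-to-end: {total} ms, slowest stage: {slowest_stage(ILLUSTRATIVE_BUDGET_MS)}")
```

With these placeholder values no stage exceeds 350 ms, yet the pipeline totals 1,300 ms, which is why a single vendor's quoted figure says little about what a listener experiences.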
Each piece below picks up where this pillar ends. The decision tree if you have not picked AI yet. The vendor synthesis if you have. The engineering playbook if you are building. The streaming engineering guide if your translation rides on a live stream.
If you are scoping a real-time speech translation system and want a second opinion on cascaded versus end-to-end, the vendor stack, language-pair complexity, the voice-cloning consent shape, or the EU AI Act compliance approach, write to us. A senior engineer with shipped translation platforms in production replies within 24 hours.
Specialist software house for video, real-time and AI products. Founded 2005. 50 in-house engineers.