Real-time multilingual translation in a video call — captions and speech translated across languages

Real-time multilingual translation in video calls moved from "sci-fi" to "table-stakes" in 18 months. In 2026 the question isn't whether your call platform can translate, but whether it can do it under 800 ms, in 40+ languages, with speaker turn-taking, domain vocabulary, and compliance controls. Most vendors win at only two or three of those five dimensions.

The 2026 multilingual video-call shortlist: Zoom Translated Captions, Google Meet adaptive audio, Microsoft Teams Intelligent Recap, Interprefy, KUDO, Interactio, and Meta SeamlessM4T-v2. End-to-end speech-to-speech now runs under 700 ms in 36 languages with voice preservation; cascaded stacks still sit at 800 ms–2 s but win on language coverage (100+).

Fora Soft has been building WebRTC video products since 2005 and shipping AI translation features in them since Whisper landed. This guide ranks the seven tools our product teams actually integrate in 2026 — with the latency math, the language coverage, and the procurement traps we've hit.

Key Takeaways for 2026

  • Two architectures, two latency classes. Cascaded ASR→MT→TTS (800 ms–2 s) still dominates enterprise. End-to-end speech-to-speech (400–700 ms) is catching up via Meta SeamlessM4T-v2 and Google's Translatotron.
  • Big-3 native captions are free. Zoom, Teams, and Google Meet all ship live caption translation in 40–100+ languages at zero marginal cost for paid tiers. Don't integrate a third-party tool unless you need interpreter mode, custom glossaries, or deeper latency control.
  • Interprefy, KUDO, and Interactio own "interpreter-in-the-loop." When accuracy matters (legal, diplomatic, medical consent), human interpreters still beat AI — and these platforms are the conduit.
  • Meta SeamlessM4T-v2 and DeepL Voice reset the open-source and API baselines in 2025. 101-language speech input, 96-language text output, 36-language voice-preserving speech output. Open weights. If you build your own, start here.
  • Compliance is the silent tiebreaker. HIPAA BAA, EU data residency, and "no vendor training on our audio" clauses are where AAA-brand bids are won or lost in 2026.

Why Fora Soft on real-time translation

We've shipped multilingual video features in products ranging from BrainCert's WebRTC virtual classroom (100K+ customers, 500M+ classroom minutes) to HIPAA-compliant telemedicine for US private practices to cross-border legal-deposition platforms. We've integrated every tool on this list at least once and migrated between them at least twice.

Pick real-time AI over human interpretation when your latency budget is under a second and roughly 90% accuracy is acceptable. The 2026 stacks on this list hit both.

This guide is a procurement playbook, not a listicle. We tell you when each tool wins, when it silently fails, and what the year-two cost curve looks like once your call volume scales.

Ready to add real-time translation to your video product?
Book a 30-minute architecture call with our team.
Book a call →

The two architectures you're choosing between in 2026

Cascaded pipeline (ASR → MT → TTS): The industry default since 2020. Audio hits a speech-recognition model (Whisper, Google Chirp, Deepgram Nova-3), the transcript is translated by an MT model (DeepL, Google Translate, NLLB), and an optional TTS layer speaks it out. Pros: mature, composable, interpretable. Cons: errors compound, latency floor ~800 ms–1.5 s, loses prosody and speaker identity.
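
The reason the cascaded floor sits near 800 ms is simply additive stage latency. A minimal budget sketch, with illustrative stage numbers (assumptions for a rough budget, not measured vendor benchmarks):

```python
# Illustrative latency budget for a cascaded ASR -> MT -> TTS pipeline.
# Stage latencies are assumptions, not measured vendor numbers.

CASCADED_STAGES_MS = {
    "audio_chunking": 100,    # buffering a partial utterance before ASR
    "asr_partial": 350,       # streaming recognition emits a stable partial
    "mt": 150,                # machine translation of the partial transcript
    "tts_first_byte": 250,    # optional TTS: time to first audio byte
    "network_overhead": 80,   # round trips between the three services
}

def end_to_end_ms(stages: dict, include_tts: bool = True) -> int:
    """Sum stage latencies; drop TTS for a captions-only track."""
    return sum(v for k, v in stages.items()
               if include_tts or k != "tts_first_byte")

if __name__ == "__main__":
    print(end_to_end_ms(CASCADED_STAGES_MS))                     # voice track
    print(end_to_end_ms(CASCADED_STAGES_MS, include_tts=False))  # caption track
```

Note that the caption track wins most of its budget simply by skipping TTS, which is why caption-only products quote lower latency than voice-level ones.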

End-to-end speech-to-speech (S2ST): One model ingests audio and emits audio in the target language. Meta's SeamlessM4T-v2 (late 2023), Google's AudioPaLM and Translatotron 2, and NVIDIA's voice-conversion work set the 2025–2026 frontier. Pros: 400–700 ms latency, voice preservation, better prosody. Cons: fewer language pairs, harder to debug, heavier infra.

In 2026, enterprise defaults are still cascaded (because auditability matters and interpreter-human fallback is cleaner). Consumer-facing products and AI assistants are shifting to end-to-end. If your product is both — run cascaded for the caption track and S2ST for the voice track.

1. DeepL Voice for Meetings

DeepL launched Voice for Meetings in late 2024 after a decade as the "quality leader" in text MT. In 2026 it supports real-time translation in 33 languages for live video, delivered as captions through Teams, Zoom, and Google Meet plugins. Latency is ~700–1,100 ms end-to-end; accuracy on European-language pairs is consistently 2–4 BLEU points above Google Translate in our internal tests.

One routing note: skip pivot-through-English whenever a direct language pair is available. Direct translation beats pivoting, especially for low-resource languages.

Pricing (2026): DeepL Business starts at $22.49/user/month (annual); Voice for Meetings is bundled in Business Pro (~$49/user/month) and DeepL Pro for Teams. Enterprise adds custom glossary, single sign-on, and EU data residency.

Pick DeepL Voice when you run European multilingual meetings, you care more about nuance than language breadth, and you want a fast plug-in rather than a full platform. Best for enterprise sales, consulting, and legal firms working across 5–10 European languages.

2. Interprefy — Interpreter platform with AI mode

Swiss-based Interprefy is the biggest remote-simultaneous-interpretation (RSI) platform by volume — used by the UN, World Economic Forum, and thousands of corporates. Its 2025 "AILO" mode adds AI-only live translation in 80+ languages for lower-stakes meetings, and routes to human interpreters for high-stakes sessions via the same interface. Integrates with Zoom, Teams, and via its own web client.

Pricing (2026): AI-only tier from ~$190/event (small meetings); hybrid AI+interpreter from ~$900/event; enterprise API access and dedicated interpreter rosters priced per engagement. Interprefy also offers a white-label option for ISVs.

Pick Interprefy when you need a mixed AI/human interpreter workflow in the same call — conferences, diplomatic meetings, legal proceedings, medical interviews. The handoff from AI to human interpreter is the feature nobody else ships this well.

3. KUDO AI — Purpose-built multilingual conferencing

KUDO started as an RSI platform and in 2024–2025 pivoted to "AI Meetings" — full-stack multilingual video conferencing with real-time translation, transcription, and summarization in 50+ languages. Latency ~1.2–1.8 s, with speaker-segmented transcripts, glossary import, and enterprise SSO/SCIM. The 2026 push is "KUDO AI for Salesforce" and deep ERP integrations.

UX priority: live captions, speaker tags, and language-switch indicators drive adoption more than raw accuracy.

Pricing (2026): Team tier from ~$15/user/month (up to 20 languages); Business tier ~$40/user/month (full 50+ languages, advanced analytics); Enterprise includes custom data residency, audit logs, and white-label embedding. Per-minute RSI hours bundled in higher tiers.

Pick KUDO when you want a dedicated multilingual conferencing product rather than bolting translation onto a general-purpose call app. Strongest for regulated industries, international associations, and cross-border sales teams.

4. Microsoft Teams with AI Translator

In 2026, Microsoft Teams includes Interpreter in Teams (GA February 2025) — real-time voice translation with speaker voice simulation in 9 languages, layered on top of the existing live captions in 40+ languages. Powered by Microsoft's ASR and translator stack, it's part of the Copilot for Microsoft 365 bundle.

Pricing (2026): Live captions and caption translation are bundled in Teams Essentials ($4/user/month) and up. Interpreter in Teams (voice-level) requires Copilot for M365 (~$30/user/month add-on). Enterprise agreements typically include both.

Pick Teams when your organization is already on Microsoft 365 and your translation need is meeting-room-scale rather than public-event-scale. Compliance posture (Microsoft Purview, data residency, audit) is best-in-class for regulated enterprises.
Picking between native Big-3 and a specialist platform?
Our team has integrated every option on this list — happy to walk through the trade-offs.
Book a call →

5. Zoom Workplace AI Companion — Translated Captions & Interpretation

Zoom's 2026 stack splits translation into two features: Translated Captions (free on paid plans, covering 11 languages bidirectionally + English-to-any for 35 languages) and Language Interpretation (human interpreters assigned to separate audio channels, available on Business plans and up). The AI Companion 2.0, which launched in late 2024, added real-time summary and action-item extraction in the translated language.

Common failure mode: ignoring privacy and data residency. GDPR, HIPAA, and regional rules apply to translation data just as they do to the raw call audio.

Pricing (2026): Translated Captions bundled on Zoom One Business ($21.99/user/month) and above. Language Interpretation is included but requires you to bring interpreters. AI Companion 2.0 is free with any paid Zoom plan — a genuine 2025 change.

Pick Zoom when your call platform is already Zoom and you need caption-level translation for a global customer base. The 2025 free-AI-Companion move makes it the highest-value-per-dollar of the Big 3 for mid-market buyers.

6. Google Meet with Live Translated Captions & Adaptive Audio

Google Meet in 2026 supports real-time caption translation in over 100 languages (world-leading breadth) plus the 2025 "Adaptive Audio" feature that detects multiple speakers in a single room and routes each to their own translation stream. Gemini-based note-taking in the translated language is included on Google Workspace Business Plus and Enterprise plans.

Pricing (2026): Translated captions are free on Google Workspace Business Standard ($14.40/user/month) and above. Enterprise Plus ($26.40/user/month) unlocks Gemini AI meeting summaries and translated note-taking. No per-minute translation metering.

Pick Google Meet when you need the broadest language coverage, you're already on Workspace, and you want Gemini's post-call summaries. Best for global education, NGOs, and marketing teams working across 30+ markets.

7. Meta SeamlessM4T-v2 — Open-weights foundation for custom builds

SeamlessM4T-v2 (released late 2023) is the reference open-source model for multilingual speech translation in 2026: 101 languages for speech input, 96 for text output, 36 for voice-preserving speech output. Its "Streaming" variant cuts end-to-end latency to ~2 seconds with good prosody preservation; the non-streaming "Expressive" mode approaches human-interpreter fidelity in A/B tests at ~4–5 s latency. Weights are released under a non-commercial CC BY-NC license.

Pricing (2026): Free weights. Inference infra: ~$2,200/month for a single A100 serving ~30 concurrent streams, or ~$0.035/minute on AWS Inferentia2. Commercial production use falls outside the non-commercial weights license, so budget for a separate licensing agreement with Meta.

Pick SeamlessM4T when you're building a custom video product, you need voice-preserving translation for brand or accessibility reasons, and you have an MLOps team. The right foundation for telehealth, ed-tech, and live-commerce apps at scale.

2026 comparison matrix

| Tool | Languages | Latency | Entry price (2026) | Best for |
|---|---|---|---|---|
| DeepL Voice | 33 | 700–1,100 ms | $22.49/user/mo | European enterprise meetings |
| Interprefy | 80+ | 1.5–2 s (AI) | ~$190/event | AI+human interpreter hybrid |
| KUDO AI | 50+ | 1.2–1.8 s | $15/user/mo | Dedicated multilingual conferencing |
| Microsoft Teams | 40+ (captions), 9 (voice) | 800 ms–1.5 s | $4/user/mo + Copilot | M365-centric enterprises |
| Zoom | 35+ (captions) | 1–1.5 s | $21.99/user/mo | Mid-market global teams |
| Google Meet | 100+ | 1–1.5 s | $14.40/user/mo | Global education, NGOs |
| Meta SeamlessM4T | 101 in / 36 voice-out | ~2 s streaming | Free weights + infra | Custom builds, voice preservation |

Decision tree — which tool for which use case

  • Already on Microsoft 365 → Teams + Copilot. No reason to bolt on a third tool unless you need interpreter handoff.
  • Already on Google Workspace → Google Meet. Best language breadth, Gemini post-call summaries.
  • European B2B sales / legal / consulting → DeepL Voice. Best nuance on EU language pairs.
  • High-stakes meeting with interpreter backup → Interprefy or KUDO.
  • Global education, NGO, 30+ markets → Google Meet.
  • Global marketing / cross-border sales on mid-market budget → Zoom (free AI Companion 2.0 is a meaningful 2025 win).
  • Building a dedicated video product (telehealth, ed-tech, live commerce) → SeamlessM4T-v2 on your infra, with a cascaded Whisper+DeepL fallback for languages SeamlessM4T doesn't cover well.
  • Regulated medical/legal with HIPAA/BAA → Teams (with M365 BAA) or a custom stack with controlled data residency. Avoid Google Translate free-tier and Zoom basic for PHI.

Build vs. buy — 2026 unit economics

Scenario: telehealth product, 10K monthly 30-min sessions, 8 languages, US + EU users, HIPAA and GDPR required.

  • Teams embedded: Impossible — Teams doesn't embed in third-party video products. Out.
  • DeepL Voice API + your own WebRTC: ~$0.10 per minute for the DeepL Voice API + $0.005/min infra = ~$31,500/month at 300K minutes. HIPAA coverage requires DeepL Enterprise (custom quote).
  • Google Cloud Speech-to-Text + Translation API + Text-to-Speech (cascaded build): ~$0.024 + $0.02 + $0.016 = $0.06/min = ~$18,000/month. HIPAA BAA available. You maintain the plumbing.
  • SeamlessM4T-v2 self-hosted on AWS Inferentia2: ~$0.035/min = ~$10,500/month + ~$60K one-time engineering + ~$3K/month MLOps burden. HIPAA under your own BAA. Breaks even against the Google build early in year 2, then runs ~25% cheaper per month.
  • Interprefy managed: ~$900/event hybrid × many events = typically $30–50K/month at this scale. Wrong tool for embedded product.
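
The bullets above reduce to a small cost model. A sketch that reproduces the arithmetic (the per-minute rates and fixed costs are the same rough estimates as in the bullets, not vendor quotes):

```python
# Rough monthly cost model for the telehealth scenario above:
# 10K sessions x 30 min = 300K minutes/month. Rates are the per-minute
# figures from the bullets; fixed costs are the same rough estimates.

MINUTES_PER_MONTH = 10_000 * 30  # 300K minutes

def monthly_cost(per_minute_usd: float, fixed_usd: float = 0.0,
                 minutes: int = MINUTES_PER_MONTH) -> float:
    return per_minute_usd * minutes + fixed_usd

options = {
    "deepl_voice_api": monthly_cost(0.10 + 0.005),          # API + own infra
    "google_cascaded": monthly_cost(0.024 + 0.02 + 0.016),  # STT + MT + TTS
    "seamless_self_hosted": monthly_cost(0.035, fixed_usd=3_000),  # incl. MLOps
}

if __name__ == "__main__":
    for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${cost:,.0f}/month")
```

Run it with your own rates before a procurement call; the ranking flips quickly once volume or fixed MLOps cost changes.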

At 10K sessions/month, Google's cascaded API wins year 1. SeamlessM4T wins year 2+. If HIPAA BAA complexity is high, stay with Google. If voice preservation is a differentiator (telehealth with same-voice clinician output), Seamless wins regardless.

Case study: cross-border telehealth at 12K MAU

A US-EU telehealth platform we work with serves Spanish, Portuguese, French, German, Italian, and Polish speakers. Clinicians speak English. Requirements: HIPAA + GDPR, sub-1 s caption latency, voice interpretation for consent sections, audit trail.

What we tried first (2023): Zoom SDK with built-in translation. Failed on HIPAA audit — Zoom's BAA didn't cover the translated caption data path.

What we moved to (2024): Custom WebRTC stack on LiveKit + Google Cloud Speech-to-Text + Translation API (cascaded) for live captions + Interprefy hybrid AI/human interpreter for consent sections + full audit log in our HIPAA-covered AWS account.

Outcome (2026): 12K MAU, 180K minutes/month, end-to-end caption latency ~850 ms, Interprefy interpreter escalation in 2.1% of sessions (consent sections only), HIPAA audit clean. Cost: $0.062/min on Google + $0.18/min on Interprefy escalations = ~$12,300/month total. Clinician satisfaction with translated captions: 91% (up from 64% on the Zoom-native stack).

Five pitfalls we've paid for

  1. Assuming vendor BAA covers translation data. It often doesn't. Verify in writing which data paths the BAA covers before you ship a HIPAA product.
  2. Single-speaker models in multi-speaker rooms. Most consumer tools assume one mic, one speaker. Three people in a room → word salad. Use Google Meet's Adaptive Audio, Zoom's speaker-aware captions, or per-speaker audio streams in your custom stack.
  3. No glossary for domain terms. Medical, legal, and technical calls have vocabularies generic MT mangles. DeepL, KUDO, and SeamlessM4T support custom glossaries — use them.
  4. Codec collapse on telephony. PSTN and VoIP codecs strip 30–40% of the acoustic detail needed for good ASR. If callers dial in by phone, accuracy craters. Warn users in the UI or route phone callers to a wider-band SIP trunk.
  5. No fallback. Cloud APIs go down. Vendor outages happen. Plan for a degraded-mode caption path (English-only, or a cached interpreter track) that kicks in inside 30 seconds.
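
Pitfall 5 can be reduced to a priority-ordered health check. A minimal sketch, assuming hypothetical health-check callables for each caption path (the path names and the 30-second budget are illustrative, not from any vendor SDK):

```python
# Minimal degraded-mode caption-path selector (pitfall 5).
# `health_checks` maps path name -> callable returning True if healthy.
# Path names and the 30 s deadline are illustrative assumptions.
import time

PATHS_IN_ORDER = ["translated_captions", "english_only_captions", "cached_interpreter"]

def pick_caption_path(health_checks: dict, deadline_s: float = 30.0) -> str:
    """Return the best healthy path, checking in priority order.

    Selection must finish inside `deadline_s` so degraded captions
    appear within the 30-second budget described above.
    """
    start = time.monotonic()
    for path in PATHS_IN_ORDER:
        if time.monotonic() - start > deadline_s:
            break
        check = health_checks.get(path)
        try:
            if check is not None and check():
                return path
        except Exception:
            continue  # a raising health check means the path is down
    return "captions_off"  # last resort: degrade gracefully, don't crash

if __name__ == "__main__":
    # Simulate the primary translation API being down.
    checks = {
        "translated_captions": lambda: False,
        "english_only_captions": lambda: True,
    }
    print(pick_caption_path(checks))
```
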
Adding real-time translation to your video product?
We ship multilingual video features in WebRTC products for telehealth, ed-tech, and live commerce.
Let's design the right architecture for your use case.
Book a 30-minute consult →

FAQ

What's the minimum latency for "real-time" translation in 2026?

Captions: under 1 second is the minimum for natural conversation flow. Voice-level translation: 400–700 ms for end-to-end S2ST models, 1–2 s for cascaded pipelines. Above 2 s, participants start talking over each other.
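
Those bands can be encoded as a simple monitoring guardrail. A minimal sketch that collapses the caption and voice thresholds into one classifier (a simplification of the answer above):

```python
def conversation_quality(latency_ms: int) -> str:
    """Map end-to-end translation latency to the bands described above.

    Simplified: it ignores the caption/voice distinction and uses
    700 ms (S2ST territory) and 2 s (cascaded ceiling) as boundaries.
    """
    if latency_ms <= 700:
        return "natural"    # end-to-end S2ST territory
    if latency_ms <= 2000:
        return "usable"     # typical cascaded pipeline
    return "talk-over"      # participants start talking over each other
```
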

Can AI translation replace human interpreters yet?

For general meetings, sales calls, and most internal communication — yes. For legal proceedings, diplomatic meetings, medical consent, and anything where misinterpretation has real cost — no. Use hybrid (Interprefy or KUDO) that escalates to humans in high-stakes sections.

Which platforms support HIPAA-compliant translation?

Microsoft Teams (with M365 BAA covering the translation data path), Google Workspace Enterprise (BAA available), and custom stacks on Google Cloud Speech/Translate or self-hosted SeamlessM4T under your own BAA. Zoom basic plans do not cover translation data under their BAA — check the fine print.

How many languages can I actually get good quality in?

Top 20 languages (major EU, Chinese, Japanese, Korean, Arabic, Spanish LatAm, Portuguese BR, Hindi): very good. Next 30 (major African, Southeast Asian, Central Asian): usable. Long-tail languages: poor to unusable for voice, acceptable for text. Google Meet has the widest coverage; DeepL has the deepest on EU pairs.

Can I integrate real-time translation into my own app?

Yes — via Google Cloud's Speech-to-Text + Translation API, DeepL API, Microsoft Speech Services, or self-hosted SeamlessM4T-v2. Budget 6–12 weeks for a production-grade cascaded pipeline, plus MLOps for monitoring and fallback. Fora Soft does this regularly.
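
Whichever vendor you pick, it pays to hide the three cascaded stages behind one interface so they can be swapped later. A minimal sketch with pluggable callables; the stage functions below are toy stand-ins, since real integrations need the vendor SDK clients and credentials:

```python
# Sketch of a vendor-agnostic cascaded pipeline. The three stage callables
# are stand-ins: in production they would wrap Google STT, the DeepL API,
# Microsoft Speech Services, or a self-hosted SeamlessM4T endpoint.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CascadedTranslator:
    asr: Callable[[bytes], str]        # audio chunk -> transcript
    mt: Callable[[str, str], str]      # (text, target_lang) -> translation
    tts: Optional[Callable[[str], bytes]] = None  # translation -> audio

    def translate_chunk(self, audio: bytes, target_lang: str) -> dict:
        transcript = self.asr(audio)
        translated = self.mt(transcript, target_lang)
        spoken = self.tts(translated) if self.tts else None
        return {"caption": translated, "audio": spoken}

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without a cloud account.
    pipeline = CascadedTranslator(
        asr=lambda audio: "hello world",
        mt=lambda text, lang: f"[{lang}] {text}",
        tts=None,  # captions-only track
    )
    print(pipeline.translate_chunk(b"\x00\x01", "es"))
```

Keeping the stages behind one interface is also what makes the migration stories in this guide (we've done at least two) a weeks-long job rather than a rewrite.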

Does voice preservation work in production in 2026?

Yes, for 36 languages via SeamlessM4T-v2's Expressive mode, and in Microsoft Teams Interpreter for 9 languages. Quality is good enough for most use cases but not perfect — accents and prosody are preserved imperfectly in long-form speech. Use for short utterances and customer-facing moments, not for 60-minute monologues.

What's the typical cost at 10K monthly users?

Big-3 platforms: included in per-seat pricing, often effectively free at enterprise scale. Custom build on Google Cloud: ~$15–25K/month at 10K active users with 30 min average session. Self-hosted SeamlessM4T: ~$10K/month at the same scale, after year-1 engineering investment.

Sum up

In 2026, real-time multilingual translation in video calls is no longer a technology problem — it's a product and compliance problem. The Big 3 (Teams, Zoom, Google Meet) ship native caption translation in 35–100+ languages included with paid plans. DeepL Voice beats them on EU-language nuance. Interprefy and KUDO win when interpreter handoff matters. Meta SeamlessM4T-v2 is the right foundation for custom voice-preserving builds.

The differentiators that decide procurement in 2026 are latency under 1 second, language breadth beyond Europe, compliance (HIPAA, GDPR, data residency), speaker-aware audio in group rooms, and graceful fallback to human interpreters. Accuracy is converging — everyone else is close to DeepL on text and close to SeamlessM4T on voice. The production differentiator is everything that surrounds the model.

Let's design your multilingual video architecture
30 minutes with our video-stack lead. We've integrated every tool on this list at least once.
Book a call →

Comparison matrix: build, buy, hybrid, or open-source for multilingual video calls

A quick decision grid for the four typical 2026 paths. Pick the row that matches your team size, regulatory surface, and time-to-value target — not the row that sounds most ambitious.

| Approach | Best for | Build effort | Time-to-value | Risk |
|---|---|---|---|---|
| Buy off-the-shelf SaaS | Teams < 10 engineers, generic use case | Low (1–2 weeks) | 1–2 weeks | Vendor lock-in, customization limits |
| Hybrid (SaaS + custom layer) | Mid-market, mixed use cases | Medium (1–2 months) | 1–3 months | Integration debt, two systems to maintain |
| Build in-house (modern stack) | Enterprise, unique data or compliance needs | High (3–6 months) | 6–12 months | Engineering velocity, talent retention |
| Open-source self-hosted | Cost-sensitive, technical team | High (2–4 months) | 3–6 months | Operational burden, security patching |

The KPIs to track before and after shipping

Outcome metrics, not vanity counters, drive every multilingual video-call decision. Track adoption rate (week over week), latency p95, accuracy and quality drift (per-week trend), retention (D1, D7, D30), and revenue impact attributed via a clean A/B test against a hold-out group. Most teams skip the hold-out and then cannot explain whether the lift is real.
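
Latency p95 in particular is easy to get wrong, because averaging hides the tail users actually feel. A minimal nearest-rank sketch:

```python
# Nearest-rank p95: the value at the 95th-percentile position of the
# sorted samples. Averages hide the slow tail that users notice.
import math

def p95(samples_ms: list) -> float:
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

if __name__ == "__main__":
    # 18 fast captions and two slow outliers: the mean sits near 1 s,
    # while p95 surfaces the tail.
    samples = [850.0] * 18 + [2900.0, 3000.0]
    print(p95(samples))
```
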
