Speech-to-speech translation takes spoken audio in one language and renders it as captions or synthesized speech in another, in real time. A cascaded (chained) system converts the audio to text, translates the text, and synthesizes the result; an end-to-end model maps source speech to the target directly. In 2026, four systems lead enterprise procurement: DeepL Voice, KUDO AI Speech Translator, Interprefy Aivia, and Meta SeamlessM4T-v2. They differ in what they publish, in latency, and in cost. The differences matter more than the marketing suggests.

Key takeaways

Public accuracy data is asymmetric. Meta SeamlessM4T-v2 publishes rigorous FLEURS and CVSS benchmarks. DeepL publishes selected text-translation BLEU scores. KUDO and Interprefy publish marketing-tier claims without methodology.

First-chunk latency, not total latency, is the user-experience metric. Below 800 milliseconds feels live. Above 2 seconds, listeners talk over the translation.

Cost per source-audio minute spans 30x across the field, from roughly $0.04 self-hosted to $1.25 on event-platform tiers.

Cascaded pipelines still beat end-to-end on debuggability and compliance. End-to-end wins demos. Cascades win procurement.

The fastest path to a vendor decision is a private pilot on your own audio. Public data narrows the shortlist. Private data picks the winner.

Why this comparison exists

Buyers searching “speech to speech translation” in 2026 land on three kinds of pages: vendor marketing, academic papers, and editorial roundups. Nothing in the public domain consolidates the four leading systems into one view. The vendor marketing is not falsifiable. The academic papers cover one or two of the systems and miss the rest. The editorial roundups omit numbers entirely.

This article fills that gap. Each section reads what each vendor and each academic source has actually published, flags the methodology gaps, and tells you what the public record will and will not support as a procurement claim.

If you have not yet decided whether AI fits your use case at all, our interpreter vs translator vs AI decision tree walks through that call. For the architecture context, see the real-time speech translation pillar.

Need a private evaluation on your own audio?

30 minutes with a senior engineer who has shipped real-time translation on WebRTC, LiveKit, and custom SFUs across 250+ Fora Soft video projects since 2005.


Methodology: what we synthesized

For each of the four systems, we pulled every publicly available figure on accuracy, latency, and cost from named sources: vendor pages and tech blogs (DeepL, KUDO, Interprefy, Meta AI), academic papers (the SeamlessM4T paper and the Seamless streaming paper on arXiv), Hugging Face model cards, the Open ASR Leaderboard, and industry coverage from Slator and Multilingual.com.

Three gaps shape what the public record can support:

Methodology mismatches. Vendors report on different test sets. SeamlessM4T reports on FLEURS and CVSS. DeepL publishes selected pair-wise quality scores on WMT-derived benchmarks. KUDO and Interprefy publish demonstration metrics with no standardized test set. Each number below is attributed to its source and is comparable only within that source's own test set, not across vendors.

Metric mismatches. Word error rate is the standard ASR metric. BLEU, chrF, and COMET are text-translation metrics. ASR-BLEU is the BLEU score computed on the output of an ASR pass over the translated audio, used in the SeamlessM4T paper to evaluate end-to-end speech-to-speech translation. Different metrics measure different things, and a “best WER” comparison across vendors who report different metrics is incoherent.
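To make the WER definition concrete, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. The function name and the lowercase-and-split normalization are illustrative assumptions; published benchmarks apply their own text normalization, which is one more reason figures from different sources do not line up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: naive normalization (lowercase, whitespace split).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein, rolling row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the meeting starts at nine", "the meeting start at nine"))  # 0.2
```

WER scores a transcript against a same-language reference, so it says nothing about translation quality. That part needs BLEU, chrF, COMET, or ASR-BLEU, each computed against a target-language reference.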

Vendor-self-reported claims. Most marketing-page accuracy claims come without test-set disclosure, without confidence intervals, and without third-party verification. We include these where they exist for completeness but flag them clearly.

Nothing in the public record tells you how these four systems perform on your audio, in your language pair, with your glossary. Run a private pilot on 2 to 4 hours of representative audio before committing.

Accuracy: what each vendor publishes

Meta SeamlessM4T-v2

Meta AI published two foundational papers on this model family: the original SeamlessM4T paper (arXiv:2308.11596, August 2023) reporting FLEURS ASR-BLEU and CVSS BLEU scores, and the Seamless paper (arXiv:2312.05187, December 2023) covering streaming and expressive variants with prosody preservation MOS scores. The Hugging Face model card for facebook/seamless-m4t-v2-large publishes FLEURS ASR word-error-rate numbers per language.

What the public record supports. SeamlessM4T-v2 is the most reproducibly benchmarked of the four systems. The paper publishes both metrics and the code. A reader with a GPU can rerun the evaluation. That alone is the strongest single argument for shortlisting it.

What the public record does not cover. FLEURS and CVSS lean toward studio audio. Performance on real conference audio with echo, audience noise, and code-switching is not measured in any public source we found.

DeepL Voice

DeepL Voice launched in late 2024 as the company's streaming speech-to-speech product. The product page describes “real-time fluency” and “near-human quality” on supported pairs but does not publish a WER table.

DeepL’s text-translation track record is the strongest among the four. Public benchmarks on WMT test sets place DeepL Translate at the top of the field for major European pairs (COMET and BLEU scores tracked in their Tech Blog research posts since 2023). These numbers do not transfer cleanly to streaming speech, which adds ASR error on top of MT error, but they point to a strong MT stage in the Voice cascade.

Slator’s coverage of DeepL Voice in 2024 and 2025 quotes selected listener-preference numbers from DeepL-commissioned studies. The methodology in those studies is not fully disclosed.

What the public record supports. European-pair quality is likely the strongest of the four, on the strength of the text-translation track record. Clean streaming API, transparent pricing.

What the public record does not cover. Voice-product WER on standardized test sets. No comparable numbers exist.

KUDO AI Speech Translator

KUDO has published marketing claims about “in-test accuracy comparable to human interpreters” in press releases and conference talks. The claims do not disclose the test set, the language pair, the listener methodology, or the comparison group.

Slator’s coverage of KUDO from 2023 to 2025 describes the product positioning and pricing model but contains no independently verified accuracy numbers.

What the public record supports. KUDO’s product is widely deployed at large-scale RSI events. That is itself a signal of operational reliability and listener acceptability in moderated professional settings.

What the public record does not cover. Any standardized accuracy metric on a public test set. The data does not exist in the public record.

Interprefy Aivia

Interprefy launched Aivia in 2023 with marketing claims about accuracy improvements over previous generations of the platform. Specific numbers were not released. Slator’s coverage from 2023 to 2025 describes the product but contains no verified accuracy numbers.

Interprefy publishes selected customer case studies with NPS-style satisfaction scores, not WER or BLEU.

What the public record supports. Interprefy serves a large RSI customer base. The compliance posture (HIPAA, EU data residency) is well-documented.

What the public record does not cover. Aivia’s WER, BLEU, or comparable accuracy metric on a standardized test set.

Consolidated accuracy view

| System | Best public accuracy artifact | Comparable to other systems? |
| --- | --- | --- |
| Meta SeamlessM4T-v2 | FLEURS ASR-BLEU & WER, CVSS BLEU | Yes, to other FLEURS / CVSS-reported models |
| DeepL Voice | Text-translation BLEU/COMET on WMT | Partially. Text only, no ASR component |
| KUDO AI Speech Translator | Vendor-claim “comparable to human” | No. No methodology disclosed |
| Interprefy Aivia | Vendor-claim accuracy improvements | No. No methodology disclosed |

The asymmetry is itself the finding. Two of the four leading vendors do not publish independently verifiable accuracy numbers. A buyer treating the four as equivalent on the basis of marketing claims is comparing two known quantities to two unknowns.

Adjacent benchmarks worth knowing

These are not part of the four-system comparison but anchor field-wide expectations:

  • Whisper-large-v3 reaches 8% to 12% WER on FLEURS for high-resource languages, 20% to 35% on low-resource languages like Tamil, Swahili, and Kazakh. The Open ASR Leaderboard on Hugging Face tracks this in near-real-time.
  • Google Cloud Chirp-2 leads on Asian-language coverage (Vietnamese, Thai, Indonesian) and on long-context translation quality.
  • Microsoft Azure Speech Translation offers a single-endpoint cascaded ASR + MT + neural TTS with voice selection and Personal Voice for cloning.
  • Deepgram Nova-3 reports sub-5% WER on English call-center audio and 6% to 9% on noisy medical audio.
  • OpenAI GPT-4o Realtime delivers roughly 300 ms first-chunk latency with strong accented-English performance.

Latency: vendor claims vs production reality

Every vendor quotes “real-time” or “near-instant” without specifying which point in the pipeline. The number that actually predicts user satisfaction is first-chunk latency, the time from when the speaker starts talking to when the listener hears the first translated word.

| System | First-chunk claim | Field-observed 2026 |
| --- | --- | --- |
| Meta SeamlessM4T-v2 (streaming) | 300 to 600 ms (paper) | 800 to 1,500 ms incl. infra |
| DeepL Voice | “Real-time,” sub-second implied | Not independently measured |
| OpenAI GPT-4o Realtime (reference) | ~300 ms | 300 to 500 ms |
| Deepgram Nova-3 (reference) | 300 to 500 ms ASR | 600 to 800 ms full cascade |
| Interprefy Aivia | “Comparable to human” | 1 to 2 seconds in field reports |
| KUDO AI Speech Translator | “Near-instant” | 2 to 4 seconds in field reports |

What the standards say

The AIIC technical standards for remote simultaneous interpretation cite 3 to 5 seconds end-to-end as the upper bound for sustainable RSI. ITU-T G.114 recommends one-way delay below 150 ms for unimpaired conversation, with up to 400 ms acceptable but degraded. Human-perception research on conversational turn-taking shows that anything past 1.2 seconds noticeably affects audience engagement.

Below 800 ms first-chunk feels live. Between 800 ms and 1.5 seconds works for keynotes and lectures. Above 2 seconds, listeners start to talk over the interpretation and the experience collapses.
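The bands above are easy to apply to pilot measurements. The sketch below assumes you log two timestamps per utterance: when the speaker starts talking and when the listener hears the first translated audio. The function names and sample values are hypothetical, and the 1.5 to 2 second range is labeled borderline because the bands above do not characterize it.

```python
from statistics import median


def first_chunk_latency_ms(speech_start_ms: int, first_translated_audio_ms: int) -> int:
    """Time from speaker onset to the first translated word reaching the listener."""
    return first_translated_audio_ms - speech_start_ms


def latency_band(latency_ms: float) -> str:
    """Map a first-chunk latency onto the perception bands described in the text."""
    if latency_ms < 800:
        return "feels live"
    if latency_ms <= 1500:
        return "works for keynotes and lectures"
    if latency_ms <= 2000:
        return "borderline (not characterized in the text)"
    return "listeners talk over the translation"


# Hypothetical pilot log: (speech start, first translated audio), in milliseconds.
samples = [(0, 620), (10_000, 11_150), (20_000, 22_400)]
latencies = [first_chunk_latency_ms(start, first) for start, first in samples]

print(latencies)                        # [620, 1150, 2400]
print(latency_band(median(latencies)))  # works for keynotes and lectures
```

A median hides the tail. In a real pilot it is worth classifying the slowest utterances as well, since those are the moments listeners remember.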

Hidden latency costs

The vendor number is rarely the number a user experiences. Almost every production deployment carries a WebRTC jitter buffer (80 to 200 ms), a backend queue (50 to 100 ms), and TLS round trips between server and vendor (20 to 80 ms per hop, often across two or three hops). Budget 500 to 700 ms of overhead on top of any vendor figure, and measure end-to-end from real client devices in your target regions before trusting any marketing claim.
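The overhead arithmetic can be checked directly. This sketch sums the itemized components from the paragraph above for a given hop count, then applies the 500 to 700 ms budget to a vendor figure. The names are illustrative. The itemized parts sum to at most about 540 ms at three hops, so the budget appears to sit at the worst case plus some headroom; that reading is an interpretation, not something the sources state.

```python
# Component ranges in milliseconds, taken from the paragraph above.
FIXED_OVERHEAD_MS = {
    "webrtc_jitter_buffer": (80, 200),
    "backend_queue": (50, 100),
}
TLS_PER_HOP_MS = (20, 80)
BUDGET_MS = (500, 700)  # the recommended planning budget


def overhead_range_ms(hops: int) -> tuple[int, int]:
    """Sum of itemized overhead for a given number of server-to-vendor hops."""
    low = sum(lo for lo, _ in FIXED_OVERHEAD_MS.values()) + hops * TLS_PER_HOP_MS[0]
    high = sum(hi for _, hi in FIXED_OVERHEAD_MS.values()) + hops * TLS_PER_HOP_MS[1]
    return low, high


def budgeted_latency_ms(vendor_range_ms: tuple[int, int]) -> tuple[int, int]:
    """Vendor-quoted first-chunk range plus the planning budget."""
    return vendor_range_ms[0] + BUDGET_MS[0], vendor_range_ms[1] + BUDGET_MS[1]


print(overhead_range_ms(2))             # (170, 460)
print(overhead_range_ms(3))             # (190, 540)
print(budgeted_latency_ms((300, 600)))  # (800, 1300)
```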

Cost per minute: the cleanest dimension

Pricing data is the least ambiguous of the three dimensions. Vendors publish list prices on product pages (DeepL, OpenAI, Deepgram). Self-hosted models price out to GPU rental (SeamlessM4T). Event-tier vendors publish list prices for standard packages, with negotiation expected (KUDO, Interprefy).

| System | List price per minute of source audio | Notes |
| --- | --- | --- |
| DeepL Voice (API) | ~$0.10 to $0.20 | Annual commits drop to ~$0.08 |
| KUDO AI Speech Translator | ~$0.80 to $1.25 | Event-tier list; per-attendee fees extra |
| Interprefy Aivia | ~$0.60 to $1.00 | Event-tier list; comparable structure |
| Meta SeamlessM4T-v2 (self-hosted) | ~$0.02 to $0.05 | AWS A100 on-demand, amortized |

Where the hidden costs sit

KUDO and Interprefy bundle interpreter-management features and event-platform UX into the price. If you are running standalone events, that bundle is justified. If you are embedding translation into your own product, you are paying for features users never see.

Per-language fees are the most common surprise. Most vendors charge per language pair, not per session. A four-language event is four bills. Some vendors charge for the listener tier separately, on a per-attendee-hour basis. Overage charges on multi-day commits run 1.5x to 3x the contracted rate.
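A rough bill estimator shows how these fees compound. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the commitment is assumed to apply per language pair, and real contracts will define overage and listener fees differently.

```python
from typing import Optional


def event_bill(
    source_minutes: float,
    price_per_minute: float,
    language_pairs: int,
    committed_minutes: Optional[float] = None,  # assumption: commitment is per pair
    overage_multiplier: float = 1.5,            # the text cites 1.5x to 3x
    attendee_hours: float = 0.0,
    attendee_hour_fee: float = 0.0,             # hypothetical listener-tier fee
) -> float:
    """Estimate an event bill when the vendor charges per language pair."""
    if committed_minutes is None:
        committed_minutes = source_minutes
    in_commit = min(source_minutes, committed_minutes)
    overage = max(0.0, source_minutes - committed_minutes)
    per_pair = (in_commit + overage * overage_multiplier) * price_per_minute
    return per_pair * language_pairs + attendee_hours * attendee_hour_fee


# An 8-hour day (480 source minutes) at $1.00 per minute.
print(event_bill(480, 1.00, language_pairs=1))  # 480.0
print(event_bill(480, 1.00, language_pairs=4))  # 1920.0
# Running 2 hours over a 480-minute commitment at a 2x overage rate.
print(event_bill(600, 1.00, language_pairs=1, committed_minutes=480, overage_multiplier=2.0))  # 720.0
```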

Self-hosted economics

A single A100 80GB instance at AWS on-demand pricing serves roughly 60 to 80 concurrent streams of SeamlessM4T-v2. The fixed monthly cost is the same whether you run one stream or a hundred, so the per-minute number drops as utilization rises. Below 60 concurrent streams sustained, cloud APIs win on total cost. Between 60 and 150, it is a judgment call, usually decided by compliance rather than dollars. Above 150 concurrent streams sustained, self-hosting pays back in 6 to 9 months, assuming the DevOps capacity for GPU fleet management is in place.
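The utilization effect and the build-versus-buy thresholds can be sketched in a few lines. The hourly GPU rate below is a placeholder, not a quoted AWS price, and the per-GPU capacity default uses the conservative end of the 60 to 80 stream range. The cost covers GPU rental only and leaves out the operational overhead of running the fleet, so treat it as a lower bound.

```python
import math


def gpu_cost_per_stream_minute(
    gpu_hourly_usd: float,
    avg_concurrent_streams: float,
    streams_per_gpu: int = 60,  # conservative end of the 60 to 80 range
) -> float:
    """GPU rental cost per stream-minute; fixed cost is spread over active streams."""
    if avg_concurrent_streams <= 0:
        raise ValueError("average concurrent streams must be positive")
    gpus = max(1, math.ceil(avg_concurrent_streams / streams_per_gpu))
    return gpus * gpu_hourly_usd / 60 / avg_concurrent_streams


def build_vs_buy(sustained_streams: int) -> str:
    """Thresholds as stated in the text."""
    if sustained_streams < 60:
        return "cloud API"
    if sustained_streams <= 150:
        return "judgment call (usually decided by compliance)"
    return "self-host"


HYPOTHETICAL_GPU_HOURLY_USD = 6.0  # placeholder for illustration only

for streams in (5, 20, 60, 200):
    cost = gpu_cost_per_stream_minute(HYPOTHETICAL_GPU_HOURLY_USD, streams)
    print(f"{streams:>3} streams: ${cost:.4f} per stream-minute -> {build_vs_buy(streams)}")
```

The per-minute figure falls as streams fill a GPU and steps up whenever another GPU is added. That is why sustained concurrency, not peak, drives the decision.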

Custom Whisper plus GPT-4o stack

A custom build on Whisper-large-v3 for ASR, GPT-4o for translation, and ElevenLabs Multilingual v2 for TTS, running through a LiveKit agent, lands at roughly $0.08 to $0.14 per source-audio minute at moderate volume. That is competitive with DeepL Voice on cost, and it gives full control over the model swap path. Fora Soft has shipped this pattern across multiple production deployments since 2024.
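The shape of such a cascade, and why it is easy to debug and swap, can be sketched with plain callables. The three stage functions below are hypothetical stand-ins, not the Whisper, GPT-4o, ElevenLabs, or LiveKit APIs. A real build would wrap each vendor SDK behind the same signatures.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CascadeResult:
    transcript: str    # intermediate ASR text, inspectable for debugging
    translation: str   # intermediate MT text, inspectable for compliance review
    audio: bytes
    stage_ms: dict[str, float] = field(default_factory=dict)


def run_cascade(
    chunk: bytes,
    asr: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> CascadeResult:
    """Run ASR -> MT -> TTS on one audio chunk, timing each stage separately."""
    stage_ms: dict[str, float] = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        stage_ms[name] = (time.perf_counter() - start) * 1000
        return out

    transcript = timed("asr", asr, chunk)
    translation = timed("mt", translate, transcript)
    audio = timed("tts", synthesize, translation)
    return CascadeResult(transcript, translation, audio, stage_ms)


# Hypothetical stubs so the sketch runs without any vendor SDK.
def fake_asr(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_translate(text: str) -> str:
    return {"hello": "bonjour"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_cascade(b"hello", fake_asr, fake_translate, fake_tts)
print(result.transcript, "->", result.translation)  # hello -> bonjour
print(sorted(result.stage_ms))                      # ['asr', 'mt', 'tts']
```

Because each stage is an argument, replacing one model means passing a different callable, and the per-stage timings show which stage is consuming the latency budget.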

System-by-system: strengths, weaknesses, and where to deploy

DeepL Voice

Strengths. Best-established text-translation track record among the four, transferring to a strong MT stage in the Voice cascade. Clean streaming API. Transparent and competitive pricing. Documented GDPR compliance with EU data residency.

Weaknesses. Smaller language roster than Google or SeamlessM4T. Voice-product WER on standardized test sets is not published. No voice cloning in the production tier as of mid-2026. Limited custom-glossary support for live streams.

Reach for DeepL Voice when: dominant pairs are major European, clean GDPR compliance matters to the buyer, and a single vendor relationship matters.

KUDO AI Speech Translator

Strengths. Strong end-user experience. Tight integration with the KUDO RSI platform, making hybrid human-plus-AI workflows one click away. Strong event-management features. HIPAA-grade infrastructure available on enterprise tier.

Weaknesses. No published WER or BLEU on standardized test sets. Field reports place latency at the high end of the band. Event-platform pricing makes embedding into a custom product awkward.

Reach for KUDO when: running events end-to-end on the KUDO platform already, the buyer is paying for the platform rather than the engine, and the latency band is acceptable for the room.

Interprefy Aivia

Strengths. Mature compliance story including HIPAA and a documented EU data-residency option. Strong RSI customer base means the vendor understands enterprise-event use cases. Clean Aivia API with good documentation.

Weaknesses. No published WER or BLEU on standardized test sets. Latency under load is reported as variable in field reports. Smaller language roster than DeepL or Google.

Reach for Interprefy when: a single vendor for human RSI plus AI fallback is the goal, events are predominantly European, and the latency variance is acceptable for the UX.

Meta SeamlessM4T-v2

Strengths. Most rigorous public benchmarks of the four. Lowest published latency of the four, by a clear margin. Open weights mean full data residency and no per-minute marginal cost past the GPU bill. Voice cloning through SeamlessExpressive is comparable to ElevenLabs at zero incremental cost. Coverage per the model card is roughly 100 speech-input languages, 35 speech-output languages, and close to 100 text languages.

Weaknesses. You own the deployment: GPU operations, model serving, scaling, monitoring, and fallback. New language support lags closed APIs by months. The “free” model carries six-figure annual operational overhead at meaningful scale.

Reach for SeamlessM4T when: volume justifies a couple of A100s, data residency is mandatory, or voice-preservation at scale is required and ElevenLabs is not an option.

How to read these numbers if you’re choosing a vendor

The right system depends on the room.

Internal corporate webinars (1 to 4 languages, low stakes). DeepL Voice on a single API endpoint, with a custom glossary on product names. Cost per minute is the floor for this category. Accuracy is fine. If GDPR matters, this is the cleanest path among the closed-source options.

Paid public events (high stakes, audience pays attention). Interprefy or KUDO. The bundled event-platform features earn their keep at this tier, and the buyer is paying for the platform, not the engine. Run the pilot on the actual room with actual moderators and time-trial the latency. Do not rely on vendor accuracy claims for the procurement case; do rely on customer references.

Multi-day conferences (cost compounds). SeamlessM4T-v2 self-hosted, or hybrid: SeamlessM4T on the dominant pair, a closed-source vendor on the long-tail pairs. The amortization math is what changes at this scale.

24/7 product-embedded translation (build vs buy). Below 60 concurrent streams sustained, cloud APIs win on total cost. Between 60 and 150, it is a judgment call, usually decided by compliance. Above 150 concurrent streams sustained, self-hosting pays back in 6 to 9 months.

For the framework on when to use AI at all versus a human interpreter, see our decision tree. For the engineering playbook on the four architectures, voice cloning consent, compliance posture, integration patterns, and the five pitfalls that derail interpretation projects, see our top 5 AI tools for real-time language interpretation.

How Fora Soft fits in

We build custom real-time translation and interpretation systems. Since 2005, we have delivered 250+ video projects, a 20+ year track record of shipping production-grade real-time video and AI infrastructure. The translation work includes Translinguist (multilingual event platform), Volo (real-time translation for healthcare and education), and Rafiky (remote simultaneous interpretation for international conferences).

If you are scoping a build or migrating off a vendor that is not keeping up, book a 30-minute call with a senior engineer. Bring an architecture diagram, a vendor quote, or just a napkin sketch. For the educational framework on interpreter vs translator vs AI, see our decision tree. For the architectural reference, see our real-time speech translation pillar. For multilingual video-call design patterns, see our multilingual translation in video calls reference. Work directly with the language-interpretation team.

FAQ

What is speech-to-speech translation?

Speech-to-speech translation is the real-time process of converting spoken audio in one language into spoken audio or captions in another language. Modern systems use either a cascaded architecture (speech-to-text, then machine translation, then text-to-speech) or an end-to-end model. Latency in 2026 ranges from 300 milliseconds to 4 seconds depending on the system.

What is the most accurate real-time speech translation system in 2026?

The public data does not let buyers rank the four leading systems on accuracy directly. Meta SeamlessM4T-v2 publishes the most rigorous benchmarks (FLEURS ASR-BLEU and CVSS in the SeamlessM4T paper). DeepL publishes only marketing-tier claims for the Voice product, although it has a strong text-translation track record on WMT. KUDO and Interprefy do not publish standardized accuracy metrics. Run a private evaluation on your own audio before committing.

How accurate is real-time speech translation in 2026?

On standardized test sets like FLEURS, the best public numbers from systems like SeamlessM4T-v2 and Whisper-large-v3 land at 8% to 12% WER on high-resource languages and 20% to 35% on low-resource languages. Real conference audio with echo and overlap is poorly represented in published benchmarks. Without glossary tuning, expect 1.5x to 2x the published WER on real-world audio in your domain.

Why is it so hard to compare DeepL Voice, KUDO, Interprefy, and SeamlessM4T head-to-head?

The four vendors do not report on the same test sets, do not use the same metrics, and do not all publish independently verifiable accuracy numbers. SeamlessM4T-v2 publishes on FLEURS and CVSS. DeepL publishes selected text-translation scores. KUDO and Interprefy publish marketing-tier claims without methodology. A direct comparison requires running all four on your own audio with your own ground truth.

Can AI replace human simultaneous interpreters?

For low-stakes, internal, real-time speech, often yes. For legal proceedings, medical interactions of record, or high-stakes diplomatic and conference settings, no. Our interpreter vs translator vs AI decision tree walks through that call.

How fast is “fast enough” for live translation?

First-chunk latency under 800 milliseconds feels live to listeners. Between 800 ms and 1.5 seconds works for keynotes and lectures. Above 2 seconds, participants talk over the interpretation and the experience collapses. AIIC standards for sustainable RSI cite 3 to 5 seconds end-to-end as the absolute upper bound.

Is AI interpretation HIPAA-compliant?

Google Cloud, Azure, AWS, and Deepgram all offer BAAs. DeepL and Interprefy offer HIPAA tiers. OpenAI offers limited BAA on specific tiers. For maximum safety, self-hosted SeamlessM4T or Whisper inside your own HIPAA-compliant VPC removes the vendor from the trust boundary.

How much does adding real-time interpretation cost per minute?

Budget $0.04 to $0.20 per translated minute on cloud APIs at moderate volume, about $0.04 self-hosted on saturated GPUs, and $0.60 to $1.25 on event-vendor platforms that bundle interpreter management and listener UX.

Conclusion

Real-time speech translation in 2026 is a four-system shortlist for buyers, a four-architecture choice for builders, a strict latency budget, and a consent story for voice cloning.

Meta SeamlessM4T-v2 publishes the most rigorous public benchmarks and gives full data residency in exchange for owning the deployment. DeepL Voice publishes a strong text-translation track record and clean pricing; the Voice-specific accuracy numbers do not exist in the public record. KUDO and Interprefy publish marketing-tier claims without standardized methodology; their value to a buyer is the event-platform bundle, not the engine.

The single most useful thing a buyer can do with a vendor shortlist is run a two-week private pilot on representative audio. Public data narrows the shortlist. Private data picks the winner.

Published May 2026. Updated quarterly as vendor pricing and benchmark data shift. For corrections or source additions, contact eager2develop@forasoft.com.
