Real-time speech translation · 2026 guide

A practical guide to real-time speech translation — architecture for live video and audio.

How real-time speech translation works at production scale. The architectural fork every project hits: cascaded (ASR → MT → TTS, three vendors, three logs) versus end-to-end speech-to-speech (Meta SeamlessM4T v2, DeepL Voice, Google Translatotron). The six trade-offs that define every translation system. The latency budget that decides whether the experience feels natural or broken. Written from the platforms we have shipped: Translinguist (16+ language pairs across telehealth, legal, live events), VOLO.live (Black Hat USA 2025, 22,000 participants, six languages), Rafiky (conference interpretation).

20+ years in real-time multimedia · since 2005 | 100K+ conference minutes / month processed | 16+ language pairs in production · 6-language event scale
Industry recognition · 2019–2025: Top WebRTC Developer 2022 (category leader) | Best Custom A/V Dev. 2025 (category winner) | Clutch Global Spring 2024 (top performer) | APAC Insider 2024 Innovation & Excellence | Clutch 5.0 / 30 reviews

Quick answer

Real-time speech translation is the process of converting spoken audio in one language into spoken audio or text in another language with low enough latency that two-way conversation can flow naturally. Two architectures dominate in 2026: cascaded ASR → MT → TTS and end-to-end speech-to-speech using a single multilingual model.

A live video translation system captures audio from a WebRTC session, runs streaming automatic speech recognition to produce a transcript, translates the transcript with a machine-translation model, generates target-language speech with text-to-speech, and publishes the translated audio back into the session as a parallel audio track. End-to-end glass-to-glass latency typically lands at 1.2 to 3.0 seconds.

Cascaded chains three vendors and three models. Each stage emits inspectable text or audio, which makes per-stage observability and PII redaction simple. End-to-end speech-to-speech takes audio in and emits audio out from a single model. End-to-end wins on latency and prosody preservation. Cascaded wins on accuracy for technical vocabulary, vendor flexibility, and audit-trail clarity. Most production systems still ship cascaded in 2026. Not sure yet whether you need a translator, an interpreter, or AI at all? Read our 2026 decision tree first.

Topics & use cases covered in this guide

Cascaded, end-to-end, hybrid — and the production shapes each one fits.

Four shapes of real-time speech translation dominate the 2026 landscape. Each one fits a different architecture, vendor stack, and latency ceiling.

Under the hood of ASR

How streaming ASR decides when to commit.

Every cascaded translation pipeline starts at the ASR boundary. Three decoder families dominate production in 2026. Each makes a different trade against latency, accuracy, and streamability. The decoder families come first, followed by the tuning knobs that vendor choices expose.

CTC · Connectionist Temporal Classification

The simplest streaming decoder.

First end-to-end ASR loss to ship at production scale. Treats output as a sequence of tokens emitted one per acoustic frame with a special blank symbol filling gaps. Decoder reads frames left to right, decides per frame whether to emit a character or a blank. Streamable by construction. Conditional-independence assumption keeps inference simple but caps accuracy on long contexts.

Wins on

Streamability. Inference speed. Simplicity.

Loses on

Long-context accuracy. Token dependencies.

2026 production role

Rare as primary decoder. Common as auxiliary loss alongside attention/transducer heads where CTC alignment helps regularize training and accelerates early decoding.
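
The per-frame emit-or-blank behaviour described above can be sketched as a greedy CTC decode. This is a minimal illustration, not any vendor's decoder: it assumes the per-frame argmax tokens are already available, and the `_` blank symbol and the function name are hypothetical.

```python
from typing import Sequence

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch only


def ctc_greedy_decode(frame_tokens: Sequence[str], blank: str = BLANK) -> str:
    """Collapse one-token-per-frame output into text.

    Rule: drop a token that repeats the previous frame, then drop blanks.
    A blank between two identical tokens keeps both (e.g. the double "l").
    """
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)


if __name__ == "__main__":
    print(ctc_greedy_decode(list("hh_e_ll_lo")))
```

Because each frame is decided left to right with no dependence on future frames, the loop can run on a live stream as frames arrive.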

RNN-T · RNN-Transducer

The streaming workhorse.

Solves CTC's conditional-independence problem with a prediction network on top of the encoder. Joint network combines acoustic features (encoder) with linguistic features (prediction net), then decides to emit a token or wait. Monotonic alignment between input frames and output tokens. Naturally streamable, well-matched to speech.

Production data

Well-tuned RNN-T beats CTC by 10–20% relative WER on long-form English.

Trade-off

Decoder cost. Joint net runs per frame. Prediction state must be cached.

Modern variants

ConvRNN-T and stateless-prediction Conformer-RNN-T reduce per-frame cost by replacing the prediction LSTM with convolutional or stateless variants. Monotonic alignment is preserved.

Attention encoder-decoder

Accuracy at the cost of streamability.

Strongest accuracy on long contexts because attention has unrestricted access to past acoustic features. Standard cross-attention requires the entire input sequence before decoding starts. Production workaround: chunkwise streaming attention with a small lookback window. Triggered attention uses a CTC head to coarsely align then runs attention only at triggered positions.

Chunk size

200 to 1000 ms typical. Dominant latency knob.

Lookback window

320 to 640 ms typical. Larger = lower WER, higher latency.

Triggered attention hybrid

CTC head produces coarse alignment that triggers the attention decoder at token boundaries. Most of attention's accuracy advantage while preserving streamable behavior.

Conformer · 2026 standard encoder

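The architecture most modern streaming ASR uses.

Most modern streaming ASR encoders are Conformer variants; NVIDIA Canary, for example, uses a FastConformer encoder. Whisper Large-v3-Turbo is the notable exception: it uses a plain Transformer encoder-decoder and needs chunking to stream. Hosted vendors (Deepgram Nova-3, AssemblyAI Universal-Streaming, Speechmatics Ursa-2) publish fewer architectural details. A Conformer interleaves multi-head self-attention (global context) with depthwise-separable convolutions (local acoustic patterns) and outperforms pure-transformer and pure-CNN encoders on most public benchmarks.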

Streaming config

Self-attention with causal mask + 320–640 ms lookahead. Convolutions run chunked.

Two-pass models

Streaming Conformer-RNN-T (first pass) + non-streaming Conformer-attention (rescoring). Fast partials, accurate finals.

Production landscape

Most 2026 production-grade streaming ASR vendors run a Conformer-style encoder under different decoder heads. The decoder is the differentiator, not the encoder.

Knob 01 · audio per inference pass

Chunk size.

The unit of audio the encoder ingests before emitting partial activations downstream, typically 160 to 960 ms. Smaller chunks emit partials faster. Larger chunks give self-attention more acoustic context per pass, which lowers WER by 1 to 3 percentage points.

Production range

160 to 960 ms. Default ~320 ms.

Interaction

Stacks with jitter buffer. 80 ms buffer + 320 ms chunk = 400 ms ASR floor before any decoding.

When to tune

Lower the chunk size when first-partial latency must be under 200 ms. Raise it when WER on accented/cellular audio exceeds your budget by 2+ points.

Knob 02 · right-context bleed

Lookahead window.

How many milliseconds past the current chunk boundary the self-attention can see. Larger lookahead lowers WER but adds latency directly. 0 ms = strict causal. 640 ms = the production ceiling.

Production range

0 to 640 ms. Default 320 ms.

Trade rule

~1 ms of lookahead adds ~1 ms to first-partial latency.

When to tune

For interactive sub-300 ms targets, set to 0 and absorb the WER hit. For accuracy-critical recording transcripts, run a non-streaming second pass with unlimited lookback instead.
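
The way the jitter buffer, chunk size, and lookahead stack can be written out as simple arithmetic. A minimal sketch, assuming the three contributions are purely additive as the text describes; the function name is hypothetical and decoding time is deliberately left out.

```python
def first_partial_floor_ms(jitter_buffer_ms: int, chunk_ms: int, lookahead_ms: int = 0) -> int:
    """Lower bound on first-partial latency before any decoding work starts.

    Assumes the stages stack additively: the jitter buffer fills, the encoder
    waits for a full chunk, and lookahead adds about 1 ms of latency per 1 ms.
    """
    if min(jitter_buffer_ms, chunk_ms, lookahead_ms) < 0:
        raise ValueError("durations must be non-negative")
    return jitter_buffer_ms + chunk_ms + lookahead_ms


if __name__ == "__main__":
    # 80 ms buffer + 320 ms chunk, strict causal
    print(first_partial_floor_ms(80, 320))
    # same, with the default 320 ms lookahead
    print(first_partial_floor_ms(80, 320, 320))
```

Under this model a sub-300 ms target is unreachable at the default chunk size regardless of lookahead, which is why both knobs have to move together.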

Knob 03 · turn boundary detection

Endpointing aggressiveness.

When the streaming decoder declares the turn complete. Aggressive endpointing fires the MT layer faster but cuts off pauses mid-thought. Conservative endpointing waits longer but adds end-of-utterance latency on every turn.

Sales calls

400 ms silence. Users think out loud.

IVR replacement

250 ms silence. Users are purposeful.

Telehealth intake

600 ms silence. Older callers, longer pauses.

Court of record

800+ ms. Accuracy over speed.
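
A silence-based endpointer using the per-vertical thresholds above might look like the following. This is a sketch under assumptions: it expects one VAD decision per 20 ms frame from some upstream detector, the class and dictionary names are hypothetical, and 800 ms is used as the lower bound of the "800+" court setting.

```python
from dataclasses import dataclass

# Silence thresholds from the list above (court of record uses its lower bound).
ENDPOINT_SILENCE_MS = {
    "ivr": 250,
    "sales_call": 400,
    "telehealth_intake": 600,
    "court_of_record": 800,
}


@dataclass
class Endpointer:
    silence_threshold_ms: int
    frame_ms: int = 20
    _silence_ms: int = 0
    _heard_speech: bool = False

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision per frame; return True when the turn ends."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence never closes a turn
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.silence_threshold_ms:
            # Reset so the next utterance starts a fresh turn.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False
```

Raising the threshold delays the hand-off to MT by the same amount on every turn, which is the end-of-utterance cost the conservative settings accept.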

Knob 04 · domain-specific tokens

Custom vocabulary boost.

Up-weight specific tokens in the prediction network. Brand names, drug names, legal terms, technical jargon, named entities. Cuts technical-vocabulary WER by 5 to 8 percentage points on domain-specific content.

Format

List of words / phrases per session or per workspace. Vendor-specific syntax.

Limit

Typically 500 to 5000 terms per profile. Too many terms water down the boost.

Pitfall

Vendors weight common English words by default, so boosting "Acme" can still misfire when the audio sits acoustically close to a common phrase such as "ache me". Test on your audio before shipping.

Vendor caveat: vendors quote latency on best-case audio at default chunk size. Production audio (cellular, conference rooms, accent diversity) pushes the chunk size up and partial latency with it. Measure on your audio before committing.

The six trade-offs

Real-time speech translation as six engineering trade-offs.

Every architectural decision in real-time translation is a trade-off. Lower latency costs accuracy. Bigger models cost money. Aggressive signal processing flattens prosody. Each trade-off below lists the knob, what wins at each end, the production failure mode when the knob sits wrong, and the production fix.

Trade-off 01 · The knob: how soon you commit

Latency vs Accuracy.

Push knob this way

Commit fast (100 ms)

Wins on perceived latency. Loses on WER (14% on conversational EN).

Or push it this way

Wait for stability (300 ms)

Wins on accuracy (~8% WER). Loses on conversation rhythm.

Production data

Streaming Whisper Large-v3 commits transcripts ~300 ms after end-of-utterance. Forcing it to commit at 100 ms raises WER from ~8% to ~14% on conversational English. End-to-end S2ST (SeamlessM4T v2) hits sub-second latency at the cost of accuracy on long technical sentences.

Failure mode

Cascaded pipeline tuned for sub-1.2 s end-to-end starts mistranslating technical vocabulary. Users notice the wrong noun before they notice the latency win.

Production fix

Re-translation. The MT model re-translates the latest partial transcript every 200-500 ms. The displayed translation updates as the source becomes more complete. User sees text early. Final committed text reflects the full sentence. The biggest cascaded-pipeline optimization in 2026.

Trade-off 02 · The knob: model size

Model size vs Cost.

Bigger model

Whisper-large-v3 (1.55B)

Wins on accuracy and naturalness. 4-10× more expensive per minute.

Smaller model

Whisper-distil-large-v3

~96% accuracy retained, 6× faster, fraction of cost.

Production data

Whisper-distil-large-v3 keeps ~96% of Whisper-large-v3 accuracy on common English while running ~6× faster. NLLB-distilled-600M retains 90% of NLLB-3.3B BLEU. At 1,000+ concurrent translation streams, the cost differential is 4-10×.

Failure mode

Bulk traffic on premium models burns the budget before scale arrives. Bulk traffic on the smallest model surfaces accuracy gaps on the 5% of high-stakes sessions where it matters.

Production fix

Tiered routing. Smaller models for the bulk of traffic. Largest tier on a per-session flag for high-stakes sessions: CEO keynote, legal proceedings, medical consultation. Most production deployments route 80-90% of traffic to the cost-tuned tier.
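
Tiered routing on a per-session flag can be sketched in a few lines. The `Session` type, the `high_stakes` field, and the tier names are hypothetical; a real router would read the flag from whatever session metadata the platform already carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    session_id: str
    high_stakes: bool = False  # e.g. CEO keynote, legal proceeding, medical consultation


def pick_model_tier(session: Session) -> str:
    """Route flagged sessions to the largest tier, everything else to the cost-tuned tier."""
    return "premium" if session.high_stakes else "cost_tuned"


if __name__ == "__main__":
    print(pick_model_tier(Session("webinar-001")))
    print(pick_model_tier(Session("deposition-7", high_stakes=True)))
```

Keeping the decision to one explicit flag makes the 80-90% split auditable: the share of premium traffic is simply the share of flagged sessions.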

Trade-off 03 · The knob: noise suppression aggression

Signal processing vs Naturalness.

Aggressive suppression

RNNoise on full strength

ASR WER -1.5%. Emotional prosody flattened in the output.

Light processing

Minimal suppression

Natural prosody preserved. Higher WER on noisy audio.

Production data

Aggressive RNNoise cuts ASR WER ~1.5% on conference audio. It also reduces emotional prosody on the translated output, making the synthetic voice sound robotic. Listeners notice. VAD thresholds tuned for low false-positives miss soft speech and chop quiet talkers; thresholds tuned for low false-negatives let background noise through as speech.

Failure mode

Conference audio tuned aggressively delivers correct words but flat affect. Telehealth tuned lightly delivers natural voice but misses 6-8% more words on accented speech.

Production fix

Two parallel signal-processing paths. ASR sees the aggressively-suppressed stream (lowest WER). TTS prosody-extraction sees the lightly-suppressed stream (preserves emotional cues). Stitch them at the TTS layer. Per-vertical VAD threshold tuning closes the rest of the gap.

Trade-off 04 · The knob: jitter buffer depth

Buffering vs Responsiveness.

Larger buffer

120 ms jitter buffer

Absorbs network variance. Smoother audio. +80 ms latency.

Smaller buffer

40 ms jitter buffer

Responsive. Glitches on out-of-order packets.

Production data

A 40 ms jitter buffer on incoming WebRTC RTP plus a 300 ms ASR-stabilization window adds up to 340 ms before MT sees text. At conference scale, network variance pushes the jitter buffer to 80-120 ms, raising that floor to 380-420 ms. The two latencies stack.

Failure mode

Fixed buffer tuned for stable networks shows glitches on cellular and mid-conference Wi-Fi. Fixed deep buffer makes 1:1 telehealth feel laggy when network is fine.

Production fix

Adaptive jitter buffer. Opus default plus aggressive ASR partial-emission. Pair with re-translation (Trade-off 1) so the user sees text early even when the buffer absorbs a spike.
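
One way to picture an adaptive jitter buffer is a target depth that follows recent network jitter and stays inside the 40 to 120 ms range discussed above. This is a simplified sketch under stated assumptions, not the estimator used by WebRTC's NetEq or any other real stack: the mean-plus-k-sigma rule, the function name, and the default `k` are all illustrative.

```python
from statistics import mean, pstdev
from typing import Sequence


def target_buffer_ms(
    jitter_samples_ms: Sequence[float],
    min_ms: int = 40,
    max_ms: int = 120,
    k: float = 2.0,
) -> int:
    """Pick a buffer depth from recent inter-arrival jitter samples.

    Heuristic (assumption): mean + k * standard deviation, clamped to
    [min_ms, max_ms]. With no samples, fall back to the shallow buffer.
    """
    if not jitter_samples_ms:
        return min_ms
    estimate = mean(jitter_samples_ms) + k * pstdev(jitter_samples_ms)
    return int(min(max_ms, max(min_ms, round(estimate))))
```

On a quiet network the buffer sits at the floor and 1:1 calls stay responsive; when jitter spikes, the depth grows toward the ceiling instead of glitching.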

Trade-off 05 · The knob: cascaded vs end-to-end

Error propagation in cascaded systems.

Cascaded

Three stages, multiplied errors

ASR error × MT error × TTS error compound.

End-to-end

One model, different errors

No multiplication. Lower vocab accuracy. Harder audit trail.

Production data

An 8% ASR WER + BLEU 32 MT model + TTS layer compounds to 12-18% perceived translation accuracy degradation versus an oracle text-in MT-only baseline. End-to-end models drop accuracy 5-15% on out-of-distribution technical terms vs cascaded ASR + MT.

Failure mode

Misheard "twenty" becomes mistranslated "twelve" becomes spoken nonsense. Domain-specific named entities especially vulnerable.

Production fix

Custom domain vocabulary at every stage. ASR custom-vocab lists (5 to 8 point WER reduction on technical content). MT glossaries with brand and technical terms. TTS pronunciation overrides for named entities. The points unlocked here are the difference between "works" and "ships."

Trade-off 06 · The knob: shared vs dedicated workers

Scalability vs Quality Guarantees.

Autoscaled shared workers

Conference scale

Handles 132K simultaneous streams. Variance under load.

Dedicated per session

Court / regulated

99.95% delivery guarantee. Fixed cost per session.

Production data

Black Hat USA 2025 ran autoscaled workers for 22K participants × 6 languages = 132K simultaneous translation streams during keynote peaks. A high-stakes court transcription cannot tolerate a missed packet. Dedicated workers are the only safe path.

Failure mode

Conference architecture used for court delivery hits autoscale variance at the wrong moment. Court architecture used for conference broadcast cannot keep up with 132K-stream peaks.

Production fix

Two deployment shapes in one codebase. Default to autoscaled for events and broadcast. Flip to dedicated-worker mode on a per-session flag for regulated workflows. Translinguist and VOLO.live both ship this dual-mode pattern.

Every trade-off has a fix. The architecture you ship is the set of fixes you applied first. Skip a fix and the gap surfaces in production within the first month.

Under the hood of streaming MT

How simultaneous machine translation reads and writes.

Once a partial transcript arrives at the MT layer, a new decision starts. Should the model emit a translated token now (low latency, less context) or wait for more source tokens to arrive (more latency, better accuracy)? The decision policy is what separates a translation pipeline that feels real-time from one that feels sticky.

Wait-k · static canonical policy

Read k tokens, then alternate READ/WRITE.

The model reads the first k source tokens, then alternates READ and WRITE for the rest of the sentence. k is a fixed integer. Smaller k = lower latency + lower quality. Larger k = higher latency + higher quality.

Production sweet spot

Wait-3 to Wait-5

Wait-1

Research-grade

Wait-7

Feels laggy

Wait-3 trace: READ x1 → READ x2 → READ x3 → WRITE y1 → READ x4 → WRITE y2 → READ x5 → WRITE y3 ...

Vendors that quote 1.2 second cascaded latency are usually running wait-3 or wait-4 with a 250 ms ASR commit window underneath. Weakness: ignores syntactic structure. One k cannot fit EN→ES and EN→JA equally.
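
The wait-k schedule can be generated mechanically. The sketch below only produces the READ/WRITE action sequence for given source and target lengths; it assumes the lengths are known up front, which a live system does not have, and the function name is hypothetical.

```python
from typing import List


def wait_k_schedule(k: int, source_len: int, target_len: int) -> List[str]:
    """Return the READ/WRITE actions of a wait-k policy.

    Before writing target token t (1-based), the policy must have read
    min(k + t - 1, source_len) source tokens. Once the source is exhausted,
    the remaining target tokens are written back to back.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read = written = 0
    while written < target_len:
        if read < min(k + written, source_len):
            read += 1
            actions.append(f"READ x{read}")
        else:
            written += 1
            actions.append(f"WRITE y{written}")
    return actions


if __name__ == "__main__":
    print(" → ".join(wait_k_schedule(3, source_len=5, target_len=3)))
```

With k=3 the first three actions are always reads, which is the fixed lag that makes one k fit some language pairs better than others.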

MILk · Monotonic Infinite Lookback attention

Learned alignment with unrestricted backward lookback.

The decoder learns a soft alignment with the source sequence. Once it commits to writing token t, it cannot retract. Lookback is unrestricted backward (hence "infinite"). Trains end-to-end with a latency-augmented loss that balances quality against latency.

Policy type

Adaptive, learned

Commits

Cannot retract

Lookback

Unrestricted backward

Replaces the static wait-k weakness with a per-state decision. Production benefit: handles language-pair asymmetry (EN-JA vs EN-ES) inside one model without per-pair retraining.

Wait-Info / EDAtt

Information-content thresholding.

Each emitted target token carries an information-content estimate. The policy emits the next target token when the expected information gain from waiting drops below a threshold. EDAtt (Encoder-Decoder Attention) uses the distribution of encoder-decoder attention to detect when enough source context has arrived for a confident emission.

Signal

Attention energy

Decision

Confidence-based

Tunable

Threshold per session

Better latency-quality frontier than static wait-k on most language pairs. Trade: harder to interpret because the threshold tuning is opaque to the operator.

SeqPO-SiMT · 2025 frontier

Sequential policy optimization with composite rewards.

2025 research frames the READ / WRITE decision as a multi-step stochastic process and optimizes a group-relative advantage with composite rewards for both quality (COMET) and latency (StreamLAAL). The result is a policy that adapts to language pair and sentence structure on the fly. Production-ready in late 2026 for premium-tier deployments.

Reward signal

COMET + StreamLAAL

Optimization

Group-relative advantage

Adapts to

Pair + sentence shape

Pushes the latency-quality frontier outward across language pairs without per-pair retraining. Caveat: training compute cost is significantly higher than wait-k baseline.

Re-translation · production cheat

Translate every partial. Update the display. Beat policy search.

The MT model re-translates the latest partial transcript every 200 to 500 ms. The user sees an early translation that updates as the source becomes more complete. The final committed translation reflects the full sentence, not an early prefix.

Cadence

200–500 ms

Accuracy lift

+8–15% vs wait-k

Display impact

Text updates inline

t=200 ms: "The patient reports..."
t=400 ms: "The patient reports persistent..."
t=600 ms: "The patient reports persistent headaches..."
t=800 ms (commit): "The patient reports persistent headaches for three weeks."

Works because target-side display tolerates updates. Audio TTS cannot retract — so re-translation for spoken output uses commit-then-stop with the cleanest committed prefix. Most 2026 production pipelines (DeepL, Google Translate Live, Translinguist) ship some form of re-translation.
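
Since spoken output cannot be retracted, one plausible way to find a safe-to-speak prefix is to keep only the words on which the most recent re-translations agree. This is an illustrative assumption about how a "committed prefix" could be chosen, not a description of any named vendor's method; the function name is hypothetical.

```python
from typing import List, Sequence


def stable_prefix(hypotheses: Sequence[str]) -> str:
    """Longest word-level prefix shared by every hypothesis in the window."""
    if not hypotheses:
        return ""
    token_lists = [h.split() for h in hypotheses]
    prefix: List[str] = []
    # zip stops at the shortest hypothesis, so the prefix never overruns it.
    for column in zip(*token_lists):
        if all(tok == column[0] for tok in column):
            prefix.append(column[0])
        else:
            break
    return " ".join(prefix)


if __name__ == "__main__":
    window = [
        "The patient reports persistent",
        "The patient reports persistent headaches",
    ]
    print(stable_prefix(window))
```

A wider window of hypotheses makes the spoken prefix more conservative, which costs some latency but cuts the risk of voicing words that a later re-translation would have changed.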

The latency-quality frontier

Every policy plots somewhere on this curve.

Quality (BLEU or COMET) versus latency (StreamLAAL, in seconds). The frontier moves about 0.2 BLEU per second of latency at the operating point of practical real-time translation. IWSLT 2025 reference: Whisper Large-v3-Turbo + NLLB-3.3B hit BLEU 31.96 at StreamLAAL 2.94 seconds.

[Figure: latency-quality frontier. x-axis: latency (StreamLAAL, 1 to 5 s). y-axis: quality (BLEU, 25 to 40). Plotted points: Wait-1, Wait-3, Wait-7, and the IWSLT 2025 reference at 2.94 s, BLEU 31.96.]

Architecture choice

Cascaded vs end-to-end decision tool.

Six questions about your use case. Each one weighs the trade-off between cascaded and end-to-end speech-to-speech. Pick the answer that fits your reality. The recommendation, score, rationale, and a starter stack update live.

Default recommendation: cascaded, the default for 85% of production deployments.

Recommendation is a heuristic, not a contract. A complex stack decision deserves a 30-minute architecture review. Book one if the score is close.

Reference architecture

The 12 components every production real-time translation system needs.

Click any component to expand its role, vendor options, latency budget, and the canonical Fora Soft client running that pattern. Components sit around the cascaded ASR → MT → TTS core. Omit any one and the gap surfaces as a production incident in the first month.

Vendor matrix

The 2026 ASR / MT / TTS vendor matrix.

Pick a category. Filter by latency budget and HIPAA BAA. Surfaces the matching vendors with 2026 spec sheets. For a deep public-data comparison of the four leading event-platform vendors (DeepL Voice, KUDO, Interprefy, Meta SeamlessM4T), read Article A.

Latency budget

The latency budget for production real-time speech translation.

Pick architecture and scenario. Each stage's latency contribution surfaces with the optimization lever. The user-perception band auto-highlights where the current total lands.

Total end-to-end latency is measured from the end of source speech to translated audio at the listener.

User perception bands

Sub-700 ms: Feels real-time. Users do not perceive latency.
700 ms – 1.5 s: Perceptible but acceptable. Conversation rhythm holds.
1.5 – 3 s: Visibly lagged. Users start interrupting.
3 – 5 s: Feels broken. Users revert to typed translation or human interpreter.
5 s+: One-way broadcast only. Unusable for two-way.

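
The perception bands map directly onto a lookup from measured end-to-end latency. A minimal sketch: the function name is hypothetical, and assigning each boundary value (700 ms, 1.5 s, 3 s, 5 s) to the slower band is an assumption, since the bands above do not say which side owns the edge.

```python
def perception_band(total_ms: float) -> str:
    """Classify end-to-end latency into the user perception bands."""
    if total_ms < 0:
        raise ValueError("latency cannot be negative")
    if total_ms < 700:
        return "feels real-time"
    if total_ms < 1500:
        return "perceptible but acceptable"
    if total_ms < 3000:
        return "visibly lagged"
    if total_ms < 5000:
        return "feels broken"
    return "one-way broadcast only"


if __name__ == "__main__":
    for ms in (450, 1200, 2940, 6000):
        print(ms, "->", perception_band(ms))
```
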
Production engineering

Six engineering decisions the latency budget hides.

The latency budget is the headline. The engineering work sits underneath it. Six decisions decide whether a clean 1.5-second design holds at 1.5 seconds in production or stretches to 3.5 seconds under real load. Each decision below covers the production data, the failure mode, and the fix.

Decision 01 · ASR encoder

Chunk size at the ASR encoder.

A streaming Conformer encoder runs in chunks. Production chunk sizes range from 160 ms to 960 ms. Smaller chunks emit partials faster. Larger chunks give self-attention more acoustic context per pass, which lowers WER by 1 to 3 percentage points.

Range

160 — 960 ms

Default

320 ms

WER reduction with larger chunks

1 to 3 pp

Fix: The interaction with the jitter buffer matters. If the WebRTC jitter buffer holds 80 ms of audio and the ASR encoder waits for 320 ms chunks, the effective ASR latency floor is 400 ms before any decoding work begins. Tune jitter buffer and chunk size together, not separately.

Decision 02 · MT layer

Re-translation cadence at the MT layer.

Re-translation runs the MT model on every refreshed ASR partial. Run it too often (every 100 ms) and the GPU saturates. Run it too rarely (every 1 second) and the user sees a sticky cursor and stale text.

Production cadence

200 — 500 ms

GPU saturation point

~100 ms cadence

User sticky-text

~1000 ms cadence

Fix: Settle at 200 to 500 ms with debouncing on identical input strings. The MT response carries a commit-tag indicating whether the current emission can be considered stable. Display layer decides whether to flash the update or hold it.
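
The cadence, debounce, and commit-tag behaviour can be sketched as a small wrapper around the MT call. Everything here is an assumption for illustration: `ReTranslator`, `Emission`, and the `translate` callable stand in for whatever MT client the pipeline actually uses, and timestamps are passed in explicitly so the logic stays testable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Emission:
    text: str
    stable: bool  # commit tag: True only for a final, committed source sentence


class ReTranslator:
    def __init__(self, translate: Callable[[str], str], cadence_ms: int = 350):
        self._translate = translate
        self._cadence_ms = cadence_ms
        self._last_source: Optional[str] = None
        self._last_run_ms: Optional[int] = None

    def on_partial(self, source: str, now_ms: int, is_final: bool = False) -> Optional[Emission]:
        """Return a new translation, or None when the call is suppressed."""
        if not is_final:
            if source == self._last_source:
                return None  # debounce: identical input, nothing new to translate
            if self._last_run_ms is not None and now_ms - self._last_run_ms < self._cadence_ms:
                return None  # inside the cadence window, skip to protect the GPU
        self._last_source = source
        self._last_run_ms = now_ms
        return Emission(self._translate(source), stable=is_final)
```

Final transcripts bypass both checks so the committed translation is never delayed by the cadence window; the display layer can use `stable` to decide whether to flash or hold an update.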

Decision 03 · stage boundaries

Backpressure across the cascade.

ASR partials arrive at a rate the MT layer cannot always service immediately. MT outputs arrive at a rate the TTS cannot always voice immediately. Without backpressure, queues build up and end-to-end latency drifts up minute by minute.

Failure mode

Queue accumulation

Symptom

Latency drift over minutes

Detection

Per-stage queue depth

Fix: Bounded queues at each stage boundary with explicit drop policies. MT layer drops stale partials when a fresher one arrives. TTS layer drops queued chunks that no longer reflect the committed translation. Drop policy is a design decision, not a runtime accident.
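
A bounded queue with an explicit drop-oldest policy is one way to make the drop decision a design choice. This is a single-threaded sketch with hypothetical names; a production stage boundary would add locking or use an async queue, and might prefer a different policy per stage.

```python
from collections import deque
from typing import Deque, Generic, Optional, TypeVar

T = TypeVar("T")


class DropOldestQueue(Generic[T]):
    """Bounded FIFO that evicts the stalest item when full and counts the drops."""

    def __init__(self, maxlen: int):
        if maxlen < 1:
            raise ValueError("maxlen must be at least 1")
        self._items: Deque[T] = deque(maxlen=maxlen)
        self.dropped = 0  # expose as a per-stage metric

    def put(self, item: T) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # deque(maxlen=...) discards the oldest on append
        self._items.append(item)

    def get(self) -> Optional[T]:
        return self._items.popleft() if self._items else None

    def depth(self) -> int:
        return len(self._items)
```

With `maxlen=1` in front of the MT layer, a fresher ASR partial always replaces a stale one, so queue depth cannot drift upward; the `dropped` counter and `depth()` give the per-stage telemetry needed to detect pressure.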

Decision 04 · codec boundaries

Audio resampling at codec boundaries.

WebRTC carries Opus at 48 kHz. Most ASR models expect 16 kHz mono PCM. Most TTS models emit 24 kHz mono PCM. Resampling at the boundaries costs CPU and introduces sub-millisecond latency that adds up across stages.

WebRTC

48 kHz Opus

ASR expects

16 kHz PCM

TTS emits

24 kHz PCM

Fix: Use a low-latency resampler (linear interpolation for chat-quality, polyphase FIR for production). Resample once per stage, not multiple times. Common bug: 48k→16k→24k→48k across the cascade when the math could collapse to one resampling step.
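
A polyphase resampler is driven by an integer up/down ratio, so it helps to compute the reduced ratio for each boundary explicitly. The helper below is a small sketch with a hypothetical name; it only derives the factors and does no filtering itself.

```python
from math import gcd
from typing import Tuple


def resample_factors(src_hz: int, dst_hz: int) -> Tuple[int, int]:
    """Return the reduced (up, down) integer factors for src_hz -> dst_hz."""
    if src_hz <= 0 or dst_hz <= 0:
        raise ValueError("sample rates must be positive")
    g = gcd(src_hz, dst_hz)
    return dst_hz // g, src_hz // g


if __name__ == "__main__":
    print(resample_factors(48_000, 16_000))  # WebRTC Opus -> ASR input
    print(resample_factors(24_000, 48_000))  # TTS output -> WebRTC track
```

The resulting pair can be handed to a polyphase implementation such as `scipy.signal.resample_poly(x, up, down)`. Computing the factors once per boundary also makes accidental double conversions easy to spot in a pipeline config.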

Decision 05 · VAD boundary

Dual VAD at the right boundary.

VAD running on the publisher side trades client compute against network savings. The publisher's VAD detects silence and stops the upstream audio entirely. The server-side VAD acts as a backstop, detecting silence the publisher missed (echo, music, background voices).

Publisher VAD

Aggressive · saves bandwidth

Server VAD

Conservative · backstop

Production

Run both

Fix: Production pipelines run both VADs together. Publisher is aggressive (saves bandwidth, stops upstream). Server is conservative (catches edge cases publisher missed). Avoids the failure mode where a too-aggressive publisher VAD chops mid-sentence quiet speech.

Decision 06 · routing

Worker affinity and session state.

ASR, MT, and TTS workers are stateful within a session. ASR holds the encoder cache and prediction network state. MT holds conversation context for re-translation. TTS holds voice-clone embedding and prosody state. Routing the next packet to a different worker breaks state.

Routing

Consistent hash

Hash key

Session ID

Failover cost

Cold-start tail

Fix: Pin sessions to workers via consistent hashing on session ID. On worker failure, the session migrates to a replacement with cold-start cost. Hot standbys for the top 5% of sessions (high-stakes, premium tier) cut the cold-start tail. Replacement strategy is part of the SLO design.
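
Session pinning with consistent hashing can be sketched with a sorted ring of virtual nodes. The class name, worker names, and the choice of SHA-256 with 64 virtual nodes per worker are assumptions for illustration, not the routing layer of any named platform.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for strings.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, workers: Iterable[str], vnodes: int = 64):
        ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        if not ring:
            raise ValueError("at least one worker is required")
        self._ring = ring
        self._keys = [k for k, _ in ring]

    def worker_for(self, session_id: str) -> str:
        """Walk clockwise from the session's hash to the next virtual node."""
        idx = bisect.bisect(self._keys, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]
```

When a worker drops out, only the sessions that were pinned to it move; every other session keeps its worker and its cached encoder, context, and voice state.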

Worked example

Translinguist's production cascaded pipeline.

Audio capture: WebRTC publisher VAD @ 350 ms silence; Opus 48 kHz, 20 ms frames
Server transport: LiveKit SFU regional cluster; consistent-hash routing
ASR: Whisper Large-v3 streaming; chunk 320 ms; lookback 320 ms; server VAD backstop @ 400 ms
ASR post: punctuation restoration; custom vocab (medical + named entity); PHI redaction before MT crosses BAA boundary
MT: DeepL API streaming; re-translation cadence 350 ms; failover to self-hosted NLLB-3.3B if DeepL > 1.2 s
TTS: ElevenLabs Multilingual v2 streaming; per-practitioner voice clone; prosody contour passed from ASR via side channel
Outbound: WebRTC track back into LiveKit room
End-to-end median: 1.6 s

Use case → architecture

Use case → architecture matcher.

Six canonical use cases. Pick one to see the production-tested architecture, vendor stack, latency target, compliance shape, and the Fora Soft client running it.

Language pair complexity

Why some language pairs are harder than others.

Pick source and target language. Surfaces the tier, realistic BLEU, ASR WER, end-to-end latency target, recommended vendor stack, and known production failure modes for that pair.

How to measure

WER, BLEU, COMET, StreamLAAL — what each score means in practice.

Slide the score. See the example output at that level. Watch the threshold band light up. Map abstract numbers to real user perception.

WER · Word Error Rate

Word Error Rate measures ASR transcription accuracy.
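
WER is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (casing, punctuation, and number formatting are left as-is); the function name is hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count against the reference length, WER can exceed 100% on badly degraded audio, which is worth remembering when setting alert thresholds.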


SRE for translation

What to measure when a translation goes wrong.

A translation pipeline has more failure surfaces than a regular WebRTC call. Audio can degrade. ASR can mishear. MT can mistranslate. TTS can mispronounce. Latency can stretch at any stage. Voice cloning can drift. Without per-stage telemetry, the user-report "the translation was bad" is unsolvable.

# | Metric | What it measures | Threshold | Alert
01 | ASR commit latency | End-of-utterance → stable transcript | P95 < 500 ms | P95 > 800 ms · 5 min
02 | MT first-token latency | Stable transcript → first translated token | P95 < 250 ms | P95 > 400 ms
03 | TTS time-to-first-byte | MT completion → first audio chunk at listener | P95 < 300 ms | P95 > 500 ms
04 | End-to-end glass-to-glass | Composite of all stages plus transport. The only number users feel. | P95 < 2.5 s | P95 > 3.5 s
05 | ASR WER (sampled) | Sampled audio against reference transcripts | < 10% | > 15%
06 | MT BLEU / COMET (sampled) | Sampled translation pairs against references | COMET > 0.80 | COMET < 0.70
07 | Voice clone similarity drift | 5% TTS sample run through speaker-similarity classifier | Similarity > 0.85 | Similarity < 0.75

Per-call replay · the SRE primitive

Store the full per-turn trace for every session.

Source audio + ASR partials + MT outputs + TTS audio. Retention 7 to 30 days. When a customer complains, replay the call. Every stage timed. Every model output captured. Every tool result recorded. The only way to debug "the translation was wrong at 14:23" reports that arrive 3 days after the session ended.

Storage cost: ~5 MB / minute audio + ~100 KB / minute telemetry. At 100K minutes / month: 500 GB compressed audio + 10 GB telemetry. Cheap insurance.

Alert 01 · user-pain

End-to-end P95 over 3.5 s for 5 minutes.

e2e_latency.p95 > 3500ms FOR 5m

Real user pain. Pages on-call. Pull the per-stage breakdown to localize: which stage drifted? ASR? MT? TTS? Or a transport regression?
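
The "above threshold FOR a duration" condition can be evaluated with a small state machine over timestamped samples. This is a sketch with hypothetical names; the rule syntax in this section is not tied to a specific alerting product, so a real deployment would express the same logic in its monitoring system's own language.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> bool:
    """True if the metric stays above threshold for at least for_seconds.

    samples: (timestamp_seconds, value) pairs in time order. Any sample at or
    below the threshold resets the breach window.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= for_seconds:
                return True
        else:
            breach_start = None
    return False
```

Requiring the breach to be continuous is what keeps a single slow minute from paging on-call while still catching a real five-minute drift.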

Alert 02 · stage regression

ASR commit latency P99 over 1.5 s.

asr_commit_latency.p99 > 1500ms

ASR regression or encoder backpressure. Usually triggered by a model version swap, a worker pool exhaustion, or a region-specific cold-start storm.

Alert 03 · quality regression

MT BLEU on sampled traffic below 25.

mt_bleu_sampled < 25 FOR 30m

Model regression or vocabulary mismatch. Usually triggered by a vendor model update or by a customer-specific domain shift. Check the eval harness against the current model version.

Alert 04 · capacity

Worker capacity over 80% for 5 minutes.

worker_cpu.p95 > 0.80 FOR 5m

Capacity exhaustion. Autoscale is lagging or hitting a ceiling. Provision more workers. Check autoscale policies and queue depth feedback.

Everything else is dashboards, not pages. Most translation fleets drown in noise. The four above are the only alerts that pay rent. Build the rest as dashboards your SRE team can pull on demand.
Voice cloning

What voice cloning actually transfers — and what it leaves behind.

In a cascaded pipeline, default TTS substitutes a synthetic voice for the source speaker. Listeners hear their doctor's words read by a stranger. Voice cloning solves this. Three patterns ship at production quality in 2026. Three transfer problems hide inside each one. Both are covered below.

Pattern 01 · zero-shot

Instant voice cloning.

ElevenLabs Instant Voice Cloning, Cartesia voice clone. Take 30 seconds of source audio, generate a voice profile, use it for the rest of the call. No training. Lowest friction.

Reference audio: 30 sec
Training time: None
EN similarity: ~98%
Cross-lingual: −10 to −20 pp

Pattern 02 · trained

Professional voice cloning.

ElevenLabs Professional Voice Cloning, Microsoft Custom Neural Voice. 30+ minutes of training audio. Near-indistinguishable quality. Required for high-stakes deployments and most jurisdiction approvals.

Reference audio: 30+ min
Training time: 6 to 12 hours
EN similarity: ~99%
Jurisdiction approval: Yes

Pattern 03 · end-to-end

SeamlessExpressive speaker preservation.

SeamlessM4T v2 Expressive. The S2ST model itself preserves source speaker style across the target language. No separate voice-cloning stage. Fewer compliance complications than a separate cloning step.

Reference audio: Live signal
Separate stage: None
Language coverage: EN ↔ 5 langs
Compliance shape: Simpler

Timbre, prosody, emotion — each transfers differently.

Three problems hiding in one label.

Transfer 01 · spectral shape

Timbre transfer.

Captures the spectral shape that makes a voice sound like a specific person. The model extracts a speaker embedding (256–512 dimensional vector) and conditions the decoder on it.

EN same-language: 95%
Cross-lingual: 75–85%

Cross-lingual loses 10–20 pp because the model interpolates phonemes the source voice never produced.
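
Similarity between two speaker embeddings is often scored with cosine similarity. Whether the percentages above were measured that way is not stated, so treat this as an illustrative sketch. The function name and the toy 4-dimensional vectors are assumptions; production embeddings are 256–512 dimensional.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for source-speaker and cloned-voice embeddings.
source_voice = [0.9, 0.1, 0.3, 0.2]
cloned_voice = [0.8, 0.2, 0.3, 0.1]
score = cosine_similarity(source_voice, cloned_voice)
```

A score near 1.0 means the clone sits close to the source in embedding space. It says nothing about prosody or emotion.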

Transfer 02 · rhythm + pitch

Prosody transfer.

Captures the rhythm, stress, and pitch contour. Standard TTS does this from text-level cues. Expressive S2ST does it from the source audio directly.

S2ST Expressive: ~88%
Cascaded + pitch-energy contour: ~60%

Listeners notice when cadence is off even when the timbre is right.

Transfer 03 · affect

Emotional transfer.

Captures the affect: excited, somber, sympathetic, irritated. The hardest of the three. Few production systems handle it cleanly outside the SeamlessExpressive variant.

SeamlessExpressive: ~70%
Cascaded default TTS: ~20%

Most cascaded pipelines flatten emotion to a neutral baseline. Listeners feel the loss without naming it.

Rendering note: Most modern neural TTS does not synthesize the waveform in a single step. The decoder produces an intermediate representation, often discrete units (HuBERT, WavLM, or proprietary), and a separate vocoder (BigVGAN, Vocos, PRETSSEL for SeamlessExpressive) renders it into 24 or 48 kHz audio. Flow-matching renderers need 1 to 8 forward passes versus tens for diffusion. Use the matched vocoder for whichever TTS you ship; unit mismatch produces audible artifacts.
Compliance

The six regulatory frames every production translation system handles.

Real-time speech translation processes spoken audio, often handles biometric voice data, and increasingly performs identity preservation via voice cloning. Every production deployment plans around six regulatory frames. Each frame below lists the requirements, the deadlines, and the engineering cost.

Frame 01 · US healthcare

HIPAA — telehealth and US healthcare.

Self-host the pipeline on BAA-able cloud. Every vendor in the chain must sign a BAA. PHI redaction between ASR and MT stages. Encrypted recording storage with six-year retention. Translinguist's telehealth deployment is the canonical reference.

BAA chain: AWS, GCP, or Azure — all offer BAAs. ASR: Deepgram Enterprise, AssemblyAI Enterprise, Whisper self-host. MT: Azure Translator, AWS Translate, NLLB-3.3B self-host. TTS: ElevenLabs Enterprise, Cartesia Enterprise, Azure Neural.
PHI redaction: Between ASR and MT stages. Drop diagnosis codes from transcript before MT sees them.
Encrypted recording: Six-year retention. AES-256 at rest. Key rotation.
Audit logging: Every join, leave, recording access, admin action. Retained 6 years.
Access control: RBAC, MFA, automatic session termination.
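
A minimal sketch of the redaction step that sits between ASR and MT. The regular expressions and placeholder labels are illustrative assumptions: they catch dotted ICD-10-style codes, US SSN-shaped numbers, and 10-digit phone numbers, and nothing else. A production deployment would use a dedicated PHI detection service rather than three patterns.

```python
import re

# (placeholder, pattern) pairs, applied in order. All hypothetical.
PATTERNS = [
    ("[DIAGNOSIS_CODE]", re.compile(r"\b[A-TV-Z]\d[0-9AB]\.[0-9A-Z]{1,4}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def redact_transcript(text):
    """Replace matches with placeholders before the text reaches MT.

    Returns the redacted text plus a count per placeholder, so the
    audit log can record that redaction happened without storing PHI.
    """
    counts = {}
    for placeholder, pattern in PATTERNS:
        text, n = pattern.subn(placeholder, text)
        if n:
            counts[placeholder] = n
    return text, counts


redacted, counts = redact_transcript("Patient has E11.9, call 555-123-4567.")
```

Placeholders pass through most MT engines unchanged, which lets the target-language transcript keep its shape while the identifiers never leave the ASR stage.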

Engineering effort

4–6 engineer-weeks

Chain-break cost

One vendor without BAA = entire chain non-compliant

Canonical reference

Translinguist telehealth

Frame 02 · EU AI regulation

EU AI Act — 2 August 2026 deadline.

From 2 August 2026, translation systems serving EU users fall under Article 50 transparency obligations, on top of the general-purpose AI model rules that apply to upstream model providers. Four engineering deliverables.

Deadline 2 August 2026: Article 50 transparency obligations go live for AI systems that interact with people or generate synthetic audio for EU users.
(1) AI disclosure: At the start of every session. "This conversation is being translated by AI." Logged and consent-captured.
(2) Interaction logging: Timestamps, source / target language, model versions, tool calls, agent decisions. Retained per Article 50.
(3) Provider compliance evidence: Documentation that the underlying ASR / MT / TTS models meet EU AI Act conformance — typically the vendor's published model card.
(4) Bias and accuracy reporting: For high-risk uses (employment, education, healthcare, law enforcement, justice administration, asylum, border control).
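
A sketch of what one interaction log record could look like, covering the fields named in deliverables (1) and (2). The class name, field names, and JSON layout are assumptions for illustration, not a format prescribed by the regulation.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now():
    return datetime.now(timezone.utc).isoformat()


@dataclass
class InteractionLogEntry:
    session_id: str
    source_lang: str
    target_lang: str
    model_versions: dict      # e.g. {"asr": "...", "mt": "...", "tts": "..."}
    disclosure_shown: bool    # AI disclosure played at session start
    consent_captured: bool
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(entry):
    """Serialize one entry as a single JSON line for append-only storage."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = InteractionLogEntry(
    session_id="sess-001",
    source_lang="en",
    target_lang="de",
    model_versions={"asr": "asr-v1", "mt": "mt-v1", "tts": "tts-v1"},
    disclosure_shown=True,
    consent_captured=True,
    timestamp="2026-08-02T09:00:00+00:00",
)
line = to_log_line(entry)
```

Recording model versions per session is what makes a later accuracy or bias report reproducible after a vendor updates a model.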

Engineering effort

4–8 engineer-weeks · standard-risk

High-risk uplift

+ 6–12 weeks

Audit retention

As specified in Article 50

Frame 03 · EU privacy

GDPR — EU data subjects.

Lawful basis for processing the audio and transcript. Right to erasure within 30 days. DPIA before launch for high-risk processing. Data residency.

Lawful basis: Consent, contract, or legitimate interest. Documented per Article 6.
Right to erasure: Within 30 days of request. Recording deletion + transcript deletion + downstream model log redaction.
DPIA: Documented before launch for high-risk processing.
Data residency: EU-only storage and processing for EU subjects. Self-hosted on AWS Frankfurt / Dublin / Stockholm or GCP Belgium / Netherlands.

Frame 04 · legal proceedings

Court of record — jurisdiction-specific regulations.

AI translation in court-of-record settings serves as transcription support. Certified human interpreters remain mandatory for binding proceedings in nearly every developed-country jurisdiction.

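US: Federal courts certify interpreters through the Federal Court Interpreter Certification Examination, administered by the Administrative Office of the U.S. Courts. NAJIT (National Association of Judiciary Interpreters and Translators) is the main professional association. State courts vary.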
UK: NRPSI (National Register of Public Service Interpreters). High-stakes proceedings require certified human interpreters.
EU: Directive 2010/64/EU on the right to interpretation and translation in criminal proceedings. AI translation is generally inadmissible without certified human review.
Canada: ATIO (Association of Translators and Interpreters of Ontario) and provincial equivalents.
Binding proceedings rule: AI alone is not admissible. Human-in-the-loop is mandatory.

Frame 05 · US recording law

Two-party consent states.

In two-party-consent states, both parties must consent to recording. Translated audio recording counts. The opening disclosure must capture explicit consent.

Two-party states: California, Florida, Illinois, Maryland, Massachusetts, Montana, Nevada, New Hampshire, Pennsylvania, Washington.
Required disclosure: "This call is being translated by AI and may be recorded. Press 1 or say 'yes' to continue, or stay on the line to opt out."
Consent capture: Logged with timestamp and session ID. Retained alongside recording.

Frame 06 · biometric data

Biometric voice data — voice cloning regulation.

Voice cloning processes biometric data. Multiple jurisdictions regulate biometric processing. Consent capture before cloning. Explicit retention and deletion policies. Documented purpose limitation.

CCPA / CPRA (California): Biometric data is a "sensitive personal information" category. Notice + opt-out + deletion within 45 days.
Illinois BIPA: Written consent before collection. $1K–$5K per violation. Class-action exposure.
EU GDPR Article 9: Voice = biometric data when used for identification. Explicit consent or one of the Article 9(2) exceptions.
Voice cloning consent: Always capture written consent before training a clone. Retention default 30 days after session unless explicitly extended.
Cost calculator

Translation cost calculator.

Pick stack tier, monthly minutes, target languages, voice cloning, deployment mode. Surfaces per-minute cost, monthly cost, break-even vs human interpreter, and the cost-component breakdown.

Calculator outputs: monthly translation cost, per-minute cost (per language), equivalent human cost at the chosen interpreter rate, monthly savings versus human, and build payback in months (assuming a $30K MVP build).

Numbers reflect 2026 production pricing. Self-hosted savings assume amortized infrastructure cost. Add ~20% overhead for ops, observability, recording storage. Mileage varies by language pair complexity and voice verbosity.
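
A rough sketch of the arithmetic behind these figures. The function names are hypothetical. The example inputs (8,000 minutes, $0.10 per minute, a $50 hourly interpreter rate) are illustrative assumptions, not quotes; the 20% overhead and the $30K build cost come from this section.

```python
def monthly_ai_cost(minutes, rate_first_lang, extra_langs=0,
                    rate_extra_lang=0.0, overhead=0.20):
    """Monthly cost in USD, including ops/observability/storage overhead."""
    base = minutes * (rate_first_lang + extra_langs * rate_extra_lang)
    return base * (1 + overhead)


def monthly_human_cost(minutes, hourly_rate):
    """Cost of covering the same minutes with a human interpreter."""
    return minutes / 60 * hourly_rate


def payback_months(build_cost, human_cost, ai_cost):
    """Months until monthly savings repay the build; None if no savings."""
    savings = human_cost - ai_cost
    if savings <= 0:
        return None
    return build_cost / savings


ai = monthly_ai_cost(8_000, 0.10)           # one language pair
human = monthly_human_cost(8_000, 50.0)     # assumed hourly rate
months = payback_months(30_000, human, ai)  # $30K MVP build
```

With these inputs the build pays back in a little over five months. Returning `None` when savings are zero or negative keeps the calculator from reporting a meaningless payback period.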

Scale tier

Scale tier — architecture per concurrency.

Real-time translation scales differently from regular WebRTC. ASR / MT / TTS workers are stateful. They hold a session per participant. Adding a participant adds a worker (or a worker slot). Adding a language adds a parallel MT and TTS chain. Each tier below lists the architecture, vendor stack, autoscale strategy, and the canonical Fora Soft client at that scale.

Tier 01 · 1:1 · 2 participants
1-2 language pairs · single worker pool per pair

1:1 telehealth and patient–practitioner interpretation.

A single worker pool per language pair. Cascaded ASR → MT → TTS with voice cloning. HIPAA-compliant vendor stack. Self-hosted infrastructure. The simplest scale shape with the strictest compliance.

Architecture

Cascaded · voice clone

ASR → MT → TTS, BAA chain

Latency target

Sub-2.0 s

Conversational rhythm

Concurrency

1–500 sessions

Tied to clinician availability

Vendor stack

Whisper Large-v3 (HIPAA self-host) · NLLB-3.3B self-host or DeepL · ElevenLabs Multilingual v2 Enterprise with trained voice cloning per practitioner.

Workers per session

1 ASR worker + 1 MT model invocation + 1 TTS worker. Stateful. Dedicated, not shared.

Autoscale strategy

Warm pool of 3-5 workers per region. Scale up on dispatch queue depth. Never share workers across sessions.

Failure mode at this tier

PHI leaking between ASR and MT stages. Voice clone drifting when a practitioner's clone has not been retrained recently. BAA chain breaking when a vendor's BAA lapses without renewal.

Fora Soft shipped at this tier

Translinguist telehealth — 16+ language pairs in production. Patient hears the doctor speak their language directly via trained voice clone. PHI redaction between ASR and MT.

Tier 02 · 3-20 participants · small group meeting
1-4 language pairs · multi-target fan-out

Small group meetings with multi-target fan-out.

Cascaded with multi-target fan-out. Source ASR runs once. MT and TTS fan out per target language. Each additional language adds parallel MT + TTS workers, not parallel ASR.

Architecture

Cascaded · multi-target fan-out

1 ASR → N MT → N TTS

Latency target

1.5-2.5 s

Group meeting rhythm

Concurrency

1-100 rooms

3-20 participants × 1-4 langs

Vendor stack

Deepgram Nova-3 streaming · DeepL API or GPT-4o-mini prompted · Cartesia Sonic 3 streaming TTS per target language.

Fan-out math

Source ASR cost shared across N targets. MT and TTS cost multiplied by N. 4 languages roughly 3× the per-minute cost of 1 language (not 4×).
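
The fan-out math above, worked through as a small sketch. The per-minute rates are illustrative assumptions chosen so that ASR is about a third of the single-language cost; they are not vendor quotes.

```python
def per_minute_cost(n_targets, asr_rate, mt_tts_rate):
    """One shared ASR stream plus N parallel MT + TTS chains."""
    if n_targets < 1:
        raise ValueError("need at least one target language")
    return asr_rate + n_targets * mt_tts_rate


# Assumed USD-per-minute rates for illustration only.
ASR_RATE = 0.01
MT_TTS_RATE = 0.02

one_lang = per_minute_cost(1, ASR_RATE, MT_TTS_RATE)
four_langs = per_minute_cost(4, ASR_RATE, MT_TTS_RATE)
ratio = four_langs / one_lang
```

With these rates the ratio comes out at about 3×, matching the rule of thumb. The ratio depends on the ASR share: the cheaper ASR is relative to MT + TTS, the closer four languages get to a full 4×.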

Autoscale strategy

Pre-warmed worker pools per region. Scale on dispatch queue depth. Per-language TTS pools sized by historical traffic.

Failure mode at this tier

One target language's TTS pool exhausting while others have headroom. Mitigation: per-language pool autoscale, not aggregate.

Fora Soft shipped at this tier

Custom business meeting translation deployments. Sales calls and B2B meetings with cloned rep voice in target language.

Tier 03 · 50-500 participants · mid-size conference
4-8 language pairs · regional SFU cascade

Mid-size conferences with regional SFU cascade.

Cascaded translation with regional SFU cascade. ASR / MT / TTS workers autoscaled on dispatch queue depth. Per-language audio tracks generated server-side. Listeners subscribe to their preferred language.

Architecture

Cascaded · SFU cascade

Regional workers, autoscaled

Latency target

1.5-3.0 s

Conference tolerance

Concurrency

200-4,000 streams

50-500 × 4-8 langs

Vendor stack

LiveKit SFU regional cascade · Deepgram Nova-3 · DeepL API Pro with session glossaries · ElevenLabs Multilingual v2.

Worker pool sizing

~500 active translation streams per c5.4xlarge equivalent. Multi-region for cross-continent participants. Anycast routing.

Autoscale strategy

Pre-warmed pools per region per language. HPA on dispatch queue depth × language. Scale-up trigger at 70% capacity.
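
A sizing sketch that combines the ~500 streams per worker figure with the 70% scale-up trigger. Reading the trigger as "keep each worker at or below 70% of its stream capacity" is an assumption, and `workers_needed` is a hypothetical helper.

```python
def workers_needed(active_streams, streams_per_worker=500, scale_up_pct=70):
    """Workers required to keep utilization at or below the trigger.

    Integer math avoids float rounding at exact boundaries.
    """
    if active_streams < 0:
        raise ValueError("active_streams must be non-negative")
    usable = streams_per_worker * scale_up_pct // 100  # 350 by default
    # Ceiling division; always keep at least one warm worker.
    return max(1, -(-active_streams // usable))


low_end = workers_needed(200)     # 50 participants x 4 languages
high_end = workers_needed(4_000)  # 500 participants x 8 languages
```

At the top of this tier that works out to 12 workers before any pre-warm margin, which is why the pools are sized per region and per language rather than as one aggregate.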

Failure mode at this tier

Single-region collapse on cross-continent panels. Cold-start latency spikes during keynote opening. Mitigation: pre-warmed pools, regional anycast.

Fora Soft shipped at this tier

Rafiky conference interpretation platform with both AI-first translation and routed human interpreter sessions.

Tier 04 · 500-5,000 participants · large event
6-8 languages · WebRTC + LL-HLS hybrid

Large events with WebRTC + LL-HLS hybrid delivery.

WebRTC SFU for talent and panel. LL-HLS broadcast layer for the rest of the audience. Cascaded translation taps off the WebRTC stream. Per-language LL-HLS audio tracks for listener subscription via CDN edge cache.

Architecture

Hybrid · WebRTC + LL-HLS

Talent layer + broadcast layer

Latency target

2-4 s

LL-HLS broadcast budget

Concurrency

3K-40K streams

500-5K × 6-8 langs

Vendor stack

LiveKit SFU multi-region · Deepgram Nova-3 · DeepL API + custom MT · ElevenLabs Flash for fast streaming TTS · CDN edge cache for LL-HLS fan-out.

Per-language fan-out

Translation generates per-language audio tracks. CDN distributes. Listener picks language client-side. CDN egress dominates cost above 5K listeners.

Autoscale strategy

Pre-warmed translation worker pools sized for keynote peak. CDN auto-handles broadcast fan-out. Translation cost scales with active languages, not viewers.

Failure mode at this tier

CDN cache miss storm at event start. Audio drift between WebRTC talent and LL-HLS broadcast layers. Fix: pre-warm CDN edges; explicit timing-source sync.

Fora Soft shipped at this tier

Conference interpretation at TED-style events. KUDO and Interprefy default tier.

Tier 05 · 5,000-25,000 participants · major virtual event
6+ languages · Cloud + self-host hybrid

Major virtual events with Cloud + self-host hybrid deployment.

Public broadcast on Cloud SFU autoscale absorbs traffic spikes. Private speaker tracks run on self-hosted infrastructure for control over voice-cloning layer. Per-language audio fan-out through CDN.

Architecture

Hybrid · Cloud + self-host

Public + private tracks

Latency target

Sub-3 s

Black Hat 2025 delivered

Concurrency

30K-200K streams

5K-25K × 6+ langs

Vendor stack

LiveKit Cloud + self-host hybrid · streaming Deepgram · DeepL Voice + custom MT · ElevenLabs Flash · custom moderation pipeline · multi-CDN failover.

Peak load math

22K listeners × 6 languages = 132K simultaneous translation streams at keynote peaks. CDN egress sustained at 5-10 Gbps for audio fan-out.

Autoscale strategy

Cloud handles public broadcast autoscale. Self-hosted reserved capacity for speakers. Pre-warmed pools 2× expected peak. Standby workers held in reserve for surge.

Failure mode at this tier

A regional CDN outage at keynote peak. Translation worker pool not pre-warmed sufficiently for opening spike. Fix: multi-CDN with anycast, 2× pre-warmed pools.

Fora Soft shipped at this tier

VOLO.live at Black Hat USA 2025 — 22,000 participants, 6 languages, sub-3-second end-to-end latency throughout the event. Hybrid Cloud + self-host deployment.

Tier 06 · 25,000+ participants · broadcast scale
On-demand language switching · WebRTC + MoQ

Broadcast scale with on-demand language switching.

WebRTC for the talent layer. LL-HLS or MoQ for broadcast fan-out. Translation generates per-language audio tracks. CDN edge cache. Listener picks language client-side. Translation cost scales linearly with active languages, not viewers.

Architecture

WebRTC + MoQ / LL-HLS

Talent + global fan-out

Latency target

2-5 s

Broadcast tier

Concurrency

25K to millions

CDN egress dominates cost

Vendor stack

Custom WebRTC SFU + Deepgram + DeepL/NLLB + ElevenLabs/Cartesia · MoQ (Media over QUIC) for sub-500 ms broadcast · LL-HLS via Cloudflare Stream or AWS Elemental MediaPackage · multi-CDN failover.

Cost shape

CDN egress dominates. 100K concurrent at 64 kbps audio per language ≈ 6.4 Gb/s sustained per language. Translation cost scales with active languages (6-12 typical), not viewer count.
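
The egress figure above as a short calculation. The helper names are hypothetical, and the sketch counts audio payload only: container and HTTP overhead are ignored.

```python
def egress_gbps(listeners, bitrate_kbps=64):
    """Sustained egress in gigabits per second for one audio track."""
    return listeners * bitrate_kbps / 1_000_000


def transfer_gb_per_hour(gbps):
    """Gigabytes moved per hour at a sustained gigabit-per-second rate."""
    return gbps * 3600 / 8


per_language = egress_gbps(100_000)           # 100K listeners on one track
hourly = transfer_gb_per_hour(per_language)   # volume the CDN bills for
```

At 100K listeners on a single 64 kbps track this gives 6.4 Gb/s, or roughly 2,880 GB per hour. The translation workers behind that track stay the same size whether 1K or 100K listeners subscribe.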

Autoscale strategy

Translation worker count fixed by language count. CDN autoscales broadcast fan-out. Multi-CDN with anycast routing for global delivery.

Failure mode at this tier

CDN regional outage at major event. Cache-miss storms at keynote start. Audio drift across language tracks. Mitigation: multi-CDN, pre-warmed edges, explicit timing-source sync.

Fora Soft shipped at this tier

Olympic-scale broadcasts. KUDO and Interprefy enterprise tier reference architecture.

Scale is the first decision. The vendor choice (Deepgram vs Whisper vs AssemblyAI, DeepL vs NLLB, ElevenLabs vs Cartesia) matters less than tier — most production vendors handle Tier 02 through Tier 04 well at the right tuning. The production reality is that most platforms ship per-tier autoscale logic that picks workers per session shape. Fora Soft has shipped at every tier above.

Production examples

Three shipped real-time speech translation systems.

Telehealth interpretation across 16+ language pairs. Live multilingual translation at Black Hat USA 2025. Conference interpretation platform mixing AI and human interpreters. Three production builds running today across very different shapes.

Translinguist · 16+ language pairs

Telehealth, legal, live events.

Architecture. Cascaded ASR → MT → TTS with voice cloning at the TTS layer.

Outcome. 16+ language pairs in production. Sub-two-second cascaded end-to-end latency. Voice cloning preserves speaker identity. The HIPAA-compliant BAA chain runs from the cloud provider through every model vendor. PHI redaction between ASR and MT stages prevents identifiers from being persisted in translation logs.

Cascaded · HIPAA · Voice cloning
VOLO.live · Black Hat USA 2025

22,000 participants in six languages.

Architecture. Hybrid Cloud + self-host. Cascaded translation with conference-scale autoscale on the public broadcast tier. Self-hosted speaker tracks for control over voice cloning.

Outcome. 22,000+ participants at Black Hat USA 2025. Six-language live translation. Sub-three-second end-to-end latency at peak load. Scale shape: 22K listeners × 6 languages = 132K simultaneous translation streams at keynote peaks. Per-language audio tracks generated server-side and distributed via CDN edge cache.

22K participants · 6 languages · Hybrid deployment
Rafiky · conference interpretation

Hybrid AI + human interpreters.

Architecture. Cascaded ASR → MT → TTS with human-interpreter fallback option for high-stakes sessions.

Outcome. Production conference interpretation platform serving multi-language events at scale. Architecture mirrors KUDO and Interprefy in shape: cascaded AI for routine multilingual broadcast, human interpreters bookable on demand for high-stakes sessions. The mixed AI / human model is the dominant 2026 pattern for premium conference work.

Hybrid AI/human · Conference · Routed marketplace
Decision framework

Build, buy, or hybrid — when each one wins.

Three architectural paths for shipping a real-time speech translation product. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether translation is the product or a feature of another product. Still deciding whether you need AI at all, or a human translator, or a human interpreter? Read the 2026 decision tree first. The framework below picks up after that decision.

Cost ranges are 2026-indicative. Implementation specifics — concurrency target, language pair count, compliance scope, voice cloning vs synthetic voices, custom glossary depth — dominate the spread within each tier.

A custom real-time speech translation system costs $20K–$150K to build over 1–4 months. KUDO and Interprefy ship event-ready in days. Hybrid (AI for scale, human for stakes) is the dominant 2026 pattern for premium conference and event work.
FAQ

Twelve questions every real-time translation architecture review covers.

What is real-time speech translation?

Real-time speech translation converts spoken audio in one language into spoken audio or text in another with low enough latency for natural two-way conversation. Two architectures dominate in 2026: cascaded (ASR → MT → TTS, three sequential models, 1.2-2.5 s end-to-end) and end-to-end speech-to-speech (Meta SeamlessM4T v2, DeepL Voice, 200-700 ms end-to-end). Cascaded is the production default for 85% of deployments. End-to-end is rising for premium consumer and conference use cases.

How does live video translation work?

A live video translation system captures audio from a WebRTC video session, runs streaming ASR to produce a transcript, translates the transcript with a machine-translation model, generates target-language speech with text-to-speech, and publishes the translated audio back as a parallel audio track. Listeners subscribe to their preferred language. Total round-trip typically lands at 1.2 to 3.0 seconds.

What is the difference between cascaded and end-to-end speech translation?

Cascaded uses three separate models: ASR turns speech into text, MT translates the text, TTS turns translated text back into speech. End-to-end uses one model that takes source audio in and emits target audio out directly. Cascaded offers vendor flexibility, per-stage observability, and PII redaction between stages, at the cost of higher latency. End-to-end offers sub-second latency and prosody preservation, at the cost of vendor lock-in, weaker observability, and lower accuracy on technical vocabulary.

How accurate is real-time speech translation?

Accuracy depends on language pair and architecture. For top pairs (English ↔ Spanish, French, German, Mandarin), cascaded systems achieve BLEU scores of 38 to 45, near-human on conversational content. For medium-resource pairs (English ↔ Vietnamese, Turkish, Hindi), BLEU drops to 28 to 37. For low-resource pairs, BLEU sits below 28 without domain fine-tuning. Streaming ASR adds 5 to 8 percentage points of word error rate (WER), and those recognition errors carry through to the translation. Plan for 10 to 20 percent perceived translation quality degradation versus an oracle text-in text-out baseline.

What is the latency of real-time speech translation?

Glass-to-glass latency targets in 2026: cascaded production median 1.4 to 1.7 seconds. End-to-end S2ST 400 to 700 ms. Below 700 ms feels real-time. 700 ms to 1.5 s is conversationally acceptable. 1.5 to 3 s is visibly lagged. Above 3 s feels broken. The biggest latency lever is streaming at every stage. The second biggest is co-located regions for SFU, ASR, MT, and TTS workers.
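
The perception bands above can be encoded directly, for example to label measurements on a dashboard. The function name and the handling of exact boundary values (1.5 s counted as acceptable, 3 s as lagged) are assumptions; the source does not say which side the boundaries fall on.

```python
def classify_latency(latency_ms):
    """Map a glass-to-glass latency in milliseconds to a perception band."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 700:
        return "real-time"
    if latency_ms <= 1500:
        return "conversationally acceptable"
    if latency_ms <= 3000:
        return "visibly lagged"
    return "broken"


cascaded_median = classify_latency(1550)  # middle of the 1.4 to 1.7 s range
s2st_typical = classify_latency(550)      # middle of the 400 to 700 ms range
```

By these bands the cascaded median sits right at the edge between acceptable and lagged, which is why streaming at every stage is the first lever to pull.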

Which vendors offer real-time translation APIs?

ASR: Deepgram (Nova-3, Flux), AssemblyAI Universal-Streaming, OpenAI Realtime STT, Whisper Large-v3, Azure, Google Cloud Chirp 2, Speechmatics, NVIDIA Canary. MT: DeepL API, Meta NLLB-3.3B, Microsoft Translator, Google Cloud Translation, GPT-4o-mini and Claude Haiku (prompted), Amazon Translate. TTS: ElevenLabs Multilingual v2, Cartesia Sonic 3, Azure Neural, OpenAI TTS, Google Cloud TTS, Deepgram Aura, PlayHT. End-to-end: Meta SeamlessM4T v2, DeepL Voice, Google Translatotron 3, OpenAI Realtime API multilingual, Gemini Live.

How much does real-time speech translation cost?

Cascaded per-minute cost ranges from $0.03 (budget self-host stack) to $0.17 (premium with voice cloning), single language pair. Each additional target language adds $0.03 to $0.15 per minute. End-to-end speech-to-speech self-hosted runs $0.015 to $0.040 per minute. Conference platform pricing (KUDO, Interprefy) runs $5K to $50K per major event. Custom build costs $20K to $150K depending on scope. Ongoing operations $2K to $10K per month.

What is voice cloning and when do I need it?

Voice cloning preserves the source speaker's voice characteristics in the target language. Listeners hear their own doctor or speaker speak their language directly rather than a generic synthetic voice. ElevenLabs Instant Voice Cloning (zero-shot, 30 seconds of audio), ElevenLabs Professional Voice Cloning (trained, 30+ minutes of audio), and Meta SeamlessM4T v2 Expressive all support voice preservation. Use it for premium 1:1 telehealth and consumer products. Skip it for evidentiary recordings, regulated workflows where synthetic voices are mandated, or when source consent is unclear.

How do I make a real-time translation system HIPAA-compliant?

Four layers. (1) Self-host on BAA-able infrastructure. (2) Use BAA-compliant ASR (Deepgram Enterprise, AssemblyAI Enterprise, Whisper self-host), BAA-compliant MT (Azure Translator, AWS Translate, NLLB-3.3B self-host), BAA-compliant TTS (ElevenLabs Enterprise, Cartesia Enterprise). (3) PHI redaction between ASR and MT stages. (4) Audit logging, encrypted storage with six-year retention, RBAC, automatic session termination. Fora Soft has shipped HIPAA-compliant translation deployments (Translinguist).

What does the EU AI Act require for my translation system?

From 2 August 2026, translation systems serving EU users must: (1) Disclose AI translation at the start of every session, (2) Log the interaction with timestamps, source and target language, and model versions, (3) Document foundation-model provider compliance, (4) Report on bias and accuracy for high-risk uses. Engineering effort: 4 to 8 engineer-weeks for a standard-risk use case.

Can AI translation replace human interpreters?

No, and yes. AI undercuts human interpreters on cost from the first hour and matches them on quality for routine multilingual broadcast in top language pairs. Human interpreters remain mandatory in 2026 for certified court-of-record proceedings, high-stakes medical procedures, top-tier conference keynotes, regulated immigration and asylum hearings, and most jurisdictions' legal requirements. The dominant 2026 pattern is hybrid: AI for scale, human for stakes, routed per session by KUDO, Interprefy, Translinguist, and Rafiky-style platforms.

Why does my live translation feel laggy when the vendor claims sub-second latency?

Vendor latency claims usually measure one stage in isolation. End-to-end production latency stacks all stages plus network: audio transport, VAD, ASR, ASR commit, MT, TTS, outbound transport, playout. A cascaded production pipeline lands at 1.2 to 2.5 seconds end-to-end even when each stage hits its individual sub-second mark. Architecture choice and re-translation discipline matter more than any vendor's marketing-quoted latency.
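
A sketch of how the stages stack. The stage list follows the answer above; every millisecond value is an illustrative assumption, not a measurement. Each stage is individually well under a second, yet the sum is not.

```python
# Assumed per-stage latencies in milliseconds, for illustration only.
STAGE_BUDGET_MS = {
    "audio_transport": 60,
    "vad": 200,
    "asr": 300,
    "asr_commit": 250,
    "mt": 200,
    "tts": 250,
    "outbound_transport": 60,
    "playout": 100,
}


def end_to_end_ms(stages):
    """Total latency when stages run strictly one after another."""
    return sum(stages.values())


def slowest_stage(stages):
    """Name of the stage contributing the most latency."""
    return max(stages, key=stages.get)


total = end_to_end_ms(STAGE_BUDGET_MS)
worst = slowest_stage(STAGE_BUDGET_MS)
```

With these assumed values the total is 1,420 ms, inside the 1.2 to 2.5 second range, even though no single stage exceeds 300 ms. Tracking the per-stage breakdown rather than only the total shows which stage to tune first.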

Where this guide goes deeper

Connected guides and references.

Each piece below picks up where this pillar ends. The decision tree if you have not picked AI yet. The vendor synthesis if you have. The engineering playbook if you are building. The streaming engineering guide if your translation rides on a live stream.

Have a specific translation architecture question?

Engineer-to-engineer review on the first call.

If you are scoping a real-time speech translation system and want a second opinion on cascaded-versus-end-to-end, the vendor stack, language-pair complexity, the voice-cloning consent shape, or the EU AI Act compliance approach, write us. A senior engineer with shipped translation platforms in production replies within 24 hours.

Specialist software house for video, real-time and AI products. Founded 2005. 50 in-house engineers.

+1 (914) 775-5855
New York · USA
© Fora Soft, 2005–2026