Published 2026-06-01 · 18 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
Localization used to be a budget line you approved once a year and a vendor you emailed; it is now a feature you ship inside your own product, and someone on your team has to make it work. This lesson is written for the product manager, founder, or engineering lead who needs to add "watch this in Spanish" or "turn on subtitles" to a video platform and wants to understand the moving parts well enough to scope the work, judge a vendor, and avoid the failures that make a localized video worse than no localization at all. It sits downstream of the speech lessons — the streaming ASR lesson and the WhisperX diarization lesson for transcription, the voice cloning lesson and the streaming TTS lesson for the voice — and it assembles those parts into one shippable feature.
Three Ways To Cross A Language Barrier
Before any code, decide which of three products you are building, because they cost different amounts, carry different risks, and serve different viewers. People say "translate my video" and mean one of three very different things.
The lightest is subtitles: the original audio plays untouched, and translated text appears at the bottom of the screen. The viewer hears the real voice and reads the meaning. This is the cheapest to produce, the safest legally — no one's voice is cloned — and the easiest to get right, but it asks the viewer to read, which not every audience will do.
The middle option is voice-over, sometimes called UN-style or lectoring. A synthetic narrator speaks the translation over the top while the original audio is lowered (ducked) but still faintly audible underneath. The viewer hears the new language without having to read, and the lips on screen no longer match — but for documentaries, news, training videos, and explainer content, nobody minds, because the speaker's mouth was never the point.
The heaviest is a full dub: the original voice is removed and replaced with a translated voice, ideally cloned to sound like the original speaker, and ideally timed and lip-synced so the mouth movements line up. This is the cinema-grade option. It is the most expensive, the most error-prone, and the one that triggers the law, because a cloned voice of a real person is a deepfake.
The mistake teams make is reaching for the full dub by default because it sounds the most impressive, when subtitles or voice-over would serve the audience better at a fraction of the cost and risk. Pick the lightest mode that does the job.
Figure 1. The three localization modes. They look like one feature but are three products with different costs, risks, and audiences.
One Pipeline, Three Outputs
Here is the idea that organizes everything below: all three modes run through the same front half of one pipeline, then branch. The shared front half answers two questions — what was said and what does it mean in the target language — and every mode needs both answers. The branch is only about how the answer is delivered: as text on screen, as a voice over the top, or as a replacement voice.
Walk the stages left to right. A video comes in. Transcription turns its speech into text with a timestamp on every word and a label for who spoke. Translation turns that text into the target language while trying to keep it the right length. Then the pipeline forks. For subtitles, a typesetting stage chops the translation into on-screen cues that obey reading-speed rules and writes a subtitle file. For voice-over and dubbing, a speech stage turns the translation into audio with a text-to-speech or voice-cloning model, a timing stage stretches or rephrases it to fit the original duration, an optional lip-sync stage warps the mouth, and a mux stage combines the new audio track with the video.
The reason to see it as one pipeline is practical: the expensive, accuracy-critical work — recognition and translation — is shared, so you build it once and reuse it across all three products. The cheap-to-describe, easy-to-underestimate work — timing — is where each mode succeeds or fails.
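The shared-front-half idea can be written down as a small lookup. This is a minimal sketch, not a real framework: the stage names are illustrative labels invented here, and `stages_for` is a hypothetical helper.

```python
from typing import List

# Illustrative stage labels only; none of these are functions from a library.
SHARED_STAGES = ("transcribe", "translate")

BRANCH_STAGES = {
    "subtitles": ("typeset_cues", "write_subtitle_file"),
    "voice_over": ("synthesize_speech", "fit_timing", "mux_audio"),
    "dub": ("synthesize_speech", "fit_timing", "lip_sync", "mux_audio"),
}


def stages_for(mode: str, lip_sync: bool = True) -> List[str]:
    """Return the ordered stage list for one localization mode."""
    if mode not in BRANCH_STAGES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Lip-sync is optional, so it can be switched off for a dub.
    branch = [s for s in BRANCH_STAGES[mode] if lip_sync or s != "lip_sync"]
    return list(SHARED_STAGES) + branch


for mode in BRANCH_STAGES:
    print(mode, "->", stages_for(mode))
```

Every mode starts with the same two stages, which is why the recognition and translation work can be built once and reused.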
Figure 2. One pipeline, three outputs. Recognition and translation are shared; only the delivery branch changes.
Stage One: Transcription — Get The Words And The Clock
The pipeline starts by turning speech into text, a job done by an automatic speech recognition model — software that listens to audio and writes down the words, abbreviated ASR. The dominant open model in 2026 is OpenAI's Whisper and its production wrapper WhisperX, which we cover in the WhisperX lesson; managed services such as Deepgram and AssemblyAI cover the same job, and we compare them in the streaming ASR lesson.
For localization, ASR has to produce three things, not one. The obvious output is the words. The two outputs people forget are the clock and the cast. The clock is a timestamp on every word — the moment it starts and ends — because both subtitles and dubbing have to be placed in time, and a transcript without timing is useless for either. The cast is a speaker label — "this sentence was Anna, that one was Boris" — produced by a step called diarization, which matters the moment a video has more than one person talking, because a dub needs a different cloned voice per speaker.
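One way to picture the three outputs is a record per word that carries the text, the clock, and the cast together. The sketch below assumes a hypothetical `Word` type and speaker labels such as `SPEAKER_00`; real ASR and diarization tools each use their own output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    text: str      # the words
    start: float   # the clock: seconds from the start of the video
    end: float
    speaker: str   # the cast: a diarization label (hypothetical format)


def speaker_turns(words: List[Word]) -> List[Tuple[str, float, float, str]]:
    """Merge consecutive words by the same speaker into (speaker, start, end, text)."""
    turns: List[Tuple[str, float, float, str]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker:
            speaker, start, _, text = turns[-1]
            turns[-1] = (speaker, start, w.end, text + " " + w.text)
        else:
            turns.append((w.speaker, w.start, w.end, w.text))
    return turns


words = [
    Word("Hello", 0.00, 0.40, "SPEAKER_00"),
    Word("there", 0.45, 0.80, "SPEAKER_00"),
    Word("Hi", 1.10, 1.30, "SPEAKER_01"),
]
turns = speaker_turns(words)
```

Grouping by speaker turn is what later lets a dub assign one voice per speaker and lets subtitles start a new cue when the speaker changes.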
The quality of this stage is measured by word error rate, abbreviated WER, which counts the mistakes per hundred words. The formula adds up three kinds of error and divides by the number of words that were actually said:
WER = (substitutions + deletions + insertions) / words_in_reference
If a 100-word clip comes back with 3 wrong words, 1 dropped word, and 1 invented word, the arithmetic is:
WER = (3 + 1 + 1) / 100 = 0.05 = 5%
A 5% word error rate sounds small, but every one of those errors flows downstream: a misheard word gets mistranslated and then spoken confidently in the wrong language. The transcription stage is where errors are cheapest to catch and most expensive to ignore, which is why a serious pipeline lets a human correct the transcript before translation, especially for proper names and jargon the model has never heard.
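The WER formula above can be computed with a word-level edit distance, since the minimum number of substitutions, deletions, and insertions is exactly what that distance counts. This is a minimal sketch: the lowercase-and-split normalization is an assumption, and production scoring usually also normalizes punctuation and numerals.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum word edits divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match at cost 0
            ))
        prev = curr
    return prev[-1] / len(ref)


score = wer("the cat sat on the mat", "the cat sit on mat")
print(f"{score:.1%}")
```

In the usage line, one word is substituted and one is dropped out of six reference words, so the score is 2/6.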
Stage Two: Translation — And The Length Problem That Breaks Naive Pipelines
Translation is the stage everyone expects to be hard and that modern models mostly handle well. The trap is somewhere else: length. Languages do not use the same number of syllables to say the same thing. A line that takes three seconds in English can take three and three-quarter seconds in German or Spanish — translation expansion of around 20–30% is normal for those pairs. The translation is correct; it simply does not fit the time the original speaker took to say it.
For subtitles this means a cue that should sit on screen for three seconds now holds more text than a viewer can read in three seconds. For dubbing it is worse: the synthetic voice has to cram a longer sentence into the same slot, so it either speeds up — the chipmunk dub everyone has heard and hated — or it overruns and collides with the next line.
The engineering term for the goal is isochrony: the translated speech should be time-aligned with the original, matching not just the overall length but the pauses and, for lip-sync, the mouth movements. There are two ways to get there, and good pipelines use both.
The first is length-controlled translation: ask the translation model for a version that is not just accurate but close to a target length, measured in characters or estimated speaking time. Research systems such as isochrony-aware machine translation and VideoDubber bake the duration target into the translation step itself, so the model prefers a shorter synonym over a longer one when both are correct. The second is bounded time-stretching: after the voice is generated, stretch or compress the audio to fit — but only within a narrow band, because past roughly ±10–15% the human ear hears the distortion. If the translation is still too long after both, the pipeline shortens the wording rather than speeding the voice past the point of naturalness.
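The stretch-or-rephrase decision described above reduces to one ratio. The sketch below is an assumption-laden illustration: `plan_fit`, the 15% default band, and the 2% "close enough" tolerance are choices made here, and treating a much-too-short line the same as a much-too-long one is a simplification (padding with silence is another option).

```python
from enum import Enum
from typing import Tuple


class Fit(Enum):
    AS_IS = "as_is"        # already fits the slot
    STRETCH = "stretch"    # time-stretch within the tolerated band
    REPHRASE = "rephrase"  # outside the band: change the wording instead


def plan_fit(slot_s: float, speech_s: float,
             max_stretch: float = 0.15,
             tolerance: float = 0.02) -> Tuple[Fit, float]:
    """Return (decision, tempo); tempo > 1 means the audio must be sped up."""
    if slot_s <= 0 or speech_s <= 0:
        raise ValueError("durations must be positive")
    tempo = speech_s / slot_s
    deviation = abs(tempo - 1.0)
    if deviation <= tolerance:
        return Fit.AS_IS, 1.0
    if deviation <= max_stretch:
        return Fit.STRETCH, tempo
    return Fit.REPHRASE, tempo


# A 3.0 s English line whose translation was synthesized at 3.75 s.
decision, tempo = plan_fit(slot_s=3.0, speech_s=3.75)
```

With the example from the text, the tempo is 1.25, which is outside the band, so the line goes back for shorter wording rather than being sped up.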
Translation quality is scored two ways. The old standard is BLEU, which counts how many word sequences the machine output shares with a human reference translation; it is fast and cheap but crude. The 2026 default for anything serious is COMET, a neural metric that was trained to predict human quality judgments and correlates with them far better than BLEU. Use BLEU for a quick regression check and COMET for a real quality gate.
Figure 3. The length problem and its two fixes. Translation expansion is normal; isochrony is the goal; ±10–15% is the time-stretch a listener will tolerate.
Stage Three: The Voice — Synthetic Narrator Or Cloned Speaker
Once you have a translation that fits, you turn it into sound. Two paths exist, and the choice carries both a quality and a legal weight.
The plain path is text-to-speech, abbreviated TTS — software that reads text aloud in a generic but natural voice you pick from a library. This is right for voice-over: a documentary narrator, a training-course presenter, a news read. There is no real person's voice being copied, so there is no consent question. We compare the production TTS engines — ElevenLabs, Kokoro, Cartesia, OpenAI — in the streaming TTS lesson.
The heavier path is voice cloning — capturing the timbre and cadence of the original speaker from a short sample so the dub sounds like them speaking the new language. This is what makes a full dub feel native, and it is also what turns the output into a deepfake of a real person, with the consent and disclosure duties that follow. We go deep on the cloning mechanics and the consent engineering in the voice cloning lesson.
Voice quality is measured by mean opinion score, abbreviated MOS — the average rating listeners give synthetic speech on a one-to-five scale, the subjective method standardized in ITU-T Recommendation P.800. A MOS near 4.0 is the rough line above which most listeners stop noticing the voice is synthetic for narration; lip-synced dialogue is judged more harshly because the eye is watching the mouth.
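As arithmetic, a MOS is just the mean of the listener ratings. The sketch below also reports a rough 95% interval using a normal approximation, which is an assumption added here and is crude for small panels; the function name `mos` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    score = mean(ratings)
    if len(ratings) < 2:
        return score, float("nan")
    # Normal approximation; treat as indicative only for small panels.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


score, half_width = mos([4, 4, 5, 3, 4])
print(f"MOS {score:.2f} +/- {half_width:.2f}")
```

The interval is a reminder that a MOS from a handful of listeners is a noisy number, so a score sitting right at a threshold deserves more ratings before it is trusted.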
The market for tools that package this end to end is crowded in 2026. ElevenLabs Dubbing covers 90-plus languages and bills each target language separately, with a Dubbing Studio that lets you edit the transcript, reassign speakers, and regenerate a single clip. HeyGen's Video Translate spans 175-plus languages with lip-sync, priced at a flat rate per minute on its creator tier. Rask AI covers 130-plus languages and meters per minute, which gets expensive at scale. Google's Aloud, built into YouTube Studio, dubs a narrow set of language pairs for free but without emotion or lip-sync. The right pick depends on language coverage, whether you need lip-sync, and how the per-minute math lands at your volume, which is the same unit-cost discipline we apply in the cost-model lesson.
Auto-Subtitles Done Right: The Format And The Reading-Speed Rules
Subtitles look like the easy mode, and producing wrong ones is indeed easy. Good ones obey two sets of rules: a file format the player understands, and timing rules a human can actually read.
Pick the right format
There are four formats that matter, and they are not interchangeable. The table below is the decision in one place.
| Format | What it is | Where it is used | Styling | Standard status |
|---|---|---|---|---|
| SRT (SubRip) | Plain text: number, timecode, line | Upload boxes, YouTube, editing tools | None | De-facto only; no formal spec |
| WebVTT | SRT's web successor, .vtt | Browsers (HTML <track>), HLS, DASH | Positioning, basic CSS | W3C Candidate Recommendation |
| TTML | XML timed-text, broadcast authoring | TV exchange, mastering | Rich (XML + CSS-like) | W3C Recommendation |
| IMSC1 | Constrained TTML profile for delivery | Broadcast/OTT in fMP4/CMAF | Per-cue colour, position, ruby | W3C Recommendation |
In practice: accept SRT on the way in because everyone has one, deliver WebVTT for web and adaptive streaming, and deliver IMSC1 — the broadcast-grade profile carried inside fragmented MP4 segments, a W3C Recommendation since 2018 — when you ship to OTT or television platforms. The captions-and-formats interplay across a streaming stack is covered in our captions and multi-audio article in the Streaming section.
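The "accept SRT, deliver WebVTT" step is mostly a header and a timecode separator. This is a minimal sketch under stated assumptions: the input is well-formed SRT, the numeric SRT counters are kept because WebVTT allows an optional cue identifier line, and SRT-specific markup such as font tags is not handled. The function name `srt_to_webvtt` is hypothetical.

```python
import re

# SRT writes milliseconds after a comma; WebVTT uses a full stop.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert simple, well-formed SRT text to WebVTT text."""
    body = srt_text.replace("\r\n", "\n").strip("\ufeff\n ")
    lines = []
    for line in body.split("\n"):
        # Only rewrite timing lines, so commas in the dialogue survive.
        if "-->" in line:
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = (
    "1\n00:00:01,000 --> 00:00:04,000\nHola, mundo\n\n"
    "2\n00:00:04,500 --> 00:00:06,000\nAdiós\n"
)
vtt = srt_to_webvtt(srt)
print(vtt)
```

Going the other way, or on to IMSC1, needs a real timed-text library, because those formats carry styling and positioning that a text substitution cannot express.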
Obey the reading-speed math
A subtitle that is technically correct but on screen too briefly is a failed subtitle. The industry codifies this with a reading-speed limit measured in characters per second (CPS), plus line-length and duration bounds. Netflix's widely-copied Timed Text Style Guide sets, for most languages, a maximum of about 17 CPS for adult content and 13 CPS for children's, at most two lines per cue, at most 42 characters per line, a minimum duration of five-sixths of a second, and a maximum of seven seconds.
Turn the reading-speed limit into the rule that drives cue-splitting. If a cue stays on screen for three seconds and your limit is 17 characters per second, the most text it may hold is:
max characters = 3.0 s × 17 CPS = 51 characters
So a translated line of 80 characters cannot live in a three-second cue: the pipeline must either split it across two cues or hold it longer, and if the speech only lasted three seconds, splitting is the only honest option. This is the arithmetic that an auto-subtitle stage runs on every cue — take the words and their timestamps from ASR, group them into cues, and keep splitting until every cue passes the CPS, line-count, line-length, and duration tests. Skip it and you get the auto-captions everyone has seen: a wall of text that vanishes before it can be read.
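The per-cue tests can be sketched as a single checker. The limits below are the ones quoted from the style guide above (17 CPS, two lines, 42 characters, five-sixths of a second, seven seconds); counting characters as the sum of line lengths, spaces included, is an assumption, and the names `max_chars` and `cue_violations` are hypothetical.

```python
from typing import List

MAX_LINES = 2
MAX_LINE_CHARS = 42
MIN_DURATION_S = 5 / 6
MAX_DURATION_S = 7.0


def max_chars(duration_s: float, cps: float = 17.0) -> int:
    """Most characters a cue of this duration may hold at the CPS limit."""
    return int(duration_s * cps)


def cue_violations(lines: List[str], start_s: float, end_s: float,
                   max_cps: float = 17.0) -> List[str]:
    """Return the names of the rules this cue breaks (empty list = passes)."""
    duration = end_s - start_s
    chars = sum(len(line) for line in lines)
    problems = []
    if len(lines) > MAX_LINES:
        problems.append("too_many_lines")
    if any(len(line) > MAX_LINE_CHARS for line in lines):
        problems.append("line_too_long")
    if duration < MIN_DURATION_S:
        problems.append("too_short")
    if duration > MAX_DURATION_S:
        problems.append("too_long")
    if duration > 0 and chars / duration > max_cps:
        problems.append("too_fast")
    return problems


print(max_chars(3.0))
print(cue_violations(["x" * 40, "y" * 40], 0.0, 3.0))
```

An 80-character cue held for three seconds runs at about 26.7 CPS, so the checker flags it as too fast even though each line fits the 42-character width.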
Figure 4. Subtitle formats and the reading-speed rule. Accept SRT, deliver WebVTT for web and IMSC1 for broadcast, and split every cue to the CPS limit.
Common mistake: shipping the raw ASR transcript as subtitles. The single most common localization failure is taking the timestamped transcript straight from the speech model and saving it as an SRT. The result violates every readability rule at once — cues that are too long, too fast, broken at the wrong place, and split mid-clause. ASR gives you words and times; it does not give you subtitles. The typesetting stage that turns one into the other — enforcing CPS, two-line maximum, 42-character lines, and clause-aware line breaks — is not optional polish. It is the difference between a caption and a readable caption, and it is exactly the work cheap "auto-subtitle" features skip.
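The line-length and two-line rules can be approximated with the standard library. This sketch only handles the layout half of typesetting: it wraps at word boundaries, not clause boundaries, and it does not assign timing, which would come from the word timestamps. The function name `typeset` is hypothetical.

```python
import textwrap
from typing import List


def typeset(text: str, width: int = 42, max_lines: int = 2) -> List[List[str]]:
    """Wrap text to the line width, then group lines into cue-sized blocks."""
    lines = textwrap.wrap(text, width=width)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


text = ("The typesetting stage turns a timestamped transcript "
        "into cues a person can actually read.")
groups = typeset(text)
for group in groups:
    print("\n".join(group))
    print()
```

A clause-aware version would prefer to break after punctuation or before a conjunction, which is the part of the job that a plain width-based wrap cannot do.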
The Quality Gate: Where Each Stage Can Fail Quietly
Every stage of this pipeline can fail in a way that looks fine to the machine and wrong to a human, so a localization feature needs an automated quality gate with a number at each stage, the same fail-closed discipline we lay out in the generative-video gates lesson. Set a word-error-rate ceiling on transcription, a COMET floor on translation, a reading-speed check on every subtitle cue, and a MOS or duration-fit check on generated audio. When a clip fails a threshold, route it to a human rather than shipping it — because a confidently wrong dub in a language your team does not speak is invisible to you and obvious to your audience. The cheapest insurance is a bilingual human spot-check on a sample of output, sized to your risk tolerance, with the whole sample reviewed for anything in a regulated vertical like telemedicine or e-learning.
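A fail-closed gate of this kind can be sketched in a few lines. Every threshold value below is a placeholder assumption to be tuned per product, not a recommendation, and the type and function names are hypothetical. WER in particular needs a human reference transcript, so in practice it would come from a reviewed sample.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values; tune against your own reviewed samples.
    max_wer: float = 0.05
    min_comet: float = 0.80
    max_cps: float = 17.0
    max_stretch: float = 0.15


@dataclass(frozen=True)
class ClipMetrics:
    wer: Optional[float]
    comet: Optional[float]
    worst_cps: Optional[float]
    stretch: Optional[float]  # abs(tempo - 1.0) applied to the audio


def gate(m: ClipMetrics, t: Thresholds = Thresholds()) -> Tuple[str, List[str]]:
    """Return ('ship', []) or ('human_review', [names of failed checks])."""
    checks = [
        ("wer", m.wer, lambda v: v <= t.max_wer),
        ("comet", m.comet, lambda v: v >= t.min_comet),
        ("cps", m.worst_cps, lambda v: v <= t.max_cps),
        ("stretch", m.stretch, lambda v: v <= t.max_stretch),
    ]
    # Fail closed: a missing metric counts as a failure.
    failures = [name for name, value, ok in checks
                if value is None or not ok(value)]
    return ("human_review", failures) if failures else ("ship", [])


print(gate(ClipMetrics(wer=0.03, comet=0.85, worst_cps=15.0, stretch=0.08)))
print(gate(ClipMetrics(wer=0.03, comet=None, worst_cps=21.0, stretch=0.08)))
```

Treating a missing score as a failure is the design choice that matters here: a clip that was never measured should not be able to ship by default.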
The Law: Disclosure For Voices, Accessibility For Captions
Two distinct bodies of law touch this pipeline, and they pull in opposite directions — one says tell people it is synthetic, the other says make it available to everyone.
Cloned voices must be disclosed
A dub that clones a real person's voice produces synthetic audio of a real human — an audio deepfake. From 2 August 2026, the European Union's AI Act, Article 50, makes two demands that land here. Providers of systems generating synthetic audio must mark the output as artificially generated in a machine-readable form. Deployers who generate a deepfake — content of a real person that would falsely appear authentic — must disclose to viewers that it is artificial. Breaching these transparency rules sits in a penalty tier reaching 15 million euros or 3% of worldwide annual turnover. The practical engineering follows the same three-layer pattern as generative video — a machine-readable mark, a record, and a visible label — and it starts before generation, with consent: you capture the original speaker's permission to clone their voice and store that record, the discipline we detail in the voice cloning and consent lesson. Generic TTS voice-over carries none of this weight, because no real person is being imitated — another reason to choose the lightest mode that works.
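The consent-first ordering can be enforced in code before any cloning happens. The sketch below is an engineering illustration, not legal advice: the `ConsentRecord` fields, the per-language consent scope, and the label wording are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    record_id: str               # where the stored permission can be found
    languages: Tuple[str, ...]   # assumed: consent is scoped per target language


def release_plan(speaker_id: str, target_language: str,
                 consent: Optional[ConsentRecord]) -> Dict[str, object]:
    """Fail closed: without matching consent, refuse to plan a cloned dub."""
    if consent is None or consent.speaker_id != speaker_id:
        raise PermissionError("no consent record for this speaker")
    if target_language not in consent.languages:
        raise PermissionError("consent does not cover this language")
    return {
        # Layer 1: flag that the output must carry a machine-readable mark.
        "machine_readable_mark": True,
        # Layer 2: the record that ties this output to the stored consent.
        "audit_record": {
            "consent_record_id": consent.record_id,
            "speaker_id": speaker_id,
            "language": target_language,
        },
        # Layer 3: the label shown to viewers (wording is a placeholder).
        "visible_label": "Audio dubbed with an AI-generated voice",
    }


consent = ConsentRecord("anna", "consent-001", ("es",))
plan = release_plan("anna", "es", consent)
```

Raising an error instead of returning a flag means the cloning step cannot run by accident when the consent lookup comes back empty.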
Captions are an accessibility obligation
While disclosure law governs synthetic voices, accessibility law governs subtitles and captions — and it is older and well enforced. In the United States, the FCC sets four quality standards for closed captions on covered video: accuracy (the captions match the words and convey tone), synchronicity (they line up with the audio and stay readable), completeness (they run start to finish), and placement (they do not cover faces or on-screen text). The Twenty-First Century Communications and Video Accessibility Act, the CVAA, extends this to online video that previously aired on television, requiring captions "of at least the same quality" as the broadcast. The Web Content Accessibility Guidelines, WCAG, treat captions for prerecorded video as a baseline requirement for an accessible site. The point for your pipeline: the reading-speed and placement rules above are not just craft, they are the measurable core of a legal standard, and an auto-caption feature that ignores them can fail compliance as well as readability.
Where Fora Soft Fits In
We build the video products that localization features live inside — video conferencing, OTT and Internet TV platforms, e-learning systems, and telemedicine apps — and we have wired these pipelines into them since long before the AI versions existed. When a client wants their e-learning courses in eight languages, or their OTT catalog subtitled to a broadcast spec, or a conferencing product with live translated captions, the model is the easy part; the work is the timing fit, the reading-speed-correct subtitle stage, the speaker-aware dubbing, and the disclosure and accessibility layers that keep the feature lawful in each market. We treat the choice between subtitles, voice-over, and full dub as a product decision made per vertical and per budget, not a default — because in regulated verticals, the lightest compliant mode is usually the right one.
What To Read Next
- Streaming ASR in production — Deepgram, Whisper, AssemblyAI
- ElevenLabs pricing, voice cloning, and NO FAKES Act consent engineering
- Real-time multilingual speech translation in calls
Talk To Us / See Our Work / Download
- Talk to a video engineer — bring us your localization feature and we will design the pipeline, pick the right mode per market, and build the quality, disclosure, and accessibility gates. Book a 30-minute scoping call.
- See our case studies — video conferencing, OTT, e-learning, and telemedicine products we have shipped since 2005. View the portfolio.
- Download the Video Localization Pipeline Checklist — the three-mode decision, the six pipeline stages, the subtitle timing rules, the quality-gate thresholds, and the disclosure and accessibility obligations, on one page. Download the PDF.
References
- W3C — TTML Profiles for Internet Media Subtitles and Captions 1.1 (IMSC1.1), W3C Recommendation (text and image profiles; per-cue colour, positioning, ruby; carried in fMP4/CMAF; IMSC1.0.1 became a Recommendation in 2018, IMSC1.1 in 2020). https://www.w3.org/TR/ttml-imsc1.1/ — accessed 2026-06-01. Tier 1 (W3C Recommendation).
- W3C — WebVTT: The Web Video Text Tracks Format (cue syntax derived from SRT, positioning and styling, used by HTML <track>, HLS, and DASH; Candidate Recommendation on the Recommendation track). https://www.w3.org/TR/webvtt1/ — accessed 2026-06-01. Tier 1 (W3C Candidate Recommendation).
- W3C — Timed Text Markup Language (TTML) Profiles overview (TTML1/TTML2 as the broadcast authoring and exchange format; relationship to IMSC and WebVTT). https://www.w3.org/AudioVideo/TT/docs/TTML-Profiles.html — accessed 2026-06-01. Tier 1 (W3C).
- EU Artificial Intelligence Act — Article 50, Transparency Obligations (§2 machine-readable marking of synthetic audio/image/video/text by providers; §4 deepfake disclosure by deployers, including audio; in force 2 Aug 2026 per Article 113). https://artificialintelligenceact.eu/article/50/ — accessed 2026-06-01. Tier 1 (primary law; Regulation (EU) 2024/1689).
- ITU-T Recommendation P.800 — Methods for subjective determination of transmission quality (Mean Opinion Score / MOS; 1–5 absolute category rating used to score synthetic and transmitted speech quality). https://www.itu.int/rec/T-REC-P.800 — accessed 2026-06-01. Tier 1 (ITU-T standard).
- Netflix Partner Help Center — Timed Text Style Guide: General Requirements and Subtitle Timing Guidelines (reading speed ~17 CPS adult / 13 CPS children, max 42 characters per line, max two lines per cue, min 5/6 s and max 7 s per event). https://partnerhelp.netflixstudios.com/hc/en-us/articles/215758617-Timed-Text-Style-Guide-General-Requirements — accessed 2026-06-01. Tier 4 (industry deployer style guide; de-facto subtitle craft standard).
- US FCC — Closed Captioning Quality Standards for Television Programming (four factors: accuracy, synchronicity, completeness, placement) and CVAA (21st Century Communications and Video Accessibility Act of 2010) requiring online re-runs of TV captioned "at least the same quality"; caption-settings accessibility deadline 17 Aug 2026. https://www.fcc.gov/general/closed-captioning-video-programming-television — accessed 2026-06-01. Tier 1 (regulator).
- Effa et al. / Apple Machine Learning — "Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing" (arXiv 2302.12979): isochrony as joint optimization of translation and speech duration rather than length control then rate adjustment. https://arxiv.org/abs/2302.12979 — accessed 2026-06-01. Tier 5 (peer-reviewed/academic).
- Wu et al. — "VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing" (arXiv 2211.16934): speech-duration-aware length control inside the translation model to match source and target speech length. https://arxiv.org/abs/2211.16934 — accessed 2026-06-01. Tier 5 (academic).
- RWS — "AI dubbing in 2026: the complete guide" and corroborating 2026 cost reporting (AI dubbing ~$1–20/min vs human studio ~$20–40/min and $5,000–15,000 per content-hour per language; ~90% cost reduction; minutes-to-hours vs weeks-to-months turnaround). https://www.rws.com/blog/ai-dubbing-in-2026/ — accessed 2026-06-01. Tier 6 (vendor/industry analysis; year-labelled, directional cost figures).
- ElevenLabs — Dubbing capability and pricing documentation (90+ languages, per-target-language billing, Dubbing Studio transcript editing / speaker reassignment / per-clip regeneration). https://elevenlabs.io/docs/overview/capabilities/dubbing — accessed 2026-06-01. Tier 4 (vendor primary).


