Published 2026-06-01 · 18 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Localization used to be a budget line you approved once a year and a vendor you emailed; it is now a feature you ship inside your own product, and someone on your team has to make it work. This lesson is written for the product manager, founder, or engineering lead who needs to add "watch this in Spanish" or "turn on subtitles" to a video platform and wants to understand the moving parts well enough to scope the work, judge a vendor, and avoid the failures that make a localized video worse than no localization at all. It sits downstream of the speech lessons — the streaming ASR lesson and the WhisperX diarization lesson for transcription, the voice cloning lesson and the streaming TTS lesson for the voice — and it assembles those parts into one shippable feature.

Three Ways To Cross A Language Barrier

Before any code, decide which of three products you are building, because they cost different amounts, carry different risks, and serve different viewers. People say "translate my video" and mean one of three very different things.

The lightest is subtitles: the original audio plays untouched, and translated text appears at the bottom of the screen. The viewer hears the real voice and reads the meaning. This is the cheapest to produce, the safest legally — no one's voice is cloned — and the easiest to get right, but it asks the viewer to read, which not every audience will do.

The middle option is voice-over, sometimes called UN-style or lectoring. A synthetic narrator speaks the translation over the top while the original audio is lowered (ducked) but still faintly audible underneath. The viewer hears the new language without having to read, and the lips on screen no longer match — but for documentaries, news, training videos, and explainer content, nobody minds, because the speaker's mouth was never the point.

The heaviest is a full dub: the original voice is removed and replaced with a translated voice, ideally cloned to sound like the original speaker, and ideally timed and lip-synced so the mouth movements line up. This is the cinema-grade option. It is the most expensive, the most error-prone, and the one that triggers the law, because a cloned voice of a real person is a deepfake.

The mistake teams make is reaching for the full dub by default because it sounds the most impressive, when subtitles or voice-over would serve the audience better at a fraction of the cost and risk. Pick the lightest mode that does the job.

Comparison diagram of the three localization modes. Three labelled columns — Subtitles, Voice-over, Full dub — compared across rows: what the viewer hears, lip-sync, relative cost, legal risk, and best use. Subtitles: original voice plus on-screen text, no lip-sync issue, lowest cost, lowest risk, best for budget and reading audiences. Voice-over: synthetic narrator over ducked original, lips do not match but it does not matter, medium cost, medium risk, best for documentary, news, training. Full dub: replaced cloned voice with lip-sync, highest cost, highest risk because a cloned real voice is a deepfake, best for film and flagship content. A footer note advises picking the lightest mode that does the job. Figure 1. The three localization modes. They look like one feature but are three products with different costs, risks, and audiences.

One Pipeline, Three Outputs

Here is the idea that organizes everything below: all three modes run through the same front half of one pipeline, then branch. The shared front half answers two questions — what was said and what does it mean in the target language — and every mode needs both answers. The branch is only about how the answer is delivered: as text on screen, as a voice over the top, or as a replacement voice.

Walk the stages left to right. A video comes in. Transcription turns its speech into text with a timestamp on every word and a label for who spoke. Translation turns that text into the target language while trying to keep it the right length. Then the pipeline forks. For subtitles, a typesetting stage chops the translation into on-screen cues that obey reading-speed rules and writes a subtitle file. For voice-over and dubbing, a speech stage turns the translation into audio with a text-to-speech or voice-cloning model, a timing stage stretches or rephrases it to fit the original duration, an optional lip-sync stage warps the mouth, and a mux stage lays the new audio back onto the video.

The reason to see it as one pipeline is practical: the expensive, accuracy-critical work — recognition and translation — is shared, so you build it once and reuse it across all three products. The cheap-to-describe, easy-to-underestimate work — timing — is where each mode succeeds or fails.
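
One way to picture the shared spine and the three branches is as plain data. The sketch below is illustrative only: the stage names and the `stages_for` helper are made up for this example and simply mirror the stage order described above.

```python
from enum import Enum


class Mode(Enum):
    SUBTITLES = "subtitles"
    VOICE_OVER = "voice_over"
    FULL_DUB = "full_dub"


# Shared front half: built once, reused by every mode.
SHARED_STAGES = ["transcription", "translation"]

# Delivery branches, in the order the text walks through them.
# Stage names are hypothetical labels, not a real framework's API.
BRANCH_STAGES = {
    Mode.SUBTITLES: ["typesetting", "write_subtitle_file"],
    Mode.VOICE_OVER: ["text_to_speech", "timing_fit", "mux_over_ducked_original"],
    Mode.FULL_DUB: ["voice_clone", "timing_fit", "lip_sync", "replace_audio_mux"],
}


def stages_for(mode: Mode) -> list[str]:
    """Return the ordered stage names a clip passes through for one mode."""
    return SHARED_STAGES + BRANCH_STAGES[mode]


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, "->", " > ".join(stages_for(mode)))
```

Keeping the shared stages in one list makes the design point concrete: a fix to transcription or translation benefits all three products at once.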

Pipeline diagram showing one shared front half branching into three outputs. A video enters at the left into a shared spine of two boxes: Transcription (ASR; word-level timestamps plus speaker labels) and Translation (length-controlled machine translation). After translation the flow forks into three horizontal branches. Top branch, Subtitles: a Typesetting box (cue splitting, reading-speed rules) writing an SRT / WebVTT / IMSC file. Middle branch, Voice-over: a Text-to-speech box then a Timing-fit box then a Mux-over-ducked-original box. Bottom branch, Full dub: a Voice-clone box then a Timing-fit box then a Lip-sync box then a Replace-audio mux box. A caption notes the front half is shared and built once; the branches differ only in delivery. Figure 2. One pipeline, three outputs. Recognition and translation are shared; only the delivery branch changes.

Stage One: Transcription — Get The Words And The Clock

The pipeline starts by turning speech into text, a job done by an automatic speech recognition (ASR) model: software that listens to audio and writes down the words. The dominant open model in 2026 is OpenAI's Whisper, usually run in production through the community-built WhisperX wrapper, which we cover in the WhisperX lesson. Managed services such as Deepgram and AssemblyAI cover the same job, and we compare them in the streaming ASR lesson.

For localization, ASR has to produce three things, not one. The obvious output is the words. The two outputs people forget are the clock and the cast. The clock is a timestamp on every word — the moment it starts and ends — because both subtitles and dubbing have to be placed in time, and a transcript without timing is useless for either. The cast is a speaker label — "this sentence was Anna, that one was Boris" — produced by a step called diarization, which matters the moment a video has more than one person talking, because a dub needs a different cloned voice per speaker.
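
To make "words, clock, and cast" concrete, here is a minimal sketch of the data a localization pipeline might carry out of the ASR stage. The `Word` and `Turn` classes and the `SPEAKER_00`-style labels are assumptions for illustration, not the output format of any particular ASR product.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str      # the words
    start: float   # the clock: start time in seconds
    end: float     # the clock: end time in seconds
    speaker: str   # the cast: diarization label (hypothetical naming)


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: list[Word]) -> list[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            last = turns[-1]
            last.end = w.end
            last.text += " " + w.text
        else:
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


words = [
    Word("Hello", 0.00, 0.40, "SPEAKER_00"),
    Word("there", 0.45, 0.80, "SPEAKER_00"),
    Word("Hi", 1.10, 1.30, "SPEAKER_01"),
]
turns = group_into_turns(words)
```

Each turn now has a speaker and a time span, which is what a dubbing branch would need in order to assign one voice per speaker and one time slot per line.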

The quality of this stage is measured by word error rate, abbreviated WER, which counts the mistakes per hundred words. The formula adds up three kinds of error and divides by the number of words that were actually said:

WER = (substitutions + deletions + insertions) / words_in_reference

If a 100-word clip comes back with 3 wrong words, 1 dropped word, and 1 invented word, the arithmetic is:

WER = (3 + 1 + 1) / 100 = 0.05 = 5%

A 5% word error rate sounds small, but every one of those errors flows downstream: a misheard word gets mistranslated and then spoken confidently in the wrong language. The transcription stage is where errors are cheapest to catch and most expensive to ignore, which is why a serious pipeline lets a human correct the transcript before translation, especially for proper names and jargon the model has never heard.
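
The formula above can be computed with a word-level edit distance. The sketch below is a minimal version that assumes both transcripts are already normalized (same casing, punctuation stripped); real evaluations usually add that normalization step first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


score = wer("the cat sat on the mat", "the cat sit on mat")
```

In this example one word is substituted ("sat" became "sit") and one is dropped ("the"), so the score is 2 errors over 6 reference words.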

Stage Two: Translation — And The Length Problem That Breaks Naive Pipelines

Translation is the stage everyone expects to be hard and that modern models mostly handle well. The trap is somewhere else: length. Languages do not use the same number of syllables to say the same thing. A line that takes three seconds in English can take three and three-quarter seconds in German or Spanish — translation expansion of around 20–30% is normal for those pairs. The translation is correct; it simply does not fit the time the original speaker took to say it.

For subtitles this means a cue that should sit on screen for three seconds now holds more text than a viewer can read in three seconds. For dubbing it is worse: the synthetic voice has to cram a longer sentence into the same slot, so it either speeds up — the chipmunk dub everyone has heard and hated — or it overruns and collides with the next line.

The engineering term for the goal is isochrony: the translated speech should be time-aligned with the original, matching not just the overall length but the pauses and, for lip-sync, the mouth movements. There are two ways to get there, and good pipelines use both.

The first is length-controlled translation: ask the translation model for a version that is not just accurate but close to a target length, measured in characters or estimated speaking time. Research systems such as isochrony-aware machine translation and VideoDubber bake the duration target into the translation step itself, so the model prefers a shorter synonym over a longer one when both are correct. The second is bounded time-stretching: after the voice is generated, stretch or compress the audio to fit — but only within a narrow band, because past roughly ±10–15% the human ear hears the distortion. If the translation is still too long after both, the pipeline shortens the wording rather than speeding the voice past the point of naturalness.
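
The decision order in the paragraph above can be sketched as a small planning function. This is an illustration under stated assumptions: the `plan_fit` name and the `FitPlan` fields are invented here, the 15% cap is the upper end of the band quoted in the text, and a line that is shorter than its slot is simply left alone rather than slowed down.

```python
from dataclasses import dataclass

MAX_STRETCH = 0.15  # upper end of the tolerance band quoted in the text


@dataclass
class FitPlan:
    action: str          # "as_is", "time_stretch", or "shorten_text"
    speed_factor: float  # playback rate for the generated audio (1.0 = unchanged)


def plan_fit(slot_s: float, speech_s: float, max_stretch: float = MAX_STRETCH) -> FitPlan:
    """Decide how to fit generated speech into the original speaker's time slot."""
    if slot_s <= 0 or speech_s <= 0:
        raise ValueError("durations must be positive")

    speed = speech_s / slot_s  # above 1.0 means the audio must be sped up to fit
    if speed <= 1.0:
        # Already fits; this sketch accepts a little trailing silence.
        return FitPlan("as_is", 1.0)
    if speed - 1.0 <= max_stretch:
        # Small overrun: compress the audio within the tolerated band.
        return FitPlan("time_stretch", speed)
    # Too long to compress naturally: send the line back for shorter wording.
    return FitPlan("shorten_text", 1.0)


mild_overrun = plan_fit(slot_s=3.0, speech_s=3.3)    # about 10% over
large_overrun = plan_fit(slot_s=3.0, speech_s=3.75)  # 25% over
```

With the three-second example from earlier, a 3.3-second rendering is compressed, while the 3.75-second rendering goes back for rewording instead of being sped up past the band.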

Translation quality is scored two ways. The old standard is BLEU, which counts how many word sequences the machine output shares with a human reference translation; it is fast and cheap but crude. The 2026 default for anything serious is COMET, a neural metric that was trained to predict human quality judgments and correlates with them far better than BLEU. Use BLEU for a quick regression check and COMET for a real quality gate.

Diagram of the translation length and isochrony problem. Top: a timeline bar three seconds long labelled with the English source line that fits exactly. Below it, the same three-second slot with a German translation bar that is 25% longer and overflows the slot, labelled translation expansion. Bottom: two fixes shown side by side. Fix one, length-controlled translation: a shorter synonym chosen so the target bar fits the slot. Fix two, bounded time-stretch: the overflowing bar compressed to fit, with a warning band marking plus or minus 10 to 15 percent as the limit before the voice sounds distorted. A note states that if both fixes are not enough, shorten the wording rather than speed the voice. Figure 3. The length problem and its two fixes. Translation expansion is normal; isochrony is the goal; ±10–15% is the time-stretch a listener will tolerate.

Stage Three: The Voice — Synthetic Narrator Or Cloned Speaker

Once you have a translation that fits, you turn it into sound. Two paths exist, and the choice carries both a quality and a legal weight.

The plain path is text-to-speech, abbreviated TTS — software that reads text aloud in a generic but natural voice you pick from a library. This is right for voice-over: a documentary narrator, a training-course presenter, a news read. There is no real person's voice being copied, so there is no consent question. We compare the production TTS engines — ElevenLabs, Kokoro, Cartesia, OpenAI — in the streaming TTS lesson.

The heavier path is voice cloning — capturing the timbre and cadence of the original speaker from a short sample so the dub sounds like them speaking the new language. This is what makes a full dub feel native, and it is also what turns the output into a deepfake of a real person, with the consent and disclosure duties that follow. We go deep on the cloning mechanics and the consent engineering in the voice cloning lesson.

Voice quality is measured by mean opinion score, abbreviated MOS — the average rating listeners give synthetic speech on a one-to-five scale, the subjective method standardized in ITU-T Recommendation P.800. A MOS near 4.0 is the rough line above which most listeners stop noticing the voice is synthetic for narration; lip-synced dialogue is judged more harshly because the eye is watching the mouth.
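
As a small illustration of the arithmetic behind a MOS figure, the sketch below averages a panel's one-to-five ratings and adds a rough 95% interval using a normal approximation. The `mos` helper and the sample ratings are invented for this example; a formal listening test involves far more protocol than this.

```python
from math import sqrt
from statistics import mean, stdev


def mos(ratings: list[int]) -> tuple[float, float]:
    """Return (mean opinion score, approximate 95% half-width) for 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    score = float(mean(ratings))
    # Normal approximation; small panels give wide, unreliable intervals.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


score, half_width = mos([4, 4, 5, 3, 4])
```

The interval matters as much as the mean: with only five listeners, a score of 4.0 is compatible with a true value well below the "stops noticing" line.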

The market for tools that package this end to end is crowded in 2026. ElevenLabs Dubbing covers 90-plus languages and bills each target language separately, with a Dubbing Studio that lets you edit the transcript, reassign speakers, and regenerate a single clip. HeyGen's Video Translate spans 175-plus languages with lip-sync, priced at a flat rate per minute on its creator tier. Rask AI covers 130-plus languages and meters per minute, which gets expensive at scale. Google's Aloud, built into YouTube Studio, dubs a narrow set of language pairs for free but without emotion or lip-sync. The right pick depends on language coverage, whether you need lip-sync, and how the per-minute math lands at your volume, which is the same unit-cost discipline we apply in the cost-model lesson.

Auto-Subtitles Done Right: The Format And The Reading-Speed Rules

Subtitles look like the easy mode, and producing wrong ones is indeed easy. Good ones obey two sets of rules: a file format the player understands, and timing rules a human can actually read.

Pick the right format

There are four formats that matter, and they are not interchangeable. The table below is the decision in one place.

| Format | What it is | Where it is used | Styling | Standard status |
|---|---|---|---|---|
| SRT (SubRip) | Plain text: number, timecode, line | Upload boxes, YouTube, editing tools | None | De-facto only; no formal spec |
| WebVTT | SRT's web successor, .vtt | Browsers (HTML <track>), HLS, DASH | Positioning, basic CSS | W3C Candidate Recommendation |
| TTML | XML timed-text, broadcast authoring | TV exchange, mastering | Rich (XML + CSS-like) | W3C Recommendation |
| IMSC1 | Constrained TTML profile for delivery | Broadcast/OTT in fMP4/CMAF | Per-cue colour, position, ruby | W3C Recommendation |

In practice: accept SRT on the way in because everyone has one, deliver WebVTT for web and adaptive streaming, and deliver IMSC1 — the broadcast-grade profile carried inside fragmented MP4 segments, a W3C Recommendation since 2018 — when you ship to OTT or television platforms. The captions-and-formats interplay across a streaming stack is covered in our captions and multi-audio article in the Streaming section.
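
"Accept SRT, deliver WebVTT" is a small conversion in the simple case. The sketch below handles plain cues only: it adds the `WEBVTT` header, drops the SRT counter lines, and swaps the comma in each timecode for a dot. The `srt_to_webvtt` name is made up for this example, and the sketch assumes the cue text needs no escaping or styling conversion.

```python
import re

# SRT timecodes put a comma before the milliseconds; WebVTT uses a dot.
_TIMECODE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert plain SRT text into a minimal WebVTT document."""
    # Drop a byte-order mark if present and normalize line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    out = ["WEBVTT", ""]
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        # SRT cues start with a numeric counter; WebVTT does not need it.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines or "-->" not in lines[0]:
            continue  # skip anything that is not a well-formed cue
        # Only the timing line is rewritten, so commas in the text survive.
        lines[0] = _TIMECODE.sub(r"\1.\2", lines[0])
        out.extend(lines)
        out.append("")
    return "\n".join(out)


srt = (
    "1\n00:00:01,000 --> 00:00:04,000\nHola, mundo.\n\n"
    "2\n00:00:04,500 --> 00:00:06,000\nSegunda línea.\n"
)
vtt = srt_to_webvtt(srt)
```

Going the other way, or on to IMSC1, loses or requires styling information, so SRT works better as an intake format than as a master.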

Obey the reading-speed math

A subtitle that is technically correct but on screen too briefly is a failed subtitle. The industry codifies this with a reading-speed limit measured in characters per second (CPS), plus line-length and duration bounds. Netflix's widely copied Timed Text Style Guide sets, for most languages, a maximum of about 17 CPS for adult content and 13 CPS for children's, at most two lines per cue, at most 42 characters per line, a minimum duration of five-sixths of a second, and a maximum of seven seconds.

Turn the reading-speed limit into the rule that drives cue-splitting. If a cue stays on screen for three seconds and your limit is 17 characters per second, the most text it may hold is:

max characters = 3.0 s × 17 CPS = 51 characters

So a translated line of 80 characters cannot live in a three-second cue: the pipeline must either split it across two cues or hold it longer, and if the speech only lasted three seconds, splitting is the only honest option. This is the arithmetic that an auto-subtitle stage runs on every cue — take the words and their timestamps from ASR, group them into cues, and keep splitting until every cue passes the CPS, line-count, line-length, and duration tests. Skip it and you get the auto-captions everyone has seen: a wall of text that vanishes before it can be read.
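
The four tests named above are easy to express as a check that runs on every cue. The sketch below uses the limits quoted in this section. The `Cue` class and `cue_violations` helper are invented for illustration, and counting every character including spaces is an assumption, since style guides differ on what counts.

```python
from dataclasses import dataclass

MAX_CPS = 17.0            # adult content; the text quotes 13 for children's
MAX_LINES = 2
MAX_CHARS_PER_LINE = 42
MIN_DURATION = 5 / 6      # seconds
MAX_DURATION = 7.0        # seconds


@dataclass
class Cue:
    start: float          # seconds
    end: float            # seconds
    lines: list[str]


def cue_violations(cue: Cue, max_cps: float = MAX_CPS) -> list[str]:
    """Return the names of the readability rules this cue breaks."""
    duration = cue.end - cue.start
    if duration <= 0:
        return ["invalid_timing"]

    problems = []
    chars = sum(len(line) for line in cue.lines)  # assumption: spaces count
    if duration < MIN_DURATION:
        problems.append("too_short")
    if duration > MAX_DURATION:
        problems.append("too_long")
    if len(cue.lines) > MAX_LINES:
        problems.append("too_many_lines")
    if any(len(line) > MAX_CHARS_PER_LINE for line in cue.lines):
        problems.append("line_too_wide")
    if chars / duration > max_cps:
        problems.append("too_fast")
    return problems


ok_cue = Cue(0.0, 3.0, ["a" * 30, "b" * 21])    # 51 characters in 3 seconds
fast_cue = Cue(0.0, 3.0, ["a" * 40, "b" * 40])  # 80 characters in 3 seconds
```

The 51-character cue sits exactly on the limit and passes; the 80-character cue from the example fails the speed test and has to be split.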

Subtitle format comparison and the reading-speed rule. On the left, a four-row table compares SRT, WebVTT, TTML, and IMSC1 by what each is, where it is used, its styling, and its W3C standard status: SRT is de-facto only with no styling, WebVTT is a W3C Candidate Recommendation for browsers and adaptive streaming, and TTML and IMSC1 are W3C Recommendations for broadcast authoring and delivery. On the right, a reading-speed card defines characters per second, shows the worked example 3.0 seconds times 17 CPS equals 51 characters, and lists the Netflix-style bounds — about 17 CPS for adults and 13 for children, 42 characters per line, two lines per cue, and five-sixths to seven seconds per cue — with a note that a cue past the limit is unreadable and can fail FCC and CVAA accessibility law. Figure 4. Subtitle formats and the reading-speed rule. Accept SRT, deliver WebVTT for web and IMSC1 for broadcast, and split every cue to the CPS limit.

Common mistake: shipping the raw ASR transcript as subtitles. The single most common localization failure is taking the timestamped transcript straight from the speech model and saving it as an SRT. The result violates every readability rule at once: cues that are too long, too fast, and broken mid-clause. ASR gives you words and times; it does not give you subtitles. The typesetting stage that turns one into the other, enforcing CPS, the two-line maximum, 42-character lines, and clause-aware line breaks, is not optional polish. It is the difference between a caption and a readable caption, and it is exactly the work cheap "auto-subtitle" features skip.
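
A minimal sketch of that typesetting step follows. It is a greedy illustration, not a production algorithm: it groups timestamped words into cues that fit two 42-character lines and a seven-second cap, and it prefers to close a cue at punctuation once the cue is more than one line long. The class and function names are invented here. The sketch does not enforce CPS, because a cue that fails on speed needs condensed wording or a longer out-time, not another split.

```python
import textwrap
from dataclasses import dataclass

MAX_CHARS_PER_LINE = 42
MAX_LINES = 2
MAX_DURATION = 7.0  # seconds
CLAUSE_END = (".", ",", "?", "!", ";", ":")


@dataclass
class Word:
    text: str
    start: float
    end: float


@dataclass
class Cue:
    start: float
    end: float
    lines: list[str]


def _text(words: list[Word]) -> str:
    return " ".join(w.text for w in words)


def _fits(words: list[Word]) -> bool:
    """True if the words wrap into the line budget and the duration cap."""
    lines = textwrap.wrap(_text(words), width=MAX_CHARS_PER_LINE)
    duration = words[-1].end - words[0].start
    return len(lines) <= MAX_LINES and duration <= MAX_DURATION


def _close(words: list[Word]) -> Cue:
    lines = textwrap.wrap(_text(words), width=MAX_CHARS_PER_LINE)
    return Cue(words[0].start, words[-1].end, lines)


def build_cues(words: list[Word]) -> list[Cue]:
    """Greedily group timestamped words into cues that respect the layout rules."""
    cues: list[Cue] = []
    current: list[Word] = []
    for word in words:
        # Start a new cue when one more word would break the layout rules.
        if current and not _fits(current + [word]):
            cues.append(_close(current))
            current = []
        current.append(word)
        # Clause-aware break: end at punctuation once past one full line.
        if word.text.endswith(CLAUSE_END) and len(_text(current)) > MAX_CHARS_PER_LINE:
            cues.append(_close(current))
            current = []
    if current:
        cues.append(_close(current))
    return cues


sentence = ("When the speaker finally pauses, the cue should end there, "
            "not two words later.").split()
timed = [Word(t, i * 0.4, i * 0.4 + 0.3) for i, t in enumerate(sentence)]
cues = build_cues(timed)
```

In the example the first cue closes on "there," instead of running on until the character budget is exhausted, which is the behaviour a raw transcript dump lacks.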

The Quality Gate: Where Each Stage Can Fail Quietly

Every stage of this pipeline can fail in a way that looks fine to the machine and wrong to a human, so a localization feature needs an automated quality gate with a number at each stage, the same fail-closed discipline we lay out in the generative-video gates lesson. Set a word-error-rate ceiling on transcription, a COMET floor on translation, a reading-speed check on every subtitle cue, and a MOS or duration-fit check on generated audio. When a clip fails a threshold, route it to a human rather than shipping it, because a confidently wrong dub in a language your team does not speak is invisible to you and obvious to your audience. The cheapest insurance is a bilingual human spot-check on a sample of output, sized to your risk tolerance; for anything in a regulated vertical like telemedicine or e-learning, review all of the output rather than a sample.
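
A fail-closed gate of this kind can be sketched in a few lines. Everything specific below is an assumption for illustration: the field names, the `route` helper, and above all the threshold values, which are placeholders to be tuned on your own data. A missing measurement counts as a failure, which is what "fail-closed" means here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipMetrics:
    wer: Optional[float]             # vs. a human-corrected transcript sample
    comet: Optional[float]           # translation quality score
    max_cue_cps: Optional[float]     # fastest subtitle cue in the clip
    duration_ratio: Optional[float]  # generated speech length / original slot


@dataclass
class Thresholds:
    # Placeholder values for illustration only, not recommendations.
    max_wer: float = 0.05
    min_comet: float = 0.80
    max_cps: float = 17.0
    max_duration_ratio: float = 1.15


def failed_checks(m: ClipMetrics, t: Thresholds) -> list[str]:
    """Return the names of failed checks; a missing metric fails closed."""
    checks = {
        "wer": m.wer is not None and m.wer <= t.max_wer,
        "comet": m.comet is not None and m.comet >= t.min_comet,
        "cps": m.max_cue_cps is not None and m.max_cue_cps <= t.max_cps,
        "duration": (m.duration_ratio is not None
                     and m.duration_ratio <= t.max_duration_ratio),
    }
    return [name for name, ok in checks.items() if not ok]


def route(m: ClipMetrics, t: Optional[Thresholds] = None) -> str:
    """Ship only when every check passes; otherwise send to a human."""
    return "ship" if not failed_checks(m, t or Thresholds()) else "human_review"


good = ClipMetrics(wer=0.03, comet=0.86, max_cue_cps=15.0, duration_ratio=1.05)
weak_translation = ClipMetrics(wer=0.03, comet=0.70, max_cue_cps=15.0, duration_ratio=1.05)
unmeasured = ClipMetrics(wer=None, comet=0.86, max_cue_cps=15.0, duration_ratio=1.05)
```

A real gate would require only the metrics relevant to the chosen mode, for example skipping the duration check on a subtitles-only clip.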

The Law: Disclosure For Voices, Accessibility For Captions

Two distinct bodies of law touch this pipeline, and they pull in opposite directions — one says tell people it is synthetic, the other says make it available to everyone.

Cloned voices must be disclosed

A dub that clones a real person's voice produces synthetic audio of a real human — an audio deepfake. From 2 August 2026, the European Union's AI Act, Article 50, makes two demands that land here. Providers of systems generating synthetic audio must mark the output as artificially generated in a machine-readable form. Deployers who generate a deepfake — content of a real person that would falsely appear authentic — must disclose to viewers that it is artificial. Breaching these transparency rules sits in a penalty tier reaching 15 million euros or 3% of worldwide annual turnover. The practical engineering follows the same three-layer pattern as generative video — a machine-readable mark, a record, and a visible label — and it starts before generation, with consent: you capture the original speaker's permission to clone their voice and store that record, the discipline we detail in the voice cloning and consent lesson. Generic TTS voice-over carries none of this weight, because no real person is being imitated — another reason to choose the lightest mode that works.
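
The "consent first, then a record and a label" order can be sketched as code that refuses to proceed without permission. This is an engineering illustration, not a compliance implementation: every class, field, and label string below is hypothetical, and the JSON sidecar only stands in for a machine-readable mark that a real system would attach to the media itself.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    granted: bool
    scope_languages: tuple[str, ...]  # languages the speaker agreed to


@dataclass(frozen=True)
class DisclosureRecord:
    clip_id: str
    speaker_id: str
    target_language: str
    ai_generated: bool
    visible_label: str  # text shown to viewers (hypothetical wording)
    created_at: str     # UTC timestamp, ISO 8601


def authorize_clone(consent: ConsentRecord, target_language: str) -> None:
    """Fail closed: no stored permission for this language, no cloning."""
    if not consent.granted or target_language not in consent.scope_languages:
        raise PermissionError(
            f"no consent on record to clone {consent.speaker_id!r} "
            f"into {target_language!r}"
        )


def make_disclosure(clip_id: str, consent: ConsentRecord,
                    target_language: str) -> DisclosureRecord:
    """Check consent, then build the record kept alongside the dubbed clip."""
    authorize_clone(consent, target_language)
    return DisclosureRecord(
        clip_id=clip_id,
        speaker_id=consent.speaker_id,
        target_language=target_language,
        ai_generated=True,
        visible_label="This voice was generated with AI.",
        created_at=datetime.now(timezone.utc).isoformat(),
    )


def to_sidecar_json(record: DisclosureRecord) -> str:
    return json.dumps(asdict(record), sort_keys=True)


consent = ConsentRecord("anna", True, ("es", "de"))
record = make_disclosure("clip-1", consent, "es")
```

Putting the consent check inside the function that creates the record means a dub cannot reach the labelling step, let alone publication, without a stored permission.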

Captions are an accessibility obligation

While disclosure law governs synthetic voices, accessibility law governs subtitles and captions — and it is older and well enforced. In the United States, the FCC sets four quality standards for closed captions on covered video: accuracy (the captions match the words and convey tone), synchronicity (they line up with the audio and stay readable), completeness (they run start to finish), and placement (they do not cover faces or on-screen text). The Twenty-First Century Communications and Video Accessibility Act, the CVAA, extends this to online video that previously aired on television, requiring captions "of at least the same quality" as the broadcast. The Web Content Accessibility Guidelines, WCAG, treat captions for prerecorded video as a baseline requirement for an accessible site. The point for your pipeline: the reading-speed and placement rules above are not just craft, they are the measurable core of a legal standard, and an auto-caption feature that ignores them can fail compliance as well as readability.

Where Fora Soft Fits In

We build the video products that localization features live inside — video conferencing, OTT and Internet TV platforms, e-learning systems, and telemedicine apps — and we have wired these pipelines into them since long before the AI versions existed. When a client wants their e-learning courses in eight languages, or their OTT catalog subtitled to a broadcast spec, or a conferencing product with live translated captions, the model is the easy part; the work is the timing fit, the reading-speed-correct subtitle stage, the speaker-aware dubbing, and the disclosure and accessibility layers that keep the feature lawful in each market. We treat the choice between subtitles, voice-over, and full dub as a product decision made per vertical and per budget, not a default — because in regulated verticals, the lightest compliant mode is usually the right one.

Talk To Us / See Our Work / Download

  • Talk to a video engineer — bring us your localization feature and we will design the pipeline, pick the right mode per market, and build the quality, disclosure, and accessibility gates. Book a 30-minute scoping call.
  • See our case studies — video conferencing, OTT, e-learning, and telemedicine products we have shipped since 2005. View the portfolio.
  • Download the Video Localization Pipeline Checklist — the three-mode decision, the six pipeline stages, the subtitle timing rules, the quality-gate thresholds, and the disclosure and accessibility obligations, on one page. Download the PDF.

References

  1. W3C — TTML Profiles for Internet Media Subtitles and Captions 1.1 (IMSC1.1), W3C Recommendation (text and image profiles; per-cue colour, positioning, ruby; carried in fMP4/CMAF; IMSC1.0.1 and IMSC1.1 both became Recommendations in 2018). https://www.w3.org/TR/ttml-imsc1.1/ — accessed 2026-06-01. Tier 1 (W3C Recommendation).
  2. W3C — WebVTT: The Web Video Text Tracks Format (cue syntax derived from SRT, positioning and styling, used by HTML <track>, HLS, and DASH; Candidate Recommendation on the Recommendation track). https://www.w3.org/TR/webvtt1/ — accessed 2026-06-01. Tier 1 (W3C Candidate Recommendation).
  3. W3C — Timed Text Markup Language (TTML) Profiles overview (TTML1/TTML2 as the broadcast authoring and exchange format; relationship to IMSC and WebVTT). https://www.w3.org/AudioVideo/TT/docs/TTML-Profiles.html — accessed 2026-06-01. Tier 1 (W3C).
  4. EU Artificial Intelligence Act — Article 50, Transparency Obligations (§2 machine-readable marking of synthetic audio/image/video/text by providers; §4 deepfake disclosure by deployers, including audio; in force 2 Aug 2026 per Article 113). https://artificialintelligenceact.eu/article/50/ — accessed 2026-06-01. Tier 1 (primary law; Regulation (EU) 2024/1689).
  5. ITU-T Recommendation P.800 — Methods for subjective determination of transmission quality (Mean Opinion Score / MOS; 1–5 absolute category rating used to score synthetic and transmitted speech quality). https://www.itu.int/rec/T-REC-P.800 — accessed 2026-06-01. Tier 1 (ITU-T standard).
  6. Netflix Partner Help Center — Timed Text Style Guide: General Requirements and Subtitle Timing Guidelines (reading speed ~17 CPS adult / 13 CPS children, max 42 characters per line, max two lines per cue, min 5/6 s and max 7 s per event). https://partnerhelp.netflixstudios.com/hc/en-us/articles/215758617-Timed-Text-Style-Guide-General-Requirements — accessed 2026-06-01. Tier 4 (industry deployer style guide; de-facto subtitle craft standard).
  7. US FCC — Closed Captioning Quality Standards for Television Programming (four factors: accuracy, synchronicity, completeness, placement) and CVAA (21st Century Communications and Video Accessibility Act of 2010) requiring online re-runs of TV captioned "at least the same quality"; caption-settings accessibility deadline 17 Aug 2026. https://www.fcc.gov/general/closed-captioning-video-programming-television — accessed 2026-06-01. Tier 1 (regulator).
  8. Chronopoulou et al. / Amazon — "Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing" (arXiv 2302.12979): isochrony as joint optimization of translation and speech duration rather than length control then rate adjustment. https://arxiv.org/abs/2302.12979 — accessed 2026-06-01. Tier 5 (peer-reviewed/academic).
  9. Wu et al. — "VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing" (arXiv 2211.16934): speech-duration-aware length control inside the translation model to match source and target speech length. https://arxiv.org/abs/2211.16934 — accessed 2026-06-01. Tier 5 (academic).
  10. RWS — "AI dubbing in 2026: the complete guide" and corroborating 2026 cost reporting (AI dubbing ~$1–20/min vs human studio ~$20–40/min and $5,000–15,000 per content-hour per language; ~90% cost reduction; minutes-to-hours vs weeks-to-months turnaround). https://www.rws.com/blog/ai-dubbing-in-2026/ — accessed 2026-06-01. Tier 6 (vendor/industry analysis; year-labelled, directional cost figures).
  11. ElevenLabs — Dubbing capability and pricing documentation (90+ languages, per-target-language billing, Dubbing Studio transcript editing / speaker reassignment / per-clip regeneration). https://elevenlabs.io/docs/overview/capabilities/dubbing — accessed 2026-06-01. Tier 4 (vendor primary).