Published: 2026-06-06 · Reading time: 21 min read · Author: Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you run a streaming product, a localization pipeline, or an OTT platform, the audio side of "ship it in seven languages" is where small labelling mistakes turn into the wrong track playing, an accessibility audit failing, or a regulator's letter arriving. A product manager scoping localization, an operations lead approving a delivery spec, and an engineer packaging the manifest all touch the same set of tags, and the rules for those tags come from the streaming standards (HLS and DASH) and from accessibility law (the United States CVAA, the European Accessibility Act). This article explains what each track type is in plain language, shows the exact manifest signalling that makes a player select it correctly, and gives you the storage arithmetic so the "multi-track tax" is a number you can plan around rather than a surprise on the cloud bill.
The problem in one sentence
One video, many soundtracks. A film shot in English might ship with the original English mix, a French dub, a Spanish dub, a German dub, and — separately — an English track that describes the on-screen action for viewers who cannot see it. The picture is identical for everyone. Only the audio changes. The streaming system's job is to store all of those soundtracks once, deliver only the one the viewer wants, and switch between them without re-downloading the video.
That last point is the whole reason this works at all. In a modern adaptive stream the video and the audio are kept in separate files, and the player stitches them together at playback time. (If the idea of separating audio and video into independent tracks inside a container is new, the companion piece on how MP4, MKV, fMP4 and MPEG-TS carry audio walks through the mechanics.) Because audio is independent, adding a tenth language adds one more small audio file — not a tenth full copy of the movie.
The catch is that the player has to be told what each audio file is. It needs to know which one is French, which one is the original, which one is the described version, and which one to start with if the viewer has expressed no preference. Those instructions live in a text file called the manifest, and the rest of this article is about how to write them.
The four track types you will actually ship
Before the signalling, the vocabulary. There are four kinds of audio track that show up in real catalogues, and they are easy to confuse because three of them are "just another language."
The original-language track is the soundtrack as the title was produced — English for a Hollywood film, Korean for a Korean drama. It is the reference everything else is measured against.
A dub is a full replacement soundtrack in another language, where translated dialogue is mixed back over the original music and effects. A French viewer of an English film hears French actors over the same explosions and score. A dub is a complete, standalone mix: the player swaps the whole soundtrack, it does not layer anything.
A descriptive audio track — called audio description (AD) in Europe and described video service (DVS) in North America — is a soundtrack for blind and low-vision viewers. On top of the normal dialogue, music, and effects, a narrator describes what is happening on screen during the gaps in dialogue: "She slips the letter into her coat and walks out into the rain." It is a separate, complete soundtrack, not a setting on the normal one.
A sign-language track is the odd one out, because the "audio" accessibility need here is actually visual: a sign-language interpreter signing the dialogue. In streaming this is usually carried as an extra video track (a small inset of the interpreter) rather than an audio track, and we cover where it fits at the end. We include it here because product teams group it with "the accessibility audio versions" even though the plumbing differs.
The reason the distinction matters: to a careless packager, a dub and a descriptive track are both "just another audio file with a language tag," yet the player must never confuse them. A sighted French viewer wants the French dub; a blind English viewer wants the English described track; neither wants the other. The manifest labels are what keep them apart.
Figure 1. One video, many soundtracks. The video is stored once; each audio version is a small independent file. The player selects one audio track using the labels in the manifest. Sign-language interpretation rides as a separate video track, not an audio track.
How HLS labels audio tracks
HLS — HTTP Live Streaming, Apple's streaming format — describes alternate audio with a tag called EXT-X-MEDIA. One EXT-X-MEDIA line declares one rendition (one track), and a group of them with the same GROUP-ID tells the player "these are the audio choices for this video." The player then plays exactly one of them alongside the video.
The attributes that decide selection are defined in the HLS specification, IETF RFC 8216, section 4.3.4.1. The ones that matter for multi-language and accessibility are these.
LANGUAGE carries a language tag such as en, fr, or es. NAME is the human-readable label the user sees in the audio menu — "Français", "English (described)" — and RFC 8216 makes NAME a required attribute on every rendition. DEFAULT is an enumerated yes/no value; when it is YES, the specification says the client "SHOULD play this Rendition of the content in the absence of information from the user indicating a different choice." In plain terms: this is the track that plays if the viewer never picks one.
AUTOSELECT is the subtler one. When it is YES, the client "MAY choose to play this Rendition in the absence of explicit user preference because it matches the current playback environment, such as chosen system language." So a French viewer whose device language is French can land on the French dub automatically — but only if you mark that track AUTOSELECT=YES. RFC 8216 adds a binding rule worth memorizing: "If the AUTOSELECT attribute is present, its value MUST be YES if the value of the DEFAULT attribute is YES." A default track is, by definition, auto-selectable.
The accessibility signal is CHARACTERISTICS. It is a comma-separated list of identifiers, and for audio the relevant one is public.accessibility.describes-video. RFC 8216 states plainly: "An AUDIO Rendition MAY include the following characteristic: 'public.accessibility.describes-video'." That single string is how a player, and an accessibility audit tool, recognizes your English descriptive track as the described version rather than a second English dub.
Here is a minimal HLS audio group with an original, a dub, and a described track:
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="en",NAME="English",DEFAULT=YES,AUTOSELECT=YES,URI="en/audio.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="fr",NAME="Français",DEFAULT=NO,AUTOSELECT=YES,URI="fr/audio.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="en",NAME="English (described)",DEFAULT=NO,AUTOSELECT=NO,CHARACTERISTICS="public.accessibility.describes-video",URI="en-ad/audio.m3u8"
Read it as three sentences. The first track is English, it is the default, and it is auto-selectable. The second is French, not the default, but auto-selectable for French-language devices. The third is English, never the default, never auto-selected — it plays only when the viewer explicitly asks for description — and it is tagged as a video-describing track. Note one common-sense rule baked into the standard: the FORCED attribute, which marks essential content that plays automatically, is allowed only on subtitle tracks, never on audio — RFC 8216 says "The FORCED attribute MUST NOT be present unless the TYPE is SUBTITLES."
A frequent mistake here is leaving the described track AUTOSELECT=YES. A device with the description preference off can then land on the described track instead of the plain mix in the same language, and the viewer hears narration they never asked for. Mark descriptive tracks AUTOSELECT=NO and surface them only when the viewer turns description on.
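The RFC rules quoted above are mechanical enough to check on every packaging run. What follows is a minimal Python sketch, not a full playlist parser. The helper names parse_ext_x_media and lint_rendition are hypothetical, the regular expression only covers simple attribute lists like the ones shown in this article, and the last check encodes this article's recommendation for described tracks rather than anything RFC 8216 requires.

```python
import re

# Matches KEY=value pairs; quoted values may contain commas.
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

DESCRIBES_VIDEO = "public.accessibility.describes-video"

MANIFEST_LINES = [
    '#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="en",NAME="English",'
    'DEFAULT=YES,AUTOSELECT=YES,URI="en/audio.m3u8"',
    '#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="fr",NAME="Français",'
    'DEFAULT=NO,AUTOSELECT=YES,URI="fr/audio.m3u8"',
    '#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",LANGUAGE="en",NAME="English (described)",'
    'DEFAULT=NO,AUTOSELECT=NO,CHARACTERISTICS="public.accessibility.describes-video",'
    'URI="en-ad/audio.m3u8"',
]


def parse_ext_x_media(line):
    """Split one #EXT-X-MEDIA line into a dict of attribute name -> value."""
    prefix = "#EXT-X-MEDIA:"
    if not line.startswith(prefix):
        raise ValueError("not an EXT-X-MEDIA line")
    attrs = {}
    for key, value in ATTR_RE.findall(line[len(prefix):]):
        attrs[key] = value.strip('"')
    return attrs


def lint_rendition(attrs):
    """Return a list of human-readable problems for one rendition."""
    problems = []
    # RFC 8216: NAME is required on every rendition.
    if "NAME" not in attrs:
        problems.append("NAME is required")
    # RFC 8216: if AUTOSELECT is present it MUST be YES when DEFAULT is YES.
    if attrs.get("DEFAULT") == "YES" and attrs.get("AUTOSELECT", "YES") != "YES":
        problems.append("AUTOSELECT must be YES when DEFAULT=YES")
    # RFC 8216: FORCED MUST NOT be present unless TYPE is SUBTITLES.
    if "FORCED" in attrs and attrs.get("TYPE") != "SUBTITLES":
        problems.append("FORCED is only allowed on SUBTITLES")
    # Article recommendation (not an RFC rule): described tracks stay opt-in.
    characteristics = attrs.get("CHARACTERISTICS", "").split(",")
    if DESCRIBES_VIDEO in characteristics and attrs.get("AUTOSELECT") == "YES":
        problems.append("described track is AUTOSELECT=YES")
    return problems


for line in MANIFEST_LINES:
    rendition = parse_ext_x_media(line)
    print(rendition["NAME"], lint_rendition(rendition))
```

Run against the three-line group above, the lint reports no problems for any rendition. A check like this is cheap to wire into a packaging pipeline, though a production version would want a real HLS parser behind it.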
How DASH labels audio tracks
DASH — Dynamic Adaptive Streaming over HTTP, the ISO/IEC standard 23009-1 — uses a different but parallel structure. Each independent audio version is an AdaptationSet, and the purpose of that set is declared with a Role descriptor. The role values come from a published list, the scheme urn:mpeg:dash:role:2011, defined in ISO/IEC 23009-1 and elaborated in the DASH-IF interoperability guidelines (Part 8: Audio).
The role values you will use for multi-language and accessibility are main (the primary, fully-presentable soundtrack), alternate (another fully-presentable soundtrack, such as a dub), dub (a track explicitly identified as a translated/dubbed version), commentary (a track meant to be mixed with another, like a director's commentary or a receiver-mix description), and description (a track that describes the video for accessibility). A <Label> element carries the human-readable menu name, and the lang attribute on the AdaptationSet carries the language.
The DASH-IF guidance is precise about when a track is main/alternate versus commentary. If an adaptation set is a full audio service intended for direct presentation, its role is main or alternate. If it must be decoded and mixed with another full service before presentation — the "receiver-mix" pattern we explain next — its role is commentary. The description role is how a player tells a descriptive track apart from a plain dub when both are in the same language. As one widely-cited summary of the DASH-IF guidelines puts it, the Role descriptor "is how a player knows that one of three English AdaptationSets is the descriptive-audio track for accessibility, not the dub."
A trimmed DASH manifest with an original, a French dub, and an English description looks like this:
<AdaptationSet contentType="audio" lang="en">
  <Role schemeIdUri="urn:mpeg:dash:role:2011" value="main"/>
  <Label>English</Label>
  <!-- representations … -->
</AdaptationSet>
<AdaptationSet contentType="audio" lang="fr">
  <Role schemeIdUri="urn:mpeg:dash:role:2011" value="dub"/>
  <Label>Français</Label>
</AdaptationSet>
<AdaptationSet contentType="audio" lang="en">
  <Role schemeIdUri="urn:mpeg:dash:role:2011" value="description"/>
  <Accessibility schemeIdUri="urn:tva:metadata:cs:AudioPurposeCS:2007" value="1"/>
  <Label>English (described)</Label>
</AdaptationSet>
The structure mirrors HLS one-for-one: a default-ish main track, a dub flagged as a dub, and a description flagged as a description with an extra Accessibility descriptor (value 1 in the standard TV-Anytime audio-purpose scheme means "visually impaired, described"). If you also publish a DASH stream, the cross-section deep dive on the MPEG-DASH manifest structure covers periods, adaptation sets, and representations in full.
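To see how a tool might read those descriptors, here is a short Python sketch that lists each audio AdaptationSet with its language, role, label, and whether it carries the audio-purpose descriptor. It is an illustration under assumptions: the function name summarize_audio_sets is hypothetical, the fragment is wrapped in a bare Period element so it parses on its own, and it has no XML namespace. A real MPD declares the urn:mpeg:dash:schema:mpd:2011 namespace, so the element lookups would need to be namespace-aware.

```python
import xml.etree.ElementTree as ET

ROLE_SCHEME = "urn:mpeg:dash:role:2011"
AUDIO_PURPOSE_SCHEME = "urn:tva:metadata:cs:AudioPurposeCS:2007"

# The trimmed manifest from the article, wrapped in a Period so it is one document.
FRAGMENT = """
<Period>
  <AdaptationSet contentType="audio" lang="en">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="main"/>
    <Label>English</Label>
  </AdaptationSet>
  <AdaptationSet contentType="audio" lang="fr">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="dub"/>
    <Label>Français</Label>
  </AdaptationSet>
  <AdaptationSet contentType="audio" lang="en">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="description"/>
    <Accessibility schemeIdUri="urn:tva:metadata:cs:AudioPurposeCS:2007" value="1"/>
    <Label>English (described)</Label>
  </AdaptationSet>
</Period>
"""


def summarize_audio_sets(xml_text):
    """Return one dict per audio AdaptationSet: lang, roles, label, described."""
    root = ET.fromstring(xml_text)
    tracks = []
    for aset in root.iter("AdaptationSet"):
        if aset.get("contentType") != "audio":
            continue
        # Only count Role descriptors from the standard DASH role scheme.
        roles = [
            role.get("value")
            for role in aset.findall("Role")
            if role.get("schemeIdUri") == ROLE_SCHEME
        ]
        # Value "1" in the audio-purpose scheme marks audio description.
        described = any(
            acc.get("schemeIdUri") == AUDIO_PURPOSE_SCHEME and acc.get("value") == "1"
            for acc in aset.findall("Accessibility")
        )
        tracks.append({
            "lang": aset.get("lang"),
            "roles": roles,
            "label": aset.findtext("Label", default=""),
            "described": described,
        })
    return tracks


for track in summarize_audio_sets(FRAGMENT):
    print(track)
```

The two English sets come back with the same lang value, and only the role and the described flag tell them apart, which is the point of the Role and Accessibility descriptors.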
Figure 2. The same three tracks in HLS and DASH. The structures differ but map cleanly: HLS DEFAULT/AUTOSELECT/CHARACTERISTICS correspond to DASH Role plus an Accessibility descriptor.
Two ways to deliver described audio: broadcast-mix vs receiver-mix
Audio description comes in two delivery shapes, and the choice drives both your storage cost and which players can handle it.
In the broadcast-mix model, the descriptive narration is already blended into a complete soundtrack at the studio. You ship a full, standalone audio file: dialogue, music, effects, and the narration, all mixed together. The player treats it exactly like a dub — it is just another complete track to swap in. This is the model HLS uses, and it is the simplest: any player that can switch audio tracks can play it, because there is nothing special to do at decode time. The cost is storage. A 90-minute film that offers description in five languages stores five extra full soundtracks.
In the receiver-mix model, you ship only the narration — a thin track of the describer's voice and silence — plus instructions for how loudly to mix it over the normal soundtrack. The player decodes two streams and combines them in real time. This is far cheaper to store, because the narration track is mostly silence and carries no music or effects, and it lets the viewer adjust the description volume independently. DASH supports this through the commentary role and dedicated receiver-mix signalling; it is common in European broadcast (DVB) and DTV systems. The cost is player capability: the device must be able to decode and mix two audio streams at once, which not every web or TV player does.
The practical rule: if your reach includes web browsers and a wide range of smart TVs, broadcast-mix described tracks are the safe default because they need nothing special from the player. If you control the player and storage cost dominates — many titles, many languages — receiver-mix is the efficient choice. Large catalogue services like Netflix support both patterns depending on territory and device.
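For intuition about what the receiver-mix player does at decode time, the sketch below blends a narration-only track over the main mix, sample by sample. It is a toy model under stated assumptions: samples are floats in the range -1.0 to 1.0, both tracks are already decoded and time-aligned, and main_gain and narration_gain are hypothetical parameters standing in for the mixing instructions shipped with the narration. Real receivers work on decoded PCM buffers and follow the signalling of their platform.

```python
def receiver_mix(main, narration, main_gain=1.0, narration_gain=1.0):
    """Blend a narration-only track over the main mix.

    main, narration: sequences of float samples in [-1.0, 1.0], time-aligned.
    main_gain, narration_gain: hypothetical per-track gains (assumptions).
    """
    mixed = []
    for m, n in zip(main, narration):
        sample = m * main_gain + n * narration_gain
        # Clamp so the summed signal stays inside the valid sample range.
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


# Narration is silent on the first sample and speaks on the second.
main_mix = [0.5, 0.5]
narration_only = [0.0, 0.25]
print(receiver_mix(main_mix, narration_only, narration_gain=2.0))
```

Where the narration is silent the output equals the main mix, which is why the narration track is so cheap to store. Raising narration_gain models the viewer turning the description volume up independently of the soundtrack.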
Figure 3. Broadcast-mix ships one complete described soundtrack (simple for players, heavier to store). Receiver-mix ships a narration-only track that the player blends with the main mix (light to store, needs a capable player).
What the law actually requires
Accessibility audio is not optional in the two largest media markets. The details matter because compliance is measured in specific numbers, on specific dates.
In the United States, audio description is mandated under the Twenty-First Century Communications and Video Accessibility Act (the CVAA), implemented by the Federal Communications Commission in the rule at 47 CFR 79.3. Major broadcast affiliates (ABC, CBS, Fox, NBC) and large cable and satellite systems must provide 87.5 hours of audio-described programming per calendar quarter (roughly seven hours a week), of which 50 hours must be prime-time or children's programming. The geographic reach expands on a schedule: as of January 1, 2026, the rule extends to television markets ranked 111 through 120, and it continues to expand by ten markets a year. For non-broadcast networks, the FCC names the top five every three years; the current list (set July 1, 2024) is TLC, HGTV, Hallmark, TNT, and TBS, with TNT and TBS holding a limited waiver until June 30, 2027.
In Europe, the European Accessibility Act (EAA) has applied since June 28, 2025. It makes accessibility features, including audio description where applicable, mandatory for a broad set of digital services, and services that provide access to audiovisual media fall within its scope. The technical yardstick is the harmonized standard EN 301 549, which references the Web Content Accessibility Guidelines (WCAG). The EAA works alongside the Audiovisual Media Services Directive (AVMSD), which requires member states to encourage providers to make content progressively accessible through subtitling, audio description, and sign language. The EAA grants a transitional period for some pre-existing content, but new on-demand catalogues are expected to comply.
The accessibility guidelines themselves draw a useful line. Under WCAG, providing audio description for prerecorded video is a Level AA requirement (success criterion 1.2.5) — the level most laws target. Providing a sign-language interpretation of prerecorded audio is success criterion 1.2.6, set at Level AAA, the aspirational tier. That difference explains why descriptive audio is treated as a baseline legal obligation while sign-language video is more often a value-add.
The practical consequence for engineering: described tracks are not a "nice to have" you add later. If your catalogue serves the United States or the European Union, the described track and its correct manifest labelling are part of the delivery spec, and an accessibility audit will check that public.accessibility.describes-video (HLS) or the description role and audio-purpose descriptor (DASH) are actually present and correctly set.
Sign-language audio: the visual accessibility track
Sign-language interpretation is grouped with accessibility audio in product planning, but technically it is a video problem. The interpreter signs the dialogue, so the accessibility content is a moving picture, not a sound.
There are two ways to deliver it. The simpler way, open sign language, edits the interpreter permanently into a small inset window inside the main video — a picture-in-picture box, usually in a lower corner, sometimes offset so it does not cover on-screen text. Every viewer sees it; it cannot be turned off. The flexible way, closed sign language, ships the interpreter as a separate, switchable video track (in DASH, a separate video AdaptationSet, often carrying the sign role) that the player can show or hide and, ideally, enlarge to full screen on demand. Both approaches work with standard HLS and DASH players; the closed approach is friendlier because viewers who do not need it are not forced to watch it.
The reason this lives at Level AAA rather than AA in the guidelines is partly cost and partly the smaller affected audience, but the engineering pattern is worth knowing: when a product team asks for "sign-language support," they are asking for an extra video rendition and the player UI to toggle it, not an extra entry in the audio group.
The storage and CDN math
Every extra soundtrack costs bytes to store and bytes to deliver. The arithmetic is simple, and doing it once removes the fear of "won't ten languages blow up the bill?"
Take a 90-minute feature. Audio bitrate is measured in kilobits per second (kbps) — thousands of bits each second. A stereo dub at a common streaming bitrate of 128 kbps works out like this:
size (bits)  = bitrate (bits/sec) × duration (sec)
             = 128,000 × (90 × 60)
             = 128,000 × 5,400
             = 691,200,000 bits
size (bytes) = 691,200,000 ÷ 8
             = 86,400,000 bytes
             ≈ 86 MB
So one stereo language track for a 90-minute film is about 86 MB. Ship eight dub languages and you add roughly 8 × 86 MB ≈ 691 MB — under three-quarters of a gigabyte. Now compare that to the video. A single 1080p video rendition of the same film at, say, 5 Mbps (5,000 kbps) is:
5,000,000 × 5,400 ÷ 8 ≈ 3,375,000,000 bytes ≈ 3.4 GB
One 1080p video rendition alone is about 3.4 GB. The eight extra language tracks together (≈ 0.69 GB) cost about one-fifth of a single video rendition — and a real adaptive ladder stores several video renditions, so the audio is a small fraction of total storage. This is the key insight: audio languages are cheap relative to video. A described broadcast-mix track costs the same ~86 MB as a dub; a receiver-mix narration track, mostly silence, often compresses to a fraction of that.
The exception is immersive audio. A Dolby Atmos track for the same film, at a streaming bitrate around 768 kbps, is roughly:
768,000 × 5,400 ÷ 8 ≈ 518,400,000 bytes ≈ 518 MB
So one Atmos language is about six times a stereo language. Offering Atmos in eight languages (≈ 4.1 GB) starts to rival the video itself, which is why services usually limit immersive audio to the original and a small number of premium dubs. (The immersive-audio side is covered in the companion piece on Atmos and immersive audio in streaming.) To run these numbers for your own catalogue, use the cheat sheet below, which packs the per-language sizes, the signalling tags, and the legal thresholds onto one page.
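The arithmetic in this section reduces to one formula, so it is easy to script for a whole catalogue. The Python sketch below reproduces the figures above; track_size_bytes is a hypothetical helper name, and it assumes constant bitrate and ignores container and segment overhead, so real files will be slightly larger.

```python
def track_size_bytes(bitrate_kbps: int, duration_min: int) -> int:
    """Size of a constant-bitrate track: bits per second x seconds, divided by 8."""
    bits = bitrate_kbps * 1000 * duration_min * 60
    return bits // 8


DURATION_MIN = 90  # the 90-minute feature used in the article

stereo = track_size_bytes(128, DURATION_MIN)   # one stereo dub
video = track_size_bytes(5000, DURATION_MIN)   # one 1080p rendition at 5 Mbps
atmos = track_size_bytes(768, DURATION_MIN)    # one Atmos language

eight_dubs = 8 * stereo
print("one stereo dub (bytes):", stereo)
print("eight stereo dubs (bytes):", eight_dubs)
print("one 1080p rendition (bytes):", video)
print("one Atmos language (bytes):", atmos)
print("eight dubs as a share of one video rendition:", eight_dubs / video)
print("Atmos relative to stereo:", atmos // stereo)
```

Swapping in your own bitrates, durations, and language counts gives a per-title storage estimate, and multiplying by the number of titles gives the catalogue-level figure.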
A common pitfall: the language-tag mismatch
The single most common production bug in multi-language audio is not a bitrate or a mix; it is a mismatched or missing language tag. If the audio file's internal language metadata says eng but the manifest LANGUAGE (HLS) or lang (DASH) says fr, different players resolve the conflict differently — some trust the manifest, some trust the file — and the menu shows the wrong label or auto-selects the wrong track. The same class of bug hides a described track when its CHARACTERISTICS or Accessibility descriptor is dropped during repackaging, which then quietly fails an accessibility audit even though the audio itself is perfect. Validate that the manifest tag, the container metadata, and the menu label agree for every track, on every packaging run — the tags are the contract, and a silent mismatch is worse than a missing track because nobody notices until a viewer complains.
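The agreement check described above can be sketched in a few lines of Python. Everything named here is an assumption for illustration: languages_agree and normalize are hypothetical helpers, the lookup table covers only a handful of three-letter codes, and reading the container's language field is left to whatever probing tool your pipeline already uses.

```python
# Small illustrative table: three-letter container codes -> two-letter tags.
# French and German each have two three-letter forms in common use.
ISO639_2_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "spa": "es",
    "deu": "de",
    "ger": "de",
}


def normalize(tag: str) -> str:
    """Reduce a language tag to a comparable two-letter primary subtag."""
    primary = tag.split("-")[0].lower()  # "en-US" -> "en"
    return ISO639_2_TO_1.get(primary, primary)


def languages_agree(manifest_tag: str, container_tag: str) -> bool:
    """True when the manifest tag and the container metadata name the same language."""
    return normalize(manifest_tag) == normalize(container_tag)


# Hypothetical packaging report: (manifest tag, container tag, menu label).
tracks = [
    ("en", "eng", "English"),
    ("fr", "eng", "Français"),  # the mismatch described in the article
    ("fr", "fre", "Français"),
]
for manifest_tag, container_tag, label in tracks:
    status = "ok" if languages_agree(manifest_tag, container_tag) else "MISMATCH"
    print(label, manifest_tag, container_tag, status)
```

Failing the packaging run on any mismatch turns a silent labelling bug into a visible build error, which is usually the cheaper place to catch it.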
Where Fora Soft fits in
Multi-language and accessible audio show up across the video products we build. In OTT and Internet-TV platforms, the work is authoring correct HLS and DASH manifests so dubs, original tracks, and described tracks select reliably across browsers, mobile, and smart TVs. In e-learning and telemedicine, multi-language audio and accessibility are often a procurement requirement, and the described-track and caption stack has to pass an audit. In video conferencing and AR/VR, real-time language tracks and interpreter feeds raise the same labelling questions in a live setting. Across these verticals, the recurring task is the same: make the player pick the right soundtrack, every time, for every viewer, and prove it does.
What to read next
- Audio in HLS, DASH, CMAF: how a player picks an audio track
- Audio adaptive bitrate ladders: do you actually need them?
- LUFS targets per platform in 2026
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your multi-language audio streaming plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Multi-Language & Accessible Audio Cheat Sheet — HLS EXT-X-MEDIA and DASH Role signalling for original/dub/described tracks, broadcast-mix vs receiver-mix, per-language storage math, and the US (47 CFR 79.3) and EU (EAA) accessibility thresholds on one page.
References
- IETF RFC 8216, HTTP Live Streaming, §4.3.4.1 (EXT-X-MEDIA), August 2017. Defines NAME (required), DEFAULT, AUTOSELECT (MUST be YES if DEFAULT=YES), FORCED (subtitles only), and the audio characteristic public.accessibility.describes-video. Primary source (Tier 1). https://www.rfc-editor.org/rfc/rfc8216.html
- IETF draft-pantos-hls-rfc8216bis (latest revision), the in-progress update to RFC 8216. Confirms the same characteristics list; Internet-Draft, subject to change before publication as an RFC. https://datatracker.ietf.org/doc/html/draft-pantos-hls-rfc8216bis
- ISO/IEC 23009-1:2022, Dynamic adaptive streaming over HTTP (DASH), Part 1, §5.8.5 (Role/Accessibility descriptors). Defines the urn:mpeg:dash:role:2011 role scheme (main, alternate, commentary, dub, description, sign, …). Primary source (Tier 1); normative text paywalled, cited from the standard's catalogue and the DASH-IF guidelines that mirror it. https://www.iso.org/standard/79329.html
- DASH-IF Interoperability Guidelines, Part 8: Audio, v5.0.0 (2021-11). Clarifies main/alternate for directly-presentable services vs commentary for receiver-mix tracks. Tier 3 (standards-body reference implementation guidance). https://dashif.org/docs/IOP-Guidelines/DASH-IF-IOP-Part8-v5.0.0.pdf
- W3C Web Content Accessibility Guidelines (WCAG) 2.1. Success criterion 1.2.5 (Audio Description, Level AA) and 1.2.6 (Sign Language, Level AAA). Primary source (Tier 1). https://www.w3.org/WAI/WCAG21/Techniques/general/G54
- US Code of Federal Regulations, 47 CFR 79.3, Audio description of video programming. The 87.5-hours-per-quarter requirement and the market-expansion schedule. Primary source (Tier 1). https://www.ecfr.gov/current/title-47/chapter-I/subchapter-C/part-79/subpart-A/section-79.3
- Federal Communications Commission, Consumer Guide: Audio Description, and the January 2026 market-expansion notice (DMAs 111–120; top-five non-broadcast networks set 2024-07-01). Tier 3 (regulator guidance). https://www.fcc.gov/consumers/guides/audio-description
- European Accessibility Act (Directive (EU) 2019/882), in force 2025-06-28, with EN 301 549 as the harmonized technical standard, operating alongside the AVMSD. Tier 1 (legislation) plus Tier 6 explanatory coverage. https://digital-strategy.ec.europa.eu/en/policies/avmsd-content-distribution
- Netflix Partner Help Center, Localization, Accessibility, and Dubbing Branded Delivery Specifications. Real-world deployment of dubbed and described tracks, broadcast-mix and receiver-mix patterns. Tier 4 (production deployer). https://partnerhelp.netflixstudios.com/hc/en-us/articles/7357416307603
Note on a source conflict (per our standards-first rule): some vendor blogs describe AUTOSELECT as merely cosmetic. RFC 8216 §4.3.4.1 is normative: AUTOSELECT=YES permits environment-based selection, and it MUST be YES when DEFAULT is YES. The article follows the RFC.


