Why this matters
If you are a founder, product manager, or first-time streaming CTO, the non-video tracks are where catalogs quietly become unscalable or non-compliant. A team that muxes (merges) audio into the video segments discovers, three languages in, that adding a fourth means re-packaging the entire video ladder again — and the storage bill grows with every dub. A team that treats captions as "subtitles we'll add later" discovers that any title which aired on US television with captions is legally required to carry them online, and that subtitles for hearing viewers are not the same artifact as captions for deaf and hard-of-hearing viewers. This article gives you the mental model to plan the audio, subtitle, and accessibility tracks once: what each track type is, how to keep audio cheap by decoupling it from video, which caption format to choose, how the tracks ride in the manifest, and the accessibility law that turns some of these from "nice to have" into "must ship".
The package is more than a stack of video
Earlier in this block we treated the encoding ladder as a set of video renditions — the same title at several resolution-and-bitrate steps so the player can pick the rung the network can afford. (If that idea is new, start with the encoding ladder explained.) That picture is incomplete. A real streaming package carries several parallel lanes, only one of which is video.
Think of the package as a multitrack recording, not a single tape. There is one video lane (the ladder of renditions). Beside it sits an audio lane that can hold several tracks — the original language, dubs in other languages, a director's commentary, a described-video track for blind and low-vision viewers. Beside that sits a text lane holding subtitle and caption tracks, one per language, plus special-purpose tracks. The player assembles a viewing session by picking one rung from the video lane and one track from each of the others: 1080p video, Spanish audio, English captions, for example. The job of packaging is to lay these lanes side by side so any combination plays in sync.
The reason this matters for cost and scale is that the lanes have wildly different sizes. Video is enormous; audio is small; text is tiny. A two-hour film's video ladder might sum to roughly 17.5 megabits per second of stored bitrate across its rungs (the ladder math is worked out in the encoding ladder article). One stereo audio track is around 128–256 kilobits per second — under 2% of that. A full subtitle file for a feature is a few hundred kilobytes — a rounding error. So the tracks themselves are cheap. What is not cheap is accidentally tying a cheap track to an expensive one, which is exactly the mistake the next section exists to prevent.
Figure 1. The package is multitrack: one video ladder beside several audio tracks, several text tracks, and accessibility tracks. The player picks one from each lane and plays them in sync.
Audio: one set of video, many languages — if you demux
Here is the most expensive decision in this whole article, and most teams make it by accident.
When you package video and audio, you can either mux them — interleave the audio samples into the same segment files as the video, so each segment is a self-contained video-plus-audio chunk — or demux them, keeping audio in its own separate segment files that the manifest points to alongside the video. Muxing is the obvious default, because it is how a normal MP4 on your laptop works: one file, picture and sound together. For single-language streaming it is fine. For multi-language streaming it is a trap.
The trap is duplication. If audio is muxed into the video segments, then every audio language needs its own complete copy of the video segments, because the video and audio are physically welded together in the same files. Two languages means two full video ladders. Ten languages means ten. The video — the expensive lane — gets multiplied by a number that has nothing to do with video.
Demuxing breaks the weld. With audio in separate files, you store the video ladder once and add one small audio track per language beside it. HLS was built for exactly this: the EXT-X-MEDIA tag, defined in the HLS specification (IETF RFC 8216 §4.3.4.1), declares an alternate audio rendition that the player fetches separately and plays in sync with whichever video rung it chose. DASH does the same by putting audio in its own AdaptationSet (ISO/IEC 23009-1). Decoupling audio from video is the standard, supported way to serve many languages without re-storing the picture.
Figure 2. Muxing welds audio into the video, so each language re-stores the whole ladder; demuxing shares one video ladder and adds a small track per language — about 9× less.
Walk the arithmetic out loud, because the gap is dramatic. Take that two-hour film with a 17.5 Mbps video ladder, and say you want ten audio languages at 128 kbps each:
Muxed (audio welded to video):
one language = 17.5 Mbit/s of video variants
ten languages = 17.5 Mbit/s × 10 = 175 Mbit/s stored
(you stored the whole picture ten times)
Demuxed (audio separate):
video, stored once = 17.5 Mbit/s
audio, ten tracks = 0.128 Mbit/s × 10 = 1.28 Mbit/s
total = 17.5 + 1.28 = 18.78 Mbit/s
(the picture once, plus a ~7% audio layer)
Ten languages cost you about 9× more storage and egress when muxed than when demuxed — 175 against 18.78 megabits per second of stored material. The same logic flows straight into your delivery bill, because a content delivery network caches and charges for whatever it serves, and ten copies of the video cache ten times worse than one. (Egress is the recurring cost that decides margin; see CDN cost engineering.) Demuxed audio is not an optimization you add later — it is the default a multi-language catalog must start from.
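The arithmetic above can be sketched as a small calculator, so you can plug in your own ladder and language count. This is a minimal sketch under the article's simplified model: the function names are hypothetical, and the muxed case counts only the video ladder per language, exactly as the worked figures do.

```python
def stored_bitrate_mbps(video_ladder_mbps: float, audio_kbps: float,
                        languages: int, muxed: bool) -> float:
    """Stored bitrate for one title, in Mbit/s, under the simplified model."""
    if languages < 1:
        raise ValueError("a title needs at least one audio language")
    if muxed:
        # Each language re-stores the whole video ladder. Like the worked
        # figures, this ignores the small audio share inside each copy.
        return video_ladder_mbps * languages
    # Video stored once, plus one small audio track per language.
    return video_ladder_mbps + (audio_kbps / 1000) * languages


def stored_gigabytes(bitrate_mbps: float, duration_seconds: float) -> float:
    """Convert a stored bitrate into decimal gigabytes for a running time."""
    megabits = bitrate_mbps * duration_seconds
    return megabits / 8 / 1000  # Mbit -> MB -> GB


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    for n in (1, 2, 5, 10):
        muxed = stored_bitrate_mbps(17.5, 128, n, muxed=True)
        demuxed = stored_bitrate_mbps(17.5, 128, n, muxed=False)
        print(f"{n:>2} languages: "
              f"muxed {stored_gigabytes(muxed, two_hours):7.1f} GB, "
              f"demuxed {stored_gigabytes(demuxed, two_hours):6.1f} GB, "
              f"ratio {muxed / demuxed:.1f}x")
```

Running the loop shows the shape of the curve: the demuxed total barely moves as languages are added, while the muxed total grows linearly with the language count.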
A word on which audio codec. The universal baseline is AAC-LC (Advanced Audio Coding, Low Complexity): every device that plays HLS or DASH decodes stereo AAC, so it is the safe floor for reach. Above it sit the surround and immersive formats. Dolby Digital (AC-3) and the more efficient Dolby Digital Plus (E-AC-3) carry 5.1 surround, and E-AC-3 can also carry Dolby Atmos object-based immersive audio; Dolby AC-4 is Dolby's newer codec. Apple's HLS authoring rules list AAC, AC-3, E-AC-3, and (on newer systems) the bitrate-adaptive xHE-AAC among supported audio formats. The product pattern is to ship AAC-LC stereo as the always-plays fallback and offer a surround rendition (commonly E-AC-3) as an alternate in the same audio group for the living-room devices that support it. That is another reason to demux: different codecs for the same language are just more renditions in the audio group, not more copies of the video.
| Audio codec | Channels | Device coverage | Use in the package |
|---|---|---|---|
| AAC-LC | Stereo (2.0) | Universal — every HLS/DASH device | The baseline fallback; always include it |
| HE-AAC v1/v2 | Stereo | Very broad | Low-bitrate stereo for constrained networks |
| xHE-AAC | Stereo | Newer iOS/Android, modern TVs | Bitrate-adaptive speech/music; check device list |
| Dolby Digital Plus (E-AC-3) | 5.1 / Atmos | Most smart TVs, streaming devices, consoles | Surround/immersive alternate for the living room |
| Dolby AC-4 | 5.1 / Atmos | Newer TVs and devices | Next-gen immersive; verify the target device matrix |
Table 1. Audio codecs for OTT, with the coverage column that tells you reach. AAC-LC is the floor everything else sits above; demuxed audio lets you offer several codecs per language as alternates. Device support is dated, so re-verify it against your target devices (see renditions per device).
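The "AAC-LC floor plus surround alternate" pattern can be sketched as a selection rule. This is an illustrative sketch, not how any particular player chooses: the `AudioRendition` type, the codec labels ("aac-lc", "e-ac-3"), and the preference order are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Set


@dataclass(frozen=True)
class AudioRendition:
    language: str   # e.g. "en"
    codec: str      # hypothetical label, e.g. "aac-lc" or "e-ac-3"
    channels: str   # e.g. "2.0" or "5.1"


def pick_audio(renditions: Iterable[AudioRendition],
               language: str,
               device_codecs: Set[str],
               preference: Sequence[str] = ("e-ac-3", "aac-lc"),
               ) -> Optional[AudioRendition]:
    """Return the most preferred rendition the device can decode.

    `preference` is ordered best-first and ends with the AAC-LC fallback.
    Returns None when the language is not offered in a decodable codec.
    """
    candidates = [r for r in renditions if r.language == language]
    for codec in preference:
        if codec not in device_codecs:
            continue  # the device cannot decode this codec, try the next
        for rendition in candidates:
            if rendition.codec == codec:
                return rendition
    return None


if __name__ == "__main__":
    group = [
        AudioRendition("en", "aac-lc", "2.0"),
        AudioRendition("en", "e-ac-3", "5.1"),
        AudioRendition("es", "aac-lc", "2.0"),
    ]
    print(pick_audio(group, "en", {"aac-lc", "e-ac-3"}))  # living-room device
    print(pick_audio(group, "en", {"aac-lc"}))            # stereo-only device
```

Because every language always has an AAC-LC rendition in this sketch, a stereo-only device never ends up without audio; the surround track is an upgrade, not a dependency.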
Loudness: the track nobody notices until it's wrong
There is one property of the audio lane that viewers never thank you for getting right and always punish you for getting wrong: loudness. If your trailer plays at one perceived volume and your feature plays quieter, the viewer grabs the remote — and blames your app. Consistency across a catalog is an engineering requirement, not a mastering nicety.
Loudness is measured, not guessed. The international standard for how to measure the perceived loudness of program audio is ITU-R BS.1770 (current edition BS.1770-4, 2015), which defines a "K-weighted" measurement that approximates how human hearing weights different frequencies. Its unit is LUFS (Loudness Units relative to Full Scale), identical to the LKFS you will see in American documents — the two acronyms name the same number. Around that measurement, different industries set different targets:
- Broadcast Europe anchors on EBU R128: an integrated program loudness of −23 LUFS.
- Broadcast US anchors on ATSC A/85: −24 LKFS — the practice behind the US "CALM Act" that stops ads being louder than programs.
- On-demand streaming runs quieter. Vendor-reported delivery targets for Netflix, Amazon, Disney+, and Max sit around −27 LKFS, measured dialogue-gated (loudness sampled only where speech is present), with a true-peak ceiling of −2 dBTP.
- The AES published a streaming-specific recommendation, TD1008 (2021), suggesting roughly −18 LUFS for speech-led content and −16 LUFS for music, to suit headphone and mobile listening.
You do not have to memorize these — you have to pick one target per delivery context and normalize every title to it. The cost of skipping this is concrete: a −16 LUFS music video next to a −27 LKFS film is an 11 dB jump, and roughly every 10 dB doubles perceived loudness, so the viewer hears the next title as about twice as loud. The mechanics of measuring and correcting loudness — gating, true-peak limiting, loudness range — are an audio-engineering topic in their own right, and we keep them where they belong: see loudness normalization in our Audio for Video section. For the OTT product decision, hold one rule: choose a catalog-wide loudness target and enforce it in your ingest QC (encoding QC and the mezzanine workflow is where that gate lives).
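The 11 dB example can be checked with two lines of arithmetic. The sketch below assumes the rule of thumb quoted above (about 10 dB per doubling of perceived loudness) and treats normalization as a single static gain; it does not measure loudness or apply true-peak limiting, and the function names are hypothetical.

```python
def perceived_loudness_ratio(delta_db: float) -> float:
    """Rule-of-thumb ratio: every +10 dB is heard as roughly twice as loud."""
    return 2 ** (delta_db / 10)


def normalization_gain_db(measured_lufs: float, target_lufs: float) -> float:
    """Static gain (dB) that moves a measured integrated loudness to the target.

    One loudness unit corresponds to one decibel of gain, so the offset is a
    plain subtraction. A positive result may push peaks over the true-peak
    ceiling, which this sketch does not check.
    """
    return target_lufs - measured_lufs


if __name__ == "__main__":
    music_video, film = -16.0, -27.0
    jump = music_video - film
    print(f"jump: {jump:.0f} dB, heard as about "
          f"{perceived_loudness_ratio(jump):.1f}x as loud")
    print(f"gain to bring the music video to the film's target: "
          f"{normalization_gain_db(music_video, film):+.0f} dB")
```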
Subtitles and captions are not the same thing
The text lane carries two things that look identical on screen and are different in law and in content. Getting the distinction right is the difference between an accessible platform and a lawsuit.
Subtitles assume you can hear the audio but not understand the language. They translate spoken dialogue — French audio, English subtitles — and nothing else. Captions assume you cannot hear the audio at all. They include the dialogue and the non-speech information a hearing viewer gets for free: a phone ringing, a door slamming, [ominous music], the fact that an off-screen voice is speaking. The fuller caption form is often labelled SDH (Subtitles for the Deaf and Hard-of-hearing), which packages caption-style information in the look of subtitles. Shipping subtitles and calling the platform accessible is a common and serious error: a deaf viewer needs the captions, not the translation.
The text itself comes in a small set of formats, and the right one depends on how it is carried.
- WebVTT (Web Video Text Tracks) is the web's common caption format: a simple plain-text format that grew out of the SubRip (.srt) tradition. In fragmented-MP4 / CMAF packaging it is carried as the wvtt track type. It is the text format Apple's HLS path leans on, and the pragmatic default for most catalogs.
- IMSC 1.1 (TTML Profiles for Internet Media Subtitles and Captions) is the broadcast-grade option: a profile of the XML-based Timed Text Markup Language (TTML), published as a W3C Recommendation (8 November 2018; second edition 4 August 2020). It offers richer styling and positioning, defines both a Text profile and an Image profile (each cue a PNG, for scripts that need exact rendering), and is the TTML profile that the CMAF packaging standard allows. In fMP4 it is carried as the stpp track type.
- CEA-608 / CEA-708 (now standardized as CTA-608 and CTA-708 by the Consumer Technology Association) are the embedded broadcast captions. Unlike WebVTT and IMSC, they are not separate sidecar files; the caption data is carried inside the video stream itself. CEA-608 is the analog-era format (the white-on-black uppercase look); CEA-708 is the digital successor with fonts, colors, and positioning, and it must also carry 608 data for backward compatibility ("608 over 708"). They matter for OTT mainly when your source comes from a broadcast chain that already has captions baked into the video, and for the living-room devices that expect them.
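To show how simple the WebVTT sidecar in the list above is, here is a minimal sketch that writes one from a list of timed cues. The helper names are hypothetical, and the sketch skips cue settings, styling, and validation (for example, WebVTT cue text may not contain blank lines or the "-->" sequence).

```python
from typing import Iterable, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def build_webvtt(cues: Iterable[Cue]) -> str:
    """Build a WebVTT document: the WEBVTT header, then blank-line-separated cues."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        if end <= start:
            raise ValueError("a cue must end after it starts")
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line closes the cue
    return "\n".join(lines)


if __name__ == "__main__":
    # Caption-style cues: dialogue plus a non-speech sound in brackets.
    print(build_webvtt([
        (1.0, 4.0, "[phone ringing]"),
        (4.5, 7.25, "Who is calling at this hour?"),
    ]))
```

The bracketed sound cue is what separates a caption file from a translation-only subtitle file; the container format is the same either way.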
This carriage difference drives a packaging consequence worth flagging now. WebVTT and IMSC are sidecar text — separate files the manifest points to, just like demuxed audio, so they cost almost nothing and are easy to add per language. CEA-608/708 are embedded — welded into the video segments — which means they ride along automatically but cannot be added, swapped, or styled without touching the video. In HLS this shows up in the spec itself: a CLOSED-CAPTIONS rendition declared with EXT-X-MEDIA has no URI, precisely because the captions live in the video segments, whereas a SUBTITLES rendition does have a URI pointing at its own sidecar files (RFC 8216 §4.3.4.1).
| Text format | Carriage | HLS? | DASH? | Styling | Best for |
|---|---|---|---|---|---|
| WebVTT (wvtt) | Sidecar fMP4 / text | Yes | Yes | Basic | The pragmatic web default; easy per-language |
| IMSC 1.1 / TTML (stpp) | Sidecar fMP4 | Yes | Yes | Rich (text + image) | Broadcast-grade styling, premium catalogs |
| CEA-608 (CTA-608) | Embedded in video | Yes | Limited | Minimal | Legacy broadcast source; some TV devices |
| CEA-708 (CTA-708) | Embedded in video | Yes | Limited | Moderate | Broadcast-origin captions, US TV devices |
Table 2. Caption and subtitle formats, with the HLS/DASH coverage columns. Sidecar formats (WebVTT, IMSC) add per-language tracks cheaply; embedded formats (608/708) ride inside the video and cannot be added or styled without re-touching it.
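The sidecar-versus-embedded split shows up as a rule you can lint for: per RFC 8216 §4.3.4.1, a CLOSED-CAPTIONS rendition must not carry a URI and a SUBTITLES rendition must. The sketch below is a hypothetical checker for single EXT-X-MEDIA lines; the attribute parsing is deliberately simple and the function names are assumptions, not part of any HLS library.

```python
import re
from typing import Dict, List

# NAME=value pairs; a value is either a quoted string (which may contain
# commas) or an unquoted token that runs up to the next comma.
_ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')
_PREFIX = "#EXT-X-MEDIA:"


def parse_ext_x_media(line: str) -> Dict[str, str]:
    """Parse one EXT-X-MEDIA tag into a dict of attribute name to value."""
    if not line.startswith(_PREFIX):
        raise ValueError("not an EXT-X-MEDIA tag")
    return {key: value.strip('"')
            for key, value in _ATTR_RE.findall(line[len(_PREFIX):])}


def uri_rule_violations(line: str) -> List[str]:
    """Check only the URI rule for caption and subtitle renditions."""
    attrs = parse_ext_x_media(line)
    problems = []
    media_type = attrs.get("TYPE")
    if media_type == "CLOSED-CAPTIONS" and "URI" in attrs:
        problems.append("CLOSED-CAPTIONS is embedded in the video; drop URI")
    if media_type == "SUBTITLES" and "URI" not in attrs:
        problems.append("SUBTITLES is a sidecar rendition; URI is required")
    return problems


if __name__ == "__main__":
    sample = ('#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="sub",'
              'NAME="English",LANGUAGE="en"')
    print(uri_rule_violations(sample))
```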
The accessibility tracks: described video and forced narrative
Two more track types round out the lanes, and both are easy to forget until a viewer — or a regulator — points out the gap.
Described video (also called audio description) is a separate audio track that narrates the important visual action during pauses in dialogue: "she slips the note into his coat pocket." It serves blind and low-vision viewers, and it is a full audio rendition in its own right — another reason the audio lane needs to be cheap to extend. In HLS it is flagged with the CHARACTERISTICS attribute value public.accessibility.describes-video; in DASH it carries an Accessibility or Role descriptor marking it as a description track.
Forced-narrative subtitles (also called "forced subtitles") are a small, special subtitle track that appears automatically for everyone, regardless of their subtitle setting, to translate dialogue or signs the main audience cannot understand: the alien dialogue in an English film, a foreign-language sign, a letter shown on screen. They are not the full subtitle track; they are the handful of lines that must always be readable. HLS marks them with FORCED=YES on the EXT-X-MEDIA tag (RFC 8216 §4.3.4.1). Skip them and an English-speaking viewer of an English film simply cannot follow the scene where two characters speak Russian, a quality gap that generates support tickets and bad reviews.
HLS exposes the accessibility intent of a track through that same CHARACTERISTICS attribute, using Apple's Uniform Type Identifiers: public.accessibility.transcribes-spoken-dialog (this subtitle track is really captions), public.accessibility.describes-music-and-sound (it includes non-speech sounds — the SDH signal), and public.easy-to-read (edited for easier reading). These are not decoration: they are how a player's accessibility menu knows to offer "English (CC)" rather than plain "English", and how the platform proves it shipped captions rather than mere subtitles.
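How a menu might turn those characteristics into labels can be sketched in a few lines. This is an assumption-laden illustration, not a description of any real player: the "(CC)" rule (both caption characteristics present on a subtitle track) and the "(AD)" suffix for described video are choices made for the example.

```python
TRANSCRIBES_DIALOG = "public.accessibility.transcribes-spoken-dialog"
DESCRIBES_SOUND = "public.accessibility.describes-music-and-sound"
DESCRIBES_VIDEO = "public.accessibility.describes-video"


def menu_label(name: str, media_type: str, characteristics: str = "") -> str:
    """Derive a menu label from an EXT-X-MEDIA NAME, TYPE, and CHARACTERISTICS."""
    # CHARACTERISTICS is a comma-separated list of identifiers.
    tags = {tag.strip() for tag in characteristics.split(",") if tag.strip()}
    if media_type == "AUDIO" and DESCRIBES_VIDEO in tags:
        return f"{name} (AD)"  # hypothetical suffix for described video
    if media_type == "SUBTITLES" and {TRANSCRIBES_DIALOG, DESCRIBES_SOUND} <= tags:
        return f"{name} (CC)"  # dialogue plus non-speech sound: captions
    return name                # plain dub or translation-only subtitles


if __name__ == "__main__":
    print(menu_label("English", "SUBTITLES",
                     f"{TRANSCRIBES_DIALOG},{DESCRIBES_SOUND}"))
    print(menu_label("English", "SUBTITLES"))
    print(menu_label("English", "AUDIO", DESCRIBES_VIDEO))
```

The same check works as a catalog audit: a title whose only subtitle tracks come back without "(CC)" has translations but no captions.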
Figure 3. The non-video track types and who each serves: dubbed audio, subtitles for translation, SDH captions for deaf and hard-of-hearing viewers, described video for blind viewers, and forced-narrative subtitles that always show.
How the tracks ride in the manifest
All of this — audio languages, subtitle languages, captions, accessibility tracks — is wired together in the manifest, the small text index the player reads first. The wiring is simpler than it looks: the manifest declares each non-video track as a member of a named group, then tells each video rung which groups go with it.
In HLS, the master playlist lists each alternate track with an EXT-X-MEDIA tag carrying its TYPE (AUDIO, SUBTITLES, or CLOSED-CAPTIONS), a GROUP-ID, a LANGUAGE, a human-readable NAME, and selection flags (DEFAULT, AUTOSELECT, FORCED). Each video Variant Stream (EXT-X-STREAM-INF, RFC 8216 §4.3.4.2) then names the audio and subtitle groups it pairs with. A trimmed example:
#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,URI="audio/en.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",NAME="Español",LANGUAGE="es",AUTOSELECT=YES,URI="audio/es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",NAME="English (described)",LANGUAGE="en",CHARACTERISTICS="public.accessibility.describes-video",URI="audio/en-ad.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="sub",NAME="English (CC)",LANGUAGE="en",CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog,public.accessibility.describes-music-and-sound",AUTOSELECT=YES,URI="subs/en.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=6000000,CODECS="avc1.640028,mp4a.40.2",AUDIO="aud",SUBTITLES="sub"
video/1080p.m3u8
One video line, several tracks beside it, all sharing the same video. In DASH, the same idea uses separate AdaptationSets — one per media type and language — each with an @lang attribute and, for accessibility, a Role or Accessibility descriptor (the role scheme urn:mpeg:dash:role:2011, ISO/IEC 23009-1):
<AdaptationSet contentType="audio" lang="es">
<Representation id="aud-es" bandwidth="128000" codecs="mp4a.40.2"/>
</AdaptationSet>
<AdaptationSet contentType="text" lang="en">
<Role schemeIdUri="urn:mpeg:dash:role:2011" value="caption"/>
<Accessibility schemeIdUri="urn:tva:metadata:cs:AudioPurposeCS:2007" value="2"/>
<Representation id="sub-en" mimeType="application/mp4" codecs="stpp"/>
</AdaptationSet>
The shape is the same in both formats: video once, every other track declared beside it and selected by language and role. This is the manifest layer of the same "package once" principle that governs the video itself — see packaging: CMAF, HLS, and DASH from one mezzanine — and it is why decoupling audio and text from video is not just cheaper but the way the standards expect you to work. For the byte-level format details — how WebVTT and IMSC are wrapped in fMP4, how DASH descriptors resolve — see our Video Streaming write-up on WebVTT, IMSC, and multiple audio renditions.
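In practice a packager emits these manifest lines from a track list rather than by hand. The sketch below generates EXT-X-MEDIA lines in the same attribute order as the trimmed HLS example; the `Track` type, its field names, and the file paths are hypothetical, and a real packager handles many more attributes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Track:
    media_type: str            # "AUDIO" or "SUBTITLES"
    group_id: str
    name: str
    language: str
    uri: str
    default: bool = False
    autoselect: bool = False
    forced: bool = False
    characteristics: Tuple[str, ...] = ()


def _quoted(value: str) -> str:
    """HLS quoted-strings cannot contain double quotes or line breaks."""
    if '"' in value or "\n" in value or "\r" in value:
        raise ValueError(f"not representable as a quoted-string: {value!r}")
    return f'"{value}"'


def ext_x_media(track: Track) -> str:
    """Render one EXT-X-MEDIA line for a sidecar audio or subtitle track."""
    if track.forced and track.media_type != "SUBTITLES":
        raise ValueError("FORCED only applies to SUBTITLES renditions")
    attrs = [
        f"TYPE={track.media_type}",
        f"GROUP-ID={_quoted(track.group_id)}",
        f"NAME={_quoted(track.name)}",
        f"LANGUAGE={_quoted(track.language)}",
    ]
    if track.default:
        attrs.append("DEFAULT=YES")
    if track.autoselect or track.default:
        # A default rendition must not declare AUTOSELECT=NO, so emit YES.
        attrs.append("AUTOSELECT=YES")
    if track.forced:
        attrs.append("FORCED=YES")
    if track.characteristics:
        attrs.append(f"CHARACTERISTICS={_quoted(','.join(track.characteristics))}")
    attrs.append(f"URI={_quoted(track.uri)}")
    return "#EXT-X-MEDIA:" + ",".join(attrs)


if __name__ == "__main__":
    tracks = [
        Track("AUDIO", "aud", "English", "en", "audio/en.m3u8", default=True),
        Track("AUDIO", "aud", "Español", "es", "audio/es.m3u8", autoselect=True),
        Track("SUBTITLES", "sub", "English (forced)", "en",
              "subs/en-forced.m3u8", forced=True),
    ]
    print("\n".join(ext_x_media(t) for t in tracks))
```

Adding a language is then one more `Track` entry and one more small set of segment files; nothing in the video ladder changes.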
Figure 4. How the tracks ride in the manifest. One video variant references named audio and subtitle groups; HLS uses EXT-X-MEDIA, DASH uses separate AdaptationSets keyed by language and role.
The law you can't skip: CVAA and WCAG
Some of these tracks are optional product polish. Others are legal obligations, and the line between them is worth knowing before launch, not after a complaint.
In the United States, the 21st Century Communications and Video Accessibility Act of 2010 (CVAA) directed the Federal Communications Commission to require captions on internet-delivered video. The implementing rule, 47 CFR §79.4, is specific: full-length video programming delivered over the internet must carry closed captions if it was shown on US television with captions, phased in by content type — prerecorded-and-unedited from 30 September 2012, live and near-live from 30 March 2013, and prerecorded-but-edited-for-internet from 30 September 2013. The practical test is simple: if a title aired on US TV with captions, your online version needs them too, and the exemptions are narrower than for traditional broadcast. Content that never aired on TV is not covered by §79.4 — but that is a narrow carve-out, not a general pass, and it does not touch the separate accessibility expectations below.
Where a platform follows the Web Content Accessibility Guidelines (WCAG) 2.1 — the reference standard many accessibility laws and procurement rules point to — three success criteria govern this article's tracks: 1.2.2 Captions (Prerecorded) at Level A requires captions for prerecorded audio; 1.2.4 Captions (Live) at Level AA extends that to live content; and 1.2.5 Audio Description (Prerecorded) at Level AA requires a described-video track for prerecorded video. A platform claiming WCAG 2.1 AA conformance is therefore committing to captions on everything and audio description on prerecorded video — which is exactly why the described-video and SDH tracks above are not optional extras for many operators. The full legal treatment lives in Block 8: see accessibility law for streaming: captions, audio description, WCAG. For this article, the takeaway is that your track plan has a compliance dimension, and the cheapest time to satisfy it is when you first design the lanes — not after launch.
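The two triggers described in this section can be written down as a planning checklist. The sketch below encodes only the article's summary of §79.4 and the three WCAG 2.1 criteria, with hypothetical parameter names; it is a way to flag titles for review in a track plan, not a reading of the rule text or its exemptions.

```python
from typing import Set


def required_tracks(aired_on_us_tv_with_captions: bool,
                    full_length: bool,
                    live: bool,
                    claims_wcag_21_aa: bool) -> Set[str]:
    """Return the accessibility tracks a title needs under the summarized rules."""
    required: Set[str] = set()

    # 47 CFR 79.4, as summarized above: full-length programming that aired
    # on US television with captions needs captions online.
    if aired_on_us_tv_with_captions and full_length:
        required.add("captions")

    if claims_wcag_21_aa:
        # SC 1.2.2 (prerecorded, Level A) and SC 1.2.4 (live, Level AA).
        required.add("captions")
        if not live:
            # SC 1.2.5 Audio Description (Prerecorded), Level AA.
            required.add("audio_description")

    return required


if __name__ == "__main__":
    # A prerecorded film that aired on US TV, on a platform claiming WCAG 2.1 AA.
    print(sorted(required_tracks(True, True, live=False, claims_wcag_21_aa=True)))
```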
A common mistake: muxing audio into video
The single most expensive error in this article is the one a team never sees, because the stream works: muxing audio into the video segments. A single-language pilot ships, audio welded into each video chunk, and it plays perfectly. Then the catalog goes international. Adding Spanish means re-packaging the entire video ladder with Spanish audio inside it; adding eight more languages means eight more full copies of the picture. Storage and egress climb with a multiplier — up to roughly 9× on a ten-language title, as the arithmetic above showed — and nobody connects the rising bill to a packaging choice made on day one. The fix is to demux audio from the start, even for a single-language launch, so the second language is a small new track and not a second catalog.
Three related faults travel with it. The first is subtitles posing as captions: shipping translation-only subtitle tracks and claiming accessibility, when deaf and hard-of-hearing viewers need SDH captions that include non-speech sound — and, in the US, when §79.4 may legally require them. The second is skipping forced-narrative subtitles, so a viewer of an English film cannot read the scene where characters speak another language; it is a small track with an outsized effect on perceived quality. The third is no catalog-wide loudness target, so titles arrive at whatever loudness their masters happened to have and viewers ride the volume control between every program. Each fault is cheap to prevent in the track plan and expensive to retrofit across a live catalog.
Where Fora Soft fits in
The non-video tracks are where a streaming catalog's language reach, accessibility compliance, and per-title storage are quietly decided. Engineering them to scale means demuxed audio that adds a language without re-storing the picture, a caption strategy that satisfies CVAA and WCAG, a catalog-wide loudness target enforced at ingest, and forced-narrative and described-video tracks wired correctly into every manifest. That is the difference between a catalog that internationalizes cheaply and one that multiplies its bill with every dub. Fora Soft has built video streaming, OTT/Internet TV, e-learning, telemedicine, and video surveillance software since 2005, across 625+ shipped projects for 400+ clients, and that work centers on exactly this kind of scale-and-compliance engineering: designing packaging and manifest workflows so one master serves every language and every accessibility need. When a media company needs a platform whose audio, subtitle, and accessibility tracks survive a real, multi-territory audience, that track-and-manifest engineering is the capability we bring.
What to read next
- The encoding ladder explained: renditions, resolutions, bitrates
- Packaging: CMAF, HLS, and DASH from one mezzanine
- Loudness normalization
Call to action
- Talk to a streaming engineer: book a 30-minute scoping call to talk through your audio and subtitle track plan.
- See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Audio & Subtitle Track Worksheet: a one-page worksheet to lock your non-video tracks before you ask a vendor for a quote: list your audio languages and codecs (AAC-LC floor plus surround alternates), set a catalog-wide loudness target, choose caption formats (WebVTT vs….
References
- RFC 8216: HTTP Live Streaming (HLS). IETF. §4.3.4.1 (EXT-X-MEDIA: the TYPE, GROUP-ID, LANGUAGE, DEFAULT, AUTOSELECT, FORCED, and CHARACTERISTICS attributes; the rule that a CLOSED-CAPTIONS rendition has no URI while a SUBTITLES rendition does) and §4.3.4.2 (EXT-X-STREAM-INF linking a video variant to its AUDIO and SUBTITLES groups). Tier 1 (official standard). https://www.rfc-editor.org/rfc/rfc8216.html (accessed 2026-06-16)
- ISO/IEC 23009-1: Dynamic Adaptive Streaming over HTTP (MPEG-DASH). ISO/IEC. The AdaptationSet model that separates audio, video, and text by @contentType and @lang, and the Role (urn:mpeg:dash:role:2011) and Accessibility descriptors that mark caption, description, and dub tracks. Tier 1. https://www.iso.org/standard/83314.html (accessed 2026-06-16)
- TTML Profiles for Internet Media Subtitles and Captions 1.1 (IMSC 1.1). W3C Recommendation, 8 November 2018; second edition 4 August 2020. Defines the Text and Image profiles of TTML used for broadcast-grade subtitles/captions, and the profile CMAF permits (carried as stpp in fMP4). Tier 1. https://www.w3.org/TR/ttml-imsc1.1/ (accessed 2026-06-16)
- WebVTT: The Web Video Text Tracks Format. W3C. The plain-text caption/subtitle format used on the web and carried as wvtt in fMP4/CMAF. Tier 1. https://www.w3.org/TR/webvtt1/ (accessed 2026-06-16)
- Web Content Accessibility Guidelines (WCAG) 2.1. W3C Recommendation. SC 1.2.2 Captions (Prerecorded, Level A), SC 1.2.4 Captions (Live, Level AA), SC 1.2.5 Audio Description (Prerecorded, Level AA). Tier 1. https://www.w3.org/TR/WCAG21/ (accessed 2026-06-16)
- 47 CFR §79.4: Closed captioning of video programming delivered using Internet protocol. US Code of Federal Regulations (implementing the CVAA, Pub. L. 111-260). The phased dates (30 Sep 2012 / 30 Mar 2013 / 30 Sep 2013) and the "aired on US TV with captions" trigger. Tier 1 (statute/regulation). https://www.ecfr.gov/current/title-47/chapter-I/subchapter-C/part-79/subpart-A/section-79.4 (accessed 2026-06-16)
- Recommendation ITU-R BS.1770: Algorithms to measure audio programme loudness and true-peak audio level. ITU-R (current edition BS.1770-4, 2015). Defines K-weighted loudness measurement and the LUFS/LKFS unit underlying every loudness target. Tier 1. https://www.itu.int/rec/R-REC-BS.1770 (accessed 2026-06-16)
- EBU R 128: Loudness normalisation and permitted maximum level of audio signals. European Broadcasting Union. The −23 LUFS integrated target for European program audio. Tier 2 (standards body recommendation). https://tech.ebu.ch/publications/r128 (accessed 2026-06-16)
- ATSC A/85: Techniques for Establishing and Maintaining Audio Loudness for Digital Television. Advanced Television Systems Committee. The −24 LKFS US broadcast target behind the CALM Act. Tier 2. https://www.atsc.org/atsc-documents/a85-techniques-for-establishing-and-maintaining-audio-loudness-for-digital-television/ (accessed 2026-06-16)
- HLS Authoring Specification for Apple Devices. Apple Inc. Supported audio codecs (AAC, AC-3, E-AC-3, xHE-AAC) and the rendition-group authoring rules for alternate audio and subtitles. Tier 3 (first-party platform spec). https://developer.apple.com/documentation/http-live-streaming/hls-authoring-specification-for-apple-devices (accessed 2026-06-16)
- AES TD1008: Recommendations for Loudness of Internet Audio Streaming and On-Demand Distribution. Audio Engineering Society (2021). The −18 LUFS speech / −16 LUFS music streaming-loudness guidance. Tier 3 (industry standards body). https://www.aes.org/technical/documents/AESTD1008.cfm (accessed 2026-06-16)
- CTA-708 (formerly CEA-708) and CTA-608. Consumer Technology Association. The embedded digital/analog caption standards, "608-over-708" backward compatibility, carried inside the video stream. Tier 2 (standards body); industry orientation via Wikipedia/3Play. https://en.wikipedia.org/wiki/CTA-708 (accessed 2026-06-16)
Source note (per §4.3.2): the manifest mechanism — HLS EXT-X-MEDIA rendition groups and the CLOSED-CAPTIONS-has-no-URI rule, DASH AdaptationSet/Role/Accessibility descriptors, the IMSC and WebVTT text formats — traces to tier-1 standards (refs 1–4). The accessibility obligations trace to WCAG 2.1 and 47 CFR §79.4 (refs 5–6). Loudness measurement is ITU-R BS.1770 (ref 7); the specific targets are EBU/ATSC/AES recommendations (refs 8, 9, 11) and the on-demand −27 LKFS practice is vendor-reported and labelled as such in-text. Audio codec support is Apple's first-party authoring spec (ref 10). No lower-tier source overrode a standard.


