Audio in HLS, DASH, CMAF: how a streaming player actually picks an audio track

Why this matters

If you ship video on demand or live to a browser, a phone, a smart TV, or a set-top box, your player is choosing an audio track for the viewer every single session, and you control that choice only through the manifest. This article is for a product manager, founder, or operations lead with no streaming background who needs to understand why a Spanish viewer got English audio, why the 5.1 track never appeared, or why an ad break knocked the audio out of sync — and to talk to engineers about fixing it. By the end you will be able to read an HLS or DASH manifest's audio section, name the tag that picks the default track, and know which authoring mistakes cause the field's most common complaints. Every rule traces back to the controlling specification — the HLS spec, ISO/IEC 23009-1 for DASH, ISO/IEC 23000-19 for CMAF — not to a vendor summary.

The one idea that explains everything: the player chooses

Start with the mental model that the rest of the article hangs on. In streaming video, the server does not push a chosen audio track to the viewer. Instead, the server publishes a menu — a small text file called a manifest — and the player on the viewer's device reads that menu and decides what to fetch. The audio you hear was selected by code running on the phone or TV, following rules you wrote into the manifest weeks earlier.

That is the opposite of a phone call, where a live negotiation picks the codec for both ends, and it is the opposite of a downloaded file, where the audio is simply whatever is inside. A streaming manifest is more like a restaurant menu with a "house recommendation" marked: the kitchen lists every dish, flags one as the default, marks some as available on request, and the diner — here, the player — makes the final call. If the menu marks the wrong dish as the recommendation, every diner who does not actively choose gets the wrong thing. Hold onto that: almost every audio-selection bug is a menu-authoring bug.

Two manifest formats dominate the open internet in 2026. Apple's HTTP Live Streaming (HLS) uses a playlist written in a line-based text format. MPEG's Dynamic Adaptive Streaming over HTTP (DASH) uses an XML document called a Media Presentation Description, or MPD. They describe the same kind of content in different dialects, and a single library of media files — packaged as CMAF, the format we cover last — can be described by both at once.

[Image: Two manifests, one media library: an HLS multivariant playlist and a DASH MPD both pointing at the same shared pool of CMAF audio and video segments, with the player on the right choosing one audio track]

Figure 1. The server publishes a menu, the player chooses. The same CMAF segment library can be described by an HLS playlist and a DASH MPD at once; the player reads whichever its platform speaks and selects an audio track from it.

How HLS describes audio: rendition groups

In HLS the top-level menu is called the multivariant playlist (older docs call it the master playlist). It does two things: it lists the video quality levels, and it lists the alternative audio tracks. The audio tracks are described with a tag named EXT-X-MEDIA, and the trick that makes HLS work is that these audio tracks are collected into a named bundle called a rendition group.

A rendition group is just a label. Every audio track that shares the same GROUP-ID value belongs to the same group. Then each video quality level — written with an EXT-X-STREAM-INF tag — carries an AUDIO attribute naming the group it is allowed to draw audio from. The video says, in effect, "for my soundtrack, pick any track from the group called aac-stereo." This indirection is what lets the player switch video quality up and down as the network changes while keeping the same chosen language playing underneath.

Here is a minimal multivariant playlist with three languages of stereo AAC and two video qualities:

#EXTM3U
#EXT-X-VERSION:7

# --- audio rendition group "aud-aac" ---
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,CHANNELS="2",URI="audio/en/main.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac",NAME="Spanish",LANGUAGE="es",DEFAULT=NO,AUTOSELECT=YES,CHANNELS="2",URI="audio/es/main.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac",NAME="Director Commentary",LANGUAGE="en",DEFAULT=NO,AUTOSELECT=NO,CHANNELS="2",URI="audio/commentary/main.m3u8"

# --- video variants, each pointing at the same audio group ---
#EXT-X-STREAM-INF:BANDWIDTH=2200000,CODECS="avc1.640028,mp4a.40.2",AUDIO="aud-aac"
video/720p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,CODECS="avc1.640028,mp4a.40.2",AUDIO="aud-aac"
video/1080p.m3u8

Read it the way the player does. There is one audio group, aud-aac, holding three tracks. Both video qualities point at that group, so whichever resolution the player lands on, it has the same three soundtracks to choose from. Now look at the three attributes that decide which track plays.

DEFAULT marks the track the player should pick when the user has expressed no preference. Exactly one track per group is normally marked DEFAULT=YES; here it is English. AUTOSELECT marks tracks the player is allowed to switch to on its own to satisfy a system setting — for instance, if the device's language is Spanish, a player may auto-pick the Spanish track even though English is the default, because Spanish is AUTOSELECT=YES. The commentary track is AUTOSELECT=NO, which means the player must never choose it automatically; the viewer has to ask for it by name. The CHANNELS attribute states the channel count — "2" for stereo — and a 5.1 track would carry CHANNELS="6".

Pitfall — two DEFAULT=YES tracks, or none. The single most common HLS audio bug is marking more than one track in a group as DEFAULT=YES, or forgetting to mark any. The spec expects at most one default per group; ship two and different players resolve the conflict differently — some take the first, some the last, some the device language — so your viewers get inconsistent audio you cannot reproduce. Mark exactly one DEFAULT=YES per group and let AUTOSELECT and LANGUAGE handle the rest.
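The one-default-per-group rule is easy to check mechanically before a playlist ships. The following is a minimal sketch, not a full HLS parser: it assumes the multivariant playlist is available as a string, reads only EXT-X-MEDIA lines, and uses hypothetical helper names (parse_attributes, default_counts, groups_with_bad_default) invented for this illustration.

```python
import re
from collections import defaultdict

# Matches NAME=value pairs; quoted values may contain commas.
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

# The three audio renditions from the playlist shown earlier.
SAMPLE = """\
#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,CHANNELS="2",URI="audio/en/main.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac",NAME="Spanish",LANGUAGE="es",DEFAULT=NO,AUTOSELECT=YES,CHANNELS="2",URI="audio/es/main.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac",NAME="Director Commentary",LANGUAGE="en",DEFAULT=NO,AUTOSELECT=NO,CHANNELS="2",URI="audio/commentary/main.m3u8"
"""


def parse_attributes(line: str) -> dict:
    """Parse the attribute list that follows the first ':' of a tag line."""
    _, _, attr_text = line.partition(":")
    return {key: value.strip('"') for key, value in ATTR_RE.findall(attr_text)}


def default_counts(playlist: str) -> dict:
    """Count DEFAULT=YES audio renditions for each GROUP-ID."""
    counts = defaultdict(int)
    for line in playlist.splitlines():
        if not line.startswith("#EXT-X-MEDIA:"):
            continue
        attrs = parse_attributes(line)
        if attrs.get("TYPE") != "AUDIO":
            continue
        # Adding a bool registers the group even when the count stays at 0.
        counts[attrs.get("GROUP-ID", "")] += attrs.get("DEFAULT") == "YES"
    return dict(counts)


def groups_with_bad_default(playlist: str) -> list:
    """Return the groups that do not have exactly one DEFAULT=YES rendition."""
    return sorted(g for g, n in default_counts(playlist).items() if n != 1)


if __name__ == "__main__":
    print(default_counts(SAMPLE))
    print(groups_with_bad_default(SAMPLE))
```

Run against every multivariant playlist in a packaging pipeline, a check of this kind would flag both the "two defaults" and the "no default" case before a player ever has to resolve them.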

Why surround usually gets its own group

Stereo and 5.1 surround are not just two tracks in one list. Apple's HLS Authoring Specification is explicit that audio of different channel counts should be split into separate groups, with separate AUDIO attributes on the video variants that match them. The reason is bitrate budgeting: a 5.1 track costs far more bits than a stereo track, so a 5.1 group is paired with the higher-bandwidth video variants, while a stereo group serves the lower ones. Putting a 192 kbit/s stereo track and a 640 kbit/s surround track in the same group would let the player pair heavy audio with light video and break the bandwidth math the variants promise.

The practical shape, then, is two groups — aud-aac-stereo and aud-ac3-51 — and each video variant lists the audio group appropriate to its bitrate tier. We unpack the bitrate trade-off itself in audio adaptive bitrate ladders.
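Two authoring mistakes follow directly from this layout: mixing channel counts inside one group, and defining a surround group that no video variant points at. The sketch below checks for both. It is illustrative only: the playlist, its URIs, and its bandwidth figures are made up for the example, and audit_audio_groups is a hypothetical helper, not part of any HLS tool.

```python
import re

# Matches NAME=value pairs; quoted values may contain commas.
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

# Hypothetical playlist: the 5.1 group exists but no variant references it.
PLAYLIST = """\
#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-aac-stereo",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,CHANNELS="2",URI="audio/en/stereo.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-ac3-51",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,CHANNELS="6",URI="audio/en/surround.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2200000,CODECS="avc1.640028,mp4a.40.2",AUDIO="aud-aac-stereo"
video/720p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,CODECS="avc1.640028,mp4a.40.2",AUDIO="aud-aac-stereo"
video/1080p.m3u8
"""


def parse_attributes(line: str) -> dict:
    """Parse the attribute list that follows the first ':' of a tag line."""
    _, _, attr_text = line.partition(":")
    return {key: value.strip('"') for key, value in ATTR_RE.findall(attr_text)}


def audit_audio_groups(playlist: str) -> tuple:
    """Return (groups mixing channel counts, groups no variant references)."""
    channels_by_group = {}  # GROUP-ID -> set of CHANNELS values seen
    referenced = set()      # groups named by some variant's AUDIO attribute
    for line in playlist.splitlines():
        if line.startswith("#EXT-X-MEDIA:"):
            attrs = parse_attributes(line)
            if attrs.get("TYPE") == "AUDIO":
                group = attrs.get("GROUP-ID", "")
                channels_by_group.setdefault(group, set()).add(attrs.get("CHANNELS", "?"))
        elif line.startswith("#EXT-X-STREAM-INF:"):
            attrs = parse_attributes(line)
            if "AUDIO" in attrs:
                referenced.add(attrs["AUDIO"])
    mixed = sorted(g for g, ch in channels_by_group.items() if len(ch) > 1)
    unreachable = sorted(set(channels_by_group) - referenced)
    return mixed, unreachable


if __name__ == "__main__":
    mixed, unreachable = audit_audio_groups(PLAYLIST)
    print("mixed channel counts:", mixed)
    print("unreachable groups:", unreachable)
```

In this sample the surround group is reported as unreachable; adding a high-bandwidth variant whose AUDIO attribute names aud-ac3-51 clears the report.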

How DASH describes audio: AdaptationSets and Representations

DASH carries the same information in a different container. Its manifest, the MPD, is XML, and it nests media in three levels that you need to keep straight.

A Period is a stretch of the timeline — the main feature is one Period; an inserted ad break is usually another Period. Inside a Period sit AdaptationSets, and an AdaptationSet groups versions of one component that are interchangeable. Inside each AdaptationSet sit Representations, which are the actual encoded files at different bitrates. The rule that matters for audio: each language, and each distinct channel layout, gets its own AdaptationSet. English stereo is one AdaptationSet; Spanish stereo is another; English 5.1 is a third. Within the English-stereo AdaptationSet you might have three Representations at 96, 128, and 192 kbit/s, and the player adapts between them as the network changes — the audio equivalent of switching video resolution.

<Period>
  <!-- English stereo: one AdaptationSet, three bitrate Representations -->
  <AdaptationSet contentType="audio" lang="en" audioSamplingRate="48000">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="main"/>
    <Representation id="en-96"  codecs="mp4a.40.2" bandwidth="96000"/>
    <Representation id="en-128" codecs="mp4a.40.2" bandwidth="128000"/>
    <Representation id="en-192" codecs="mp4a.40.2" bandwidth="192000"/>
  </AdaptationSet>

  <!-- Spanish stereo: separate AdaptationSet -->
  <AdaptationSet contentType="audio" lang="es" audioSamplingRate="48000">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="alternate"/>
    <Representation id="es-128" codecs="mp4a.40.2" bandwidth="128000"/>
  </AdaptationSet>

  <!-- English director commentary -->
  <AdaptationSet contentType="audio" lang="en" audioSamplingRate="48000">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="commentary"/>
    <Representation id="comm-96" codecs="mp4a.40.2" bandwidth="96000"/>
  </AdaptationSet>
</Period>

DASH replaces HLS's DEFAULT / AUTOSELECT flags with a Role descriptor. The Role element with value="main" plays the part HLS's DEFAULT=YES plays: it marks the primary track. value="alternate" marks a secondary track such as another language; value="commentary" marks commentary; value="dub" marks a dubbed track; value="description" marks an audio description for visually impaired viewers, which we cover in multi-language and descriptive audio. The lang attribute carries the language, exactly as LANGUAGE does in HLS.
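To see the three levels the way a player does, it can help to walk the XML and print one row per audio AdaptationSet. This is a minimal sketch using Python's standard xml.etree.ElementTree on a shortened copy of the Period above. It assumes a namespace-free fragment like the one in this article; a real MPD declares the DASH XML namespace on its root element, so the element lookups would need that namespace added. The function name list_audio_sets is made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Period shown above (no XML namespace, as in the article).
PERIOD_XML = """\
<Period>
  <AdaptationSet contentType="audio" lang="en" audioSamplingRate="48000">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="main"/>
    <Representation id="en-96"  codecs="mp4a.40.2" bandwidth="96000"/>
    <Representation id="en-128" codecs="mp4a.40.2" bandwidth="128000"/>
    <Representation id="en-192" codecs="mp4a.40.2" bandwidth="192000"/>
  </AdaptationSet>
  <AdaptationSet contentType="audio" lang="es" audioSamplingRate="48000">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="alternate"/>
    <Representation id="es-128" codecs="mp4a.40.2" bandwidth="128000"/>
  </AdaptationSet>
  <AdaptationSet contentType="audio" lang="en" audioSamplingRate="48000">
    <Role schemeIdUri="urn:mpeg:dash:role:2011" value="commentary"/>
    <Representation id="comm-96" codecs="mp4a.40.2" bandwidth="96000"/>
  </AdaptationSet>
</Period>
"""


def list_audio_sets(period_xml: str) -> list:
    """Summarize each audio AdaptationSet: language, Role value, bitrates."""
    period = ET.fromstring(period_xml)
    rows = []
    for aset in period.findall("AdaptationSet"):
        if aset.get("contentType") != "audio":
            continue
        role = aset.find("Role")
        rows.append({
            "lang": aset.get("lang"),
            "role": role.get("value") if role is not None else None,
            "bitrates": sorted(int(rep.get("bandwidth"))
                               for rep in aset.findall("Representation")),
        })
    return rows


if __name__ == "__main__":
    for row in list_audio_sets(PERIOD_XML):
        print(row)
```

Each printed row corresponds to one logical audio choice; the list of bitrates inside it is what the player adapts across once that choice is made.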

The mapping between the two formats is close enough to tabulate, which is the fastest way for a non-specialist to read either one:

Job	HLS	DASH
Group one logical audio choice	rendition group (`GROUP-ID`)	`AdaptationSet`
One bitrate version of that audio	a Media Playlist (`URI`)	`Representation`
"Play this if the user has no preference"	`DEFAULT=YES`	`<Role value="main">`
"You may auto-pick this for a system setting"	`AUTOSELECT=YES`	`<Role value="alternate">` + `lang`
Language tag	`LANGUAGE="es"`	`lang="es"`
Channel count	`CHANNELS="6"`	`AudioChannelConfiguration` / separate set
Don't auto-pick; user must ask	`AUTOSELECT=NO`	`<Role value="commentary">`
A timeline segment (e.g. an ad)	discontinuity / separate playlist	a separate `Period`

Table 1. The same audio concepts in HLS and DASH dialects. Source: HLS specification (EXT-X-MEDIA attributes) and ISO/IEC 23009-1:2022 (AdaptationSet, Representation, Role).

Pitfall — surround and stereo in the same AdaptationSet. Mirroring the HLS rule, DASH players are not expected to switch between different channel layouts inside one AdaptationSet, and DASH-IF guidance keeps stereo and 5.1 in separate sets. If you place a stereo and a 5.1 Representation in one AdaptationSet, a player adapting for bandwidth may flip from surround to stereo mid-scene, which sounds like the room suddenly collapsing to the front speakers. Separate the layouts; let the player choose a layout once, then adapt bitrate within it.

CMAF: one set of files under both manifests

So far HLS and DASH look like two separate worlds, and historically they were: a service that wanted both had to package its audio twice, doubling storage and CDN cost. The Common Media Application Format, standardized as ISO/IEC 23000-19, ended that duplication for most modern services.

CMAF is not a third manifest. It is the file format of the media segments that the manifests point at. CMAF defines its media as fragmented MP4 — the same ISO base media file format family covered in audio in containers — built so that one physical set of audio and video segments can be referenced by an HLS playlist and a DASH MPD simultaneously. You encode and store the audio once; you write two small text manifests over it. The bytes on the CDN are shared.

CMAF organizes alternative encodings into a CMAF Switching Set: a set of tracks that can be switched and spliced cleanly at fragment boundaries, which is precisely what an HLS rendition group's tracks or a DASH AdaptationSet's Representations need to do. A switching set's tracks must share encoding and packaging constraints so the decoder never sees a discontinuity it cannot handle when the player swaps bitrates. The unit of switching is the CMAF Fragment; a CMAF Chunk is a smaller piece of a fragment used to cut latency, which matters for low-latency streaming.

One catch the standard is strict about: encryption mode. CMAF Common Encryption offers two schemes: cenc (AES in counter mode) and cbcs (AES in CBC mode with pattern encryption). HLS on Apple platforms requires cbcs. A CMAF presentation must use one scheme throughout; you cannot mix cenc and cbcs in the same presentation. If you want one CMAF library to feed both an HLS player (which needs cbcs) and a legacy DASH player that can only decrypt cenc, you are back to packaging twice, so most services that went all-in on CMAF standardized on cbcs to keep a single encrypted copy. DRM and packaging are covered from the video side in our Video Encoding section; the audio rule is simply: pick cbcs, encrypt once.

[Image: Diagram of a CMAF switching set: audio tracks at 96, 128 and 192 kbit/s aligned at fragment boundaries so the player can switch between them cleanly, with cbcs common encryption applied once across the set]

Figure 2. A CMAF switching set holds bitrate versions of one audio track, aligned so the player can switch at any fragment boundary without a glitch. Encrypt once with cbcs and both HLS and DASH can use the same bytes.

How the player actually decides — the selection algorithm in plain steps

Put the pieces together and the player's decision is a short, predictable sequence. The exact order and the tie-breaking rules vary between players, but the logic is broadly the same on AVPlayer on iOS, ExoPlayer on Android, Shaka Player, hls.js, and dash.js.

First, the player reads the manifest and finds the audio group (HLS) or the audio AdaptationSets (DASH). Second, it checks whether the user or the app has set an explicit audio preference for this session — "play in Spanish", "use the commentary track". If so, and a matching track exists, that wins, full stop. Third, with no explicit choice, the player looks at the device's system language and locale; if a track matches it and is allowed to be auto-selected (AUTOSELECT=YES in HLS, an alternate-or-main Role with matching lang in DASH), it picks that. Fourth, failing a language match, it falls back to the track marked default (DEFAULT=YES / Role=main). Fifth, having chosen which track, it then chooses which bitrate of that track to start with, and adapts up or down from there as the network changes — but it does not change language or channel layout while doing so.
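The track-choosing steps above (explicit choice, then device language, then default) can be sketched as a small function. This is an illustration of the described order, not the algorithm of any particular player: the AudioTrack class and select_audio_track function are hypothetical, language matching is reduced to comparing the primary subtag ("es" in "es-MX"), and the final "take the first track" fallback is an assumption for the case where nothing is marked default.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AudioTrack:
    """One audio choice from the manifest (hypothetical, format-neutral model)."""
    name: str
    language: str
    default: bool = False      # HLS DEFAULT=YES, or DASH Role "main"
    autoselect: bool = False   # HLS AUTOSELECT=YES, or an auto-pickable DASH Role


def primary_subtag(language: str) -> str:
    """Reduce a tag such as 'es-MX' to its primary language, 'es'."""
    return language.split("-")[0].lower()


def select_audio_track(
    tracks: Sequence[AudioTrack],
    user_choice: Optional[str] = None,
    device_language: Optional[str] = None,
) -> Optional[AudioTrack]:
    # Explicit user or app choice wins if a matching track exists.
    if user_choice is not None:
        for track in tracks:
            if track.name == user_choice:
                return track
    # Otherwise match the device language, but only among auto-selectable tracks.
    if device_language is not None:
        wanted = primary_subtag(device_language)
        for track in tracks:
            if track.autoselect and primary_subtag(track.language) == wanted:
                return track
    # Otherwise fall back to the track marked as default.
    for track in tracks:
        if track.default:
            return track
    # Assumption: with no default marked, take the first track, if any.
    return tracks[0] if tracks else None


TRACKS = [
    AudioTrack("English", "en", default=True, autoselect=True),
    AudioTrack("Spanish", "es", autoselect=True),
    AudioTrack("Director Commentary", "en"),
]

if __name__ == "__main__":
    print(select_audio_track(TRACKS, device_language="es-MX").name)
    print(select_audio_track(TRACKS, device_language="fr").name)
```

Bitrate selection is deliberately left out: it happens after this function returns and never changes the chosen language or channel layout.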

Two consequences follow that explain most support tickets. A viewer in Mexico whose phone is set to Spanish gets Spanish audio only if you marked the Spanish track AUTOSELECT=YES and tagged it LANGUAGE="es"; tag it wrong and they get the English default. And a viewer can never reach the 5.1 track if you forgot to give the high-bitrate video variants an AUDIO group that contains a 5.1 rendition — the surround track is in your library but unreachable from the menu.

[Image: Flowchart of the player's audio selection: explicit user choice wins, else match device language if autoselectable, else fall back to the default track, then choose a bitrate and adapt]

Figure 3. The player's decision order: explicit user choice, then device-language match if the track is auto-selectable, then the default track. Language and channel layout are chosen once; only bitrate adapts after that.

A worked example: how much does multi-track cost?

Numbers make the trade-off concrete. Suppose a 90-minute film, delivered in three languages as stereo AAC plus one 5.1 surround track in the original language. Stereo AAC at 128 kbit/s and 5.1 E-AC-3 at 384 kbit/s. The arithmetic for one full copy of each audio track:

Duration            = 90 min = 5,400 seconds
Stereo track size   = 128 kbit/s ÷ 8 × 5,400 s = 16 kB/s × 5,400 s
                    = 86,400 kilobytes ≈ 86.4 MB per language
Three stereo langs  = 3 × 86.4 MB ≈ 259 MB
5.1 surround track  = 384 kbit/s ÷ 8 × 5,400 s ≈ 259 MB
Total audio (one bitrate each) ≈ 259 MB + 259 MB ≈ 518 MB

That 518 MB sits beside a video file that is likely several gigabytes, so audio is a small fraction of storage — but it multiplies with every language and every bitrate rung you add, and on a popular title the CDN egress is paid per gigabyte on every view. With CMAF you pay this storage once and serve both HLS and DASH clients from it; without CMAF you pay it twice. The full storage and CDN model, including when to prune rarely-watched language tracks, is in storage and CDN math for audio.
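The same arithmetic is easy to rerun for other durations, bitrates, or language counts. A minimal sketch, assuming constant-bitrate audio, decimal units (1 MB = 1,000 kB), and no container or manifest overhead; the function name track_size_mb is made up for the example.

```python
def track_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size in MB of one constant-bitrate track (1 MB = 1,000 kB)."""
    kilobytes = bitrate_kbps / 8 * duration_s  # kbit/s -> kB/s, times seconds
    return kilobytes / 1000


DURATION_S = 90 * 60                              # 90-minute film
STEREO_MB = track_size_mb(128, DURATION_S)        # one stereo AAC language
SURROUND_MB = track_size_mb(384, DURATION_S)      # one 5.1 E-AC-3 track
TOTAL_MB = 3 * STEREO_MB + SURROUND_MB            # three languages plus surround

if __name__ == "__main__":
    print(f"stereo per language: {STEREO_MB:.1f} MB")
    print(f"5.1 surround:        {SURROUND_MB:.1f} MB")
    print(f"total audio:         {TOTAL_MB:.1f} MB")
```

Adding a second bitrate rung per language, or a fourth language, is then a one-line change, which makes the multiplication effect described above easy to see.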

Multi-period DASH and the ad-insertion trap

The hardest audio bugs in streaming live at the seams between Periods. Recall that an inserted ad break is usually its own DASH Period. The main content's audio AdaptationSets and the ad's audio AdaptationSets are described independently, and nothing forces them to match. If the main content offers English, Spanish, and commentary, and the ad Period offers only a single unlabeled English track, the player reaching the ad cannot find the Spanish track the viewer was listening to — so it either falls silent, drops to English, or, in the worst implementations, fails to start the ad at all.

The same class of problem appears in HLS across an EXT-X-DISCONTINUITY, the tag that marks a splice point where timing and encoding may change. Audio that was a clean 48 kHz stereo AAC stream before the discontinuity might meet a 44.1 kHz ad on the far side, and the sample-rate change forces the player to re-initialize the audio decoder — a moment where pops, gaps, and lip-sync drift creep in. The fix is discipline at packaging time: every Period and every discontinuity-bounded segment should expose the same audio track structure — same languages, same Roles, same sample rate, same channel layouts — so the player can carry the viewer's choice across the seam without re-deciding. Server-side ad insertion that conditions the ad encode on the main content's audio configuration is the production-grade answer.

Pitfall — the ad break that resets the language. A viewer watching in Spanish hits an ad break, and the ad — authored by a different team or an ad server — only carries English. After the ad, some players keep Spanish, but many reset to the new Period's default and never switch back. To the viewer it looks like your app randomly switched their language. Audit every ad Period for the same language and Role structure as the main content before you blame the player.
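An audit of this kind can be partly automated by comparing each Period's audio structure against the main content. The sketch below is illustrative under several assumptions: the MPD is a made-up, namespace-free fragment (real MPDs declare the DASH namespace on the root element), the first Period is treated as the reference, and a Period's "audio signature" is simplified to the set of (lang, Role value, audioSamplingRate) triples. The names audio_signature and mismatched_periods are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical MPD: the ad Period carries one unlabeled 44.1 kHz English track.
MPD_XML = """\
<MPD>
  <Period id="main-1">
    <AdaptationSet contentType="audio" lang="en" audioSamplingRate="48000">
      <Role schemeIdUri="urn:mpeg:dash:role:2011" value="main"/>
    </AdaptationSet>
    <AdaptationSet contentType="audio" lang="es" audioSamplingRate="48000">
      <Role schemeIdUri="urn:mpeg:dash:role:2011" value="alternate"/>
    </AdaptationSet>
  </Period>
  <Period id="ad-1">
    <AdaptationSet contentType="audio" lang="en" audioSamplingRate="44100"/>
  </Period>
  <Period id="main-2">
    <AdaptationSet contentType="audio" lang="en" audioSamplingRate="48000">
      <Role schemeIdUri="urn:mpeg:dash:role:2011" value="main"/>
    </AdaptationSet>
    <AdaptationSet contentType="audio" lang="es" audioSamplingRate="48000">
      <Role schemeIdUri="urn:mpeg:dash:role:2011" value="alternate"/>
    </AdaptationSet>
  </Period>
</MPD>
"""


def audio_signature(period: ET.Element) -> frozenset:
    """Set of (lang, Role value, sample rate) triples for a Period's audio."""
    signature = set()
    for aset in period.findall("AdaptationSet"):
        if aset.get("contentType") != "audio":
            continue
        role = aset.find("Role")
        signature.add((
            aset.get("lang"),
            role.get("value") if role is not None else None,
            aset.get("audioSamplingRate"),
        ))
    return frozenset(signature)


def mismatched_periods(mpd_xml: str) -> list:
    """Return ids of Periods whose audio structure differs from the first Period."""
    periods = ET.fromstring(mpd_xml).findall("Period")
    if not periods:
        return []
    reference = audio_signature(periods[0])  # assumption: first Period is main content
    return [p.get("id") for p in periods[1:] if audio_signature(p) != reference]


if __name__ == "__main__":
    print(mismatched_periods(MPD_XML))
```

A fuller check would also compare channel layouts and codecs, but even this reduced signature catches the "ad only carries English" case described above.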

Where Fora Soft fits in

We build OTT and live-streaming products where the audio menu is the difference between a product that ships and one that gets one-star reviews about "wrong language" and "no surround." In streaming projects across video on demand, e-learning, and live events, the audio defects we are called in to fix are almost never the audio files — they are the manifest: a missing AUTOSELECT, a surround track with no group reachable from the high-bitrate variants, an ad Period whose AdaptationSets do not mirror the main content. We standardize clients on CMAF with cbcs so one packaged library serves HLS and DASH, and we test the selection logic on real iOS, Android, and TV players rather than trusting a desktop browser, because the players resolve ambiguous manifests differently. Getting the menu right once, and validating it on the devices that matter, prevents the entire category of complaint.

Call to action

Talk to an audio engineer: book a 30-minute scoping call to talk through your HLS, DASH, and CMAF audio plan.
See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Streaming Audio Manifest cheat sheet: a one-page reference mapping HLS rendition groups to DASH AdaptationSets, the DEFAULT/AUTOSELECT/Role selection flags, the separate-group-per-channel-layout rule, the ad-Period mirroring rule, and the CMAF cbcs-encrypt-once rule.

References

HTTP Live Streaming, draft-pantos-hls-rfc8216bis (R. Pantos, Ed., IETF; the active revision of RFC 8216) — defines the EXT-X-MEDIA tag and its TYPE, GROUP-ID, NAME, LANGUAGE, DEFAULT, AUTOSELECT, CHANNELS, and URI attributes, and the EXT-X-STREAM-INF AUDIO attribute that binds a video variant to a rendition group. https://datatracker.ietf.org/doc/html/draft-pantos-hls-rfc8216bis
IETF RFC 8216, HTTP Live Streaming (R. Pantos, W. May; August 2017) — the published HLS specification; the normative source for rendition groups and the master/multivariant playlist before the rfc8216bis revisions. https://www.rfc-editor.org/rfc/rfc8216.html
Apple, HTTP Live Streaming (HLS) Authoring Specification for Apple Devices (current revision, accessed 2026-06-05) — the requirement to put different channel counts in separate audio groups, the cbcs encryption requirement, and the CHANNELS authoring rules. https://developer.apple.com/documentation/http-live-streaming/hls-authoring-specification-for-apple-devices
ISO/IEC 23009-1:2022, Information technology — Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats — defines Period, AdaptationSet, Representation, the Role descriptor scheme (urn:mpeg:dash:role:2011: main, alternate, commentary, dub, description), lang, and audioSamplingRate. https://www.iso.org/standard/83314.html
ISO/IEC 23000-19:2024, Information technology — Multimedia application format (MPEG-A) — Part 19: Common media application format (CMAF) for segmented media — defines the CMAF Track, Switching Set, Fragment and Chunk, and the cenc / cbcs Common Encryption schemes and the single-scheme-per-presentation rule. https://www.iso.org/standard/85623.html
DASH Industry Forum, Guidelines for Implementation: DASH-IF Interoperability Points (current edition) — interoperability guidance that stereo and multichannel audio belong in separate AdaptationSets and that audio AdaptationSets across Periods should be structurally consistent. https://dashif.org/guidelines/
CTA-5001, Web Application Video Ecosystem (WAVE) — Content Specification (CTA) — the deployment profile constraining CMAF for interoperable HLS/DASH delivery, including audio track and switching-set constraints. https://cdn.cta.tech/cta/media/media/resources/standards/pdfs/cta-5001-c_final.pdf
Bitmovin, Delivering premium HLS audio experiences (engineering blog, accessed 2026-06-05) — production deployer's account of HLS audio rendition groups, multichannel grouping, and DEFAULT/AUTOSELECT authoring; used for production context only, with normative claims taken from references 1–5. https://bitmovin.com/blog/premium-hls-audio/

Per §4.3.2, where popular tutorials describe AUTOSELECT and DEFAULT loosely as interchangeable, this article follows the HLS specification (references 1–2): DEFAULT is the no-preference fallback (at most one per group), while AUTOSELECT authorizes a player to choose a track to satisfy a system setting such as device language. The two are distinct and a track can be AUTOSELECT=YES, DEFAULT=NO.

Related glossary terms

HLS

Stereo

CMAF

Bitrate

AAC

Channel

Channel layout

Audio rendition