Published: 2026-06-05 · Reading time: 16 min · Author: Nikolay Sapunov, CEO at Fora Soft

Why this matters

If you stream video or audio, someone on your team has to decide how many versions of each soundtrack to encode, store, and serve — and that decision quietly sets your encoding cost, your storage bill, and the worst-case experience for a viewer on a weak connection. This article is for a product manager, founder, or operations lead with no streaming background who needs to push back on "let's just make five audio bitrates because we do that for video," or to argue for surround fallbacks that the team dismissed as unnecessary. By the end you will be able to do the bandwidth math yourself, name the two situations where an audio ladder earns its keep, and read the published ladders of the major platforms to benchmark your own. Every recommendation traces back to the controlling source — Apple's HLS Authoring Specification, the DASH standard ISO/IEC 23009-1, and Netflix's own published audio pipeline — not to a vendor's marketing page.


First, what a "ladder" actually is

A streaming ladder is the set of different-quality versions of one piece of content that you publish so the player can switch between them. The reader new to streaming should hold one picture: the server does not pick the quality, the player does. The server publishes a menu (covered in detail in "Audio in HLS, DASH, CMAF"), and the player on the phone or TV measures the network, climbing the ladder when bandwidth is plentiful and descending when it is scarce. Each rung is one encoded version at a fixed bitrate.

For video this is obviously useful. The difference between a 360p rung at 700 kbit/s and a 1080p rung at 6,000 kbit/s is enormous, both in how many bits it eats and in how it looks. A viewer in a train tunnel needs the small rung; a viewer on home fibre wants the big one. Video quality keeps improving as you add bits across a wide range, so a video ladder with five or six rungs is normal and correct.
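The player-side choice described above can be sketched in a few lines. This is a minimal illustration, not how any particular player works: the function name `pick_rung`, the 0.8 safety factor, and the two middle rungs (1,500 and 3,000 kbit/s) are assumptions added for the example; only the 700 and 6,000 kbit/s rungs come from the text.

```python
def pick_rung(ladder_kbps, measured_kbps, safety=0.8):
    """Return the highest rung that fits inside a safety fraction of the
    measured throughput; fall back to the lowest rung if none fits.
    The 0.8 safety factor is a hypothetical choice for illustration."""
    budget = measured_kbps * safety
    affordable = [rung for rung in sorted(ladder_kbps) if rung <= budget]
    return affordable[-1] if affordable else min(ladder_kbps)


# 700 and 6,000 kbit/s are the rungs named in the text; 1,500 and 3,000 are
# hypothetical middle rungs added so the ladder has something to climb.
video_ladder = [700, 1500, 3000, 6000]

print(pick_rung(video_ladder, 10_000))  # home fibre: top rung
print(pick_rung(video_ladder, 1_200))   # weak mobile signal: bottom rung
```

Real players also weigh buffer level and recent throughput history; the sketch keeps only the "measure, then pick the highest rung that fits" idea.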

The instinct is then to copy that pattern onto audio: if video has six rungs, give audio six rungs too. That instinct is usually wrong, and the reason is arithmetic.

The arithmetic that kills most audio ladders

Here is the calculation every team should run before building an audio ladder. Take a typical adaptive video stream and look at what fraction of it the audio actually is.

A common 1080p video rung runs at about 6,000 kilobits per second. A stereo AAC soundtrack — the codec name stands for Advanced Audio Coding, and "stereo" means two channels, left and right — runs at about 128 kilobits per second. The audio share is:

audio share = audio bitrate ÷ (video bitrate + audio bitrate)
audio share = 128 ÷ (6,000 + 128)
audio share = 128 ÷ 6,128
audio share = 0.0209 → about 2.1%

So on a full-quality stream, the soundtrack is roughly one fiftieth of the bits. Now suppose the network gets worse and you want to save bandwidth by dropping the audio rung from 128 to 64 kbit/s. You save 64 kbit/s. On that 6,128 kbit/s stream, that is about a one-percent reduction in total: too small for the network to notice, yet large enough that the listener immediately hears the thinner sound. Meanwhile the player has already dropped the video from 6,000 to, say, 1,500 kbit/s, which is where the real bandwidth lives.
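The same arithmetic can be run as a short script against your own ladder. The helper name `audio_share` is an illustrative choice; the bitrates are the ones used in the calculation above.

```python
def audio_share(audio_kbps, video_kbps):
    """Fraction of the total stream bitrate taken by the audio track."""
    return audio_kbps / (video_kbps + audio_kbps)


total_kbps = 6000 + 128

# Audio share on the 1080p rung.
print(f"audio share:  {audio_share(128, 6000):.1%}")        # 2.1%

# Saving from halving the audio rung, as a fraction of the whole stream.
print(f"audio saving: {(128 - 64) / total_kbps:.1%}")       # 1.0%

# Saving from dropping the video rung from 6,000 to 1,500 kbit/s.
print(f"video saving: {(6000 - 1500) / total_kbps:.1%}")    # 73.4%
```

Swapping in your own top video rung and audio bitrate shows quickly whether audio adaptation could ever move the total by more than a percent or two.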

The lesson is blunt: when audio is a small passenger riding on a large video stream, adapting the audio buys you almost no bandwidth and costs you audible quality. You are better off picking one good audio rendition and leaving it fixed while the video ladder does the adapting. This is why the field's quiet default for ordinary stereo video is a single audio rendition, not a ladder.

[Figure: bar chart contrasting a six-rung video bitrate ladder from 360p to 1080p against a single flat stereo audio rendition at 128 kbit/s, showing audio as a thin two-percent sliver on the top video rung.]

Figure 1. On an ordinary stereo video stream the soundtrack is about 2% of the bits on the top rung. Dropping the audio rung saves almost nothing; the video ladder is where the bandwidth lives.

Why audio quality plateaus where video keeps climbing

There is a second reason audio needs fewer rungs, and it is about the ear, not the network. Video quality keeps improving as you add bits across a very wide range — the eye can tell 480p from 720p from 1080p from 4K. Audio quality, by contrast, plateaus early. A modern codec reaches "I cannot tell this from the master" — what engineers call perceptually transparent — at a modest bitrate, and adding more bits past that point changes nothing a listener can hear.

For stereo AAC that plateau sits in the rough neighbourhood of 128 to 256 kbit/s; below it quality drops fast, and above it you are spending bits no ear can hear. So even when you do build an audio ladder, it has two to four rungs, not six, because there is simply no audible territory above the plateau to put a higher rung on. We cover where each codec's plateau sits in "How to choose an audio codec in 2026".

The two cases where an audio ladder earns its keep

If the default is "one rendition," when do you actually need more than one? Two situations, and they are the whole answer.

Case 1 — surround and immersive audio

The arithmetic above assumed a 128 kbit/s stereo track. Replace it with surround sound and the numbers change. Surround means more than two channels. The common layout is 5.1: five full channels plus one low-frequency effects channel. A high-quality 5.1 track in Dolby Digital Plus can run at 640 kbit/s, and an immersive Dolby Atmos track higher still. Now redo the share on a mid-ladder 1,500 kbit/s video rung:

audio share = 640 ÷ (1,500 + 640)
audio share = 640 ÷ 2,140
audio share = 0.299 → about 30%

At 30% of the stream, the audio is no longer a passenger — it is a meaningful chunk of the bandwidth, and a viewer who cannot afford the full video rung very likely cannot afford a 640 kbit/s soundtrack either. This is exactly where a ladder pays off: offer the 640 kbit/s surround track to viewers who can take it, a 192 kbit/s surround track to those who cannot, and a stereo AAC track as the floor for the weakest connections and the devices that cannot decode surround at all. The player climbs that short audio ladder in step with the network.
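A minimal sketch of how a player might walk that short ladder follows. It rests on stated assumptions: the 25% cap on the audio share (`max_share`) is a hypothetical tuning knob, the rendition names are made up, and the 128 kbit/s stereo floor is taken from the earlier stereo arithmetic. The 640 and 192 kbit/s surround rungs are the ones named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRendition:
    name: str
    kbps: int
    surround: bool


# Surround rungs from the text; the stereo floor bitrate is an assumption.
LADDER = [
    AudioRendition("DD+ 5.1 high", 640, True),
    AudioRendition("DD+ 5.1 low", 192, True),
    AudioRendition("AAC stereo floor", 128, False),
]


def pick_audio(measured_kbps, device_decodes_surround, max_share=0.25):
    """Pick the highest-bitrate rendition the device can decode whose bitrate
    stays within max_share of the measured throughput (hypothetical rule).
    Falls back to the lowest decodable rendition when nothing fits."""
    budget = measured_kbps * max_share
    candidates = [r for r in LADDER if device_decodes_surround or not r.surround]
    for rendition in sorted(candidates, key=lambda r: r.kbps, reverse=True):
        if rendition.kbps <= budget:
            return rendition
    return min(candidates, key=lambda r: r.kbps)


print(pick_audio(8000, True).name)    # DD+ 5.1 high
print(pick_audio(2140, True).name)    # DD+ 5.1 low
print(pick_audio(600, True).name)     # AAC stereo floor
print(pick_audio(8000, False).name)   # AAC stereo floor (no surround decoder)
```

The capability filter runs before the bandwidth check on purpose: a device that cannot decode surround must land on the stereo floor no matter how fast its connection is.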

Case 2 — audio-only services

The second case is simpler. When there is no video — music streaming, podcasts, audiobooks, radio — the audio is the entire stream. There is no large video passenger to hide behind, so every bit of audio adaptation is a bit of real bandwidth saved. A music service on a phone that drops from Wi-Fi to a weak cellular signal genuinely needs to fall from a 256 kbit/s rung to a 96 kbit/s rung to keep playing without stalls. Here a two-to-four-rung audio ladder is not optional; it is the core of the product.
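To see how much an audio-only rung change is worth, it helps to convert bitrates into data consumed per hour of listening. The sketch below uses the 256 and 96 kbit/s rungs from the example above and decimal megabytes (1 MB = 8,000 kbit); the helper name is an illustrative choice.

```python
def megabytes_per_hour(kbps):
    """Data consumed by one hour of playback at a constant bitrate,
    in decimal megabytes (1 MB = 8,000 kbit)."""
    return kbps * 3600 / 8000


high, low = 256, 96

print(f"{high} kbit/s: {megabytes_per_hour(high):.1f} MB/hour")  # 115.2
print(f"{low} kbit/s: {megabytes_per_hour(low):.1f} MB/hour")    # 43.2
print(f"saving: {1 - low / high:.1%}")                           # 62.5%
```

On an audio-only service the drop cuts the whole stream by well over half, in contrast to the roughly one percent the same kind of move saves on a stereo video stream.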

[Figure: decision tree starting from "is there video?", branching through surround-vs-stereo and audio-only, ending in single-rendition, short-surround-ladder, and audio-only-ladder outcomes.]

Figure 2. A decision rule you can apply in an afternoon. Stereo-on-video needs one rendition; surround and audio-only earn a short two-to-four-rung ladder.
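The decision rule is small enough to write down as code. This is one possible reading of the tree, with function and outcome names chosen for illustration:

```python
def audio_ladder_plan(has_video, has_surround):
    """Map the two questions in the decision rule to one of three outcomes."""
    if not has_video:
        # Case 2: audio is the entire stream.
        return "audio-only ladder (2-4 rungs)"
    if has_surround:
        # Case 1: surround is a meaningful share of the stream.
        return "short surround ladder + stereo floor"
    # Default: stereo riding on a large video stream.
    return "single stereo rendition"


print(audio_ladder_plan(has_video=True, has_surround=False))
print(audio_ladder_plan(has_video=True, has_surround=True))
print(audio_ladder_plan(has_video=False, has_surround=False))
```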

What the real platforms ship in 2026

The best way to calibrate your own choice is to look at what the large platforms publish. None of them runs a six-rung audio ladder for ordinary stereo; the patterns below match the arithmetic above.

Netflix

Netflix's published audio pipeline is the clearest worked example of Case 1. For 5.1 surround the service uses a Dolby Digital Plus ladder that runs from 192 kbit/s up to 640 kbit/s, and it adapts across those rungs the same way it adapts video — delivering the best the device and the connection can take. Netflix's engineers determined, from internal and Dolby-supplied listening tests, that Dolby Digital Plus at 640 kbit/s and above is perceptually transparent: the 640 kbit/s top rung is a 10:1 compression of a 24-bit 5.1 studio master, and no listener in the tests could distinguish it. For Dolby Atmos, the immersive format, Netflix raised the top rung to 768 kbit/s. On devices that cannot decode Dolby at all, it falls back to stereo AAC or legacy AC-3. That is a textbook Case-1 ladder: a few surround rungs plus a stereo floor, and nothing more.
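The 10:1 figure can be sanity-checked with a few lines of arithmetic. The sketch assumes a 48 kHz sample rate, which is not stated above, and counts the low-frequency effects channel as a full channel, so treat the result as a rough check and not as Netflix's own calculation.

```python
def pcm_kbps(bit_depth, sample_rate_hz, channels):
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return bit_depth * sample_rate_hz * channels / 1000


# 24-bit depth and 5.1 (six channels) come from the text;
# the 48 kHz sample rate is an assumption.
master = pcm_kbps(bit_depth=24, sample_rate_hz=48_000, channels=6)

print(f"master: {master:.0f} kbit/s")                    # 6912
print(f"ratio at 640 kbit/s: {master / 640:.1f}:1")      # 10.8:1
print(f"ratio at 192 kbit/s: {master / 192:.1f}:1")      # 36.0:1
```

Under those assumptions the top rung lands close to the quoted 10:1, while the 192 kbit/s bottom rung compresses the same master roughly 36:1.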

Apple (HLS Authoring Specification)

Apple does not publish a ladder; it publishes bounds, and the bounds tell you how few rungs are sensible. The HLS Authoring Specification recommends stereo AAC between 32 and 160 kbit/s, a narrow band that confirms the plateau argument, because there is no point authoring rungs above 160 kbit/s for stereo. For 5.1 surround Apple's guidance is explicit that AAC is not the right tool: use Dolby Digital at 384 kbit/s or Dolby Digital Plus at 192 kbit/s. The spec also requires that stereo and surround live in separate rendition groups, which we explain in "Audio in HLS, DASH, CMAF", so an Apple-compliant package naturally ends up with at most a couple of audio renditions per layout, never a tall ladder.
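To make the separate-rendition-group rule concrete, here is a small sketch that prints a multivariant playlist with one stereo group and one surround group. The tags and attributes are standard HLS, but the group names, URIs, the H.264 codec string, and the BANDWIDTH values (nominal video plus audio bitrate, with no peak headroom) are hypothetical values for illustration. It is not a validated Apple-compliant package.

```python
def media_line(group, name, channels, uri, default=True):
    """One EXT-X-MEDIA line describing an audio rendition in a group."""
    return (
        f'#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="{group}",NAME="{name}",'
        f'LANGUAGE="en",DEFAULT={"YES" if default else "NO"},AUTOSELECT=YES,'
        f'CHANNELS="{channels}",URI="{uri}"'
    )


def variant_lines(bandwidth_bps, codecs, group, uri):
    """One video variant bound to an audio group (tag line plus URI line)."""
    return [
        f'#EXT-X-STREAM-INF:BANDWIDTH={bandwidth_bps},CODECS="{codecs}",'
        f'RESOLUTION=1920x1080,AUDIO="{group}"',
        uri,
    ]


def build_playlist():
    lines = ["#EXTM3U"]
    # Stereo AAC and 5.1 Dolby Digital Plus live in separate groups.
    lines.append(media_line("aud-stereo", "English", 2, "audio/aac_128.m3u8"))
    lines.append(media_line("aud-surround", "English", 6, "audio/ec3_192.m3u8"))
    # The same video rung is listed once per audio group.
    lines += variant_lines(6_128_000, "avc1.640028,mp4a.40.2",
                           "aud-stereo", "video/1080p.m3u8")
    lines += variant_lines(6_192_000, "avc1.640028,ec-3",
                           "aud-surround", "video/1080p.m3u8")
    return "\n".join(lines)


print(build_playlist())
```

Each group here holds a single rendition, so the player chooses between the stereo and surround variants and never faces a tall audio ladder.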

YouTube

YouTube Music shows the Case-2 pattern. It offers three quality tiers with an upper bound of 48 kbit/s on Low, 128 kbit/s on Normal, and 256 kbit/s on High, delivering AAC on the mobile apps and Opus on the web player (Opus is the open codec covered in "Opus: the open codec that ate WebRTC"). That is a three-rung audio-only ladder, exactly the two-to-four-rung shape the arithmetic predicts when audio is the entire stream.

| Platform | Content type | Audio rungs | Codec(s) | Pattern |
| --- | --- | --- | --- | --- |
| Netflix | Surround video | 192 → 640 kbit/s (5.1), 768 kbit/s (Atmos), stereo floor | DD+, AC-3, AAC | Case 1: short surround ladder + stereo floor |
| Apple HLS | Stereo video | one rendition, 32–160 kbit/s | AAC-LC | Default: single rendition |
| Apple HLS | Surround video | DD 384 kbit/s or DD+ 192 kbit/s | AC-3, E-AC-3 | Case 1: separate surround group |
| YouTube Music | Audio-only | 48 / 128 / 256 kbit/s | AAC, Opus | Case 2: audio-only ladder |

Table 1. Real published audio renditions in 2026. Note that no platform runs a six-rung ladder for ordinary stereo. Sources: Netflix Tech Blog, Apple HLS Authoring Specification, YouTube Music Help.

When stereo on video does need a second rung

The single-rendition default has one important exception worth naming, because it is the case engineers cite when they push back. The arithmetic that makes audio a "small passenger" assumed the audio bitrate is small relative to the smallest video rung. On a ladder that goes all the way down to a 200 kbit/s video rung for the weakest connections, a 128 kbit/s stereo track is no longer 2% of the stream — it is nearly 40% of it, and now adapting the audio matters. Peer-reviewed work on adaptive streaming with separate audio and video tracks makes exactly this point: when a high-quality audio track approaches the bitrate of the lowest video rungs, audio rate adaptation becomes as important as video adaptation for the viewer's experience.

The practical rule: if your lowest video rung is below roughly 600 kbit/s — common for services targeting 2G/3G regions or very constrained devices — add a second, lower audio rendition (say 64 kbit/s alongside your 128 kbit/s default) so the player can shed audio bits when it is scraping the bottom of the ladder. If your lowest video rung is comfortably above 1 Mbit/s, do not bother; one stereo rendition is right.
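The rule of thumb can be sketched as a small helper. The 600 kbit/s and 1 Mbit/s thresholds come from the rule above; treating the range between them as a judgment call is an interpretation added here, and the function name is illustrative.

```python
def second_stereo_rung_advice(lowest_video_kbps, audio_kbps=128):
    """Return the audio share on the lowest video rung and a recommendation."""
    share = audio_kbps / (lowest_video_kbps + audio_kbps)
    if lowest_video_kbps < 600:
        advice = "add a lower audio rendition"
    elif lowest_video_kbps > 1000:
        advice = "keep one stereo rendition"
    else:
        # The rule does not cover 600-1,000 kbit/s; flag it for a human.
        advice = "judgment call"
    return share, advice


for lowest in (200, 800, 1500):
    share, advice = second_stereo_rung_advice(lowest)
    print(f"lowest rung {lowest} kbit/s: audio share {share:.1%}, {advice}")
```

For the 200 kbit/s rung the share comes out at 39.0%, matching the "nearly 40%" figure above.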

Pitfall: copying the video rung count onto audio. The most common and most expensive audio-ladder mistake is "we have six video rungs, so make six audio rungs." Six stereo AAC rungs multiply your audio encoding and storage roughly sixfold to save bandwidth no network will notice and to add quality no ear will hear. Worse, every extra audio rendition multiplies your QC and packaging surface in every language you ship, a problem we quantify in "Storage and CDN math for audio". Pick the rung count from the two-case rule, not from the video ladder.
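A rough storage estimate makes the cost visible. Everything in the example scenario is hypothetical: the six-rung ladder bitrates, the 1,000-hour catalogue, and the five languages are placeholders to replace with your own numbers. The estimate uses decimal gigabytes and ignores container overhead.

```python
def audio_storage_gb(rungs_kbps, hours, languages=1):
    """Approximate audio storage in decimal GB for a catalogue:
    sum of rung bitrates x duration x number of languages."""
    total_kbit = sum(rungs_kbps) * hours * 3600
    return total_kbit / 8 / 1_000_000 * languages  # kbit -> kB -> GB


single = [128]
six_rungs = [32, 64, 96, 128, 160, 192]  # hypothetical copied-from-video ladder

one = audio_storage_gb(single, hours=1000, languages=5)
six = audio_storage_gb(six_rungs, hours=1000, languages=5)

print(f"single rendition: {one:.0f} GB")      # 288 GB
print(f"six rungs:        {six:.0f} GB")      # 1512 GB
print(f"multiplier:       {six / one:.2f}x")  # 5.25x
```

The number of encodes and QC passes scales with the rung count itself (six times here), while storage scales with the sum of the rung bitrates, which is why the multiplier for this particular ladder is 5.25 and not exactly six.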

Where Fora Soft fits in

Across the video streaming, OTT, e-learning, and telemedicine products we have built since 2005, the audio-ladder question comes up the same way every time: an engineer reaches for the video pattern and proposes more audio rungs than the catalogue needs. For a stereo-only e-learning or telemedicine service we standardise on a single well-chosen rendition and spend the saved encoding budget on getting loudness and language switching right instead. For an OTT product shipping surround we build the short Case-1 ladder — a surround rung or two plus a stereo floor — and validate the fallback on real living-room hardware, because the device that cannot decode Dolby is the one that exposes a missing floor. The decision is always the arithmetic in this article, run against the specific bitrate ladder of the project.

References

  1. Apple, HTTP Live Streaming (HLS) Authoring Specification for Apple Devices (current revision, accessed 2026-06-05). Controlling source for the stereo-AAC 32–160 kbit/s recommendation, the 5.1 guidance (Dolby Digital 384 kbit/s / Dolby Digital Plus 192 kbit/s), and the separate-rendition-group rule. https://developer.apple.com/documentation/http-live-streaming/hls-authoring-specification-for-apple-devices — Tier 1 (vendor specification; controlling for HLS authoring).
  2. ISO/IEC 23009-1, Information technology — Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats (current edition). Defines the AdaptationSet/Representation model in which each audio rung is a Representation. https://www.iso.org/standard/83314.html — Tier 1 (official standard; full normative text paywalled, cited from catalogue summary).
  3. ISO/IEC 23000-19, Common Media Application Format (CMAF) for segmented media (current edition). The shared segment format that lets one audio ladder feed both HLS and DASH. https://www.iso.org/standard/85623.html — Tier 1 (official standard; paywalled, cited from catalogue summary).
  4. Netflix Technology Blog, "Engineering a Studio Quality Experience With High-Quality Audio at Netflix" (2019, accessed 2026-06-05). Primary source for Netflix's 192–640 kbit/s 5.1 Dolby Digital Plus ladder, the 640 kbit/s perceptual-transparency threshold (10:1 vs a 24-bit master), the 768 kbit/s Atmos top rung, and the stereo-AAC/AC-3 fallback. https://netflixtechblog.com/engineering-a-studio-quality-experience-with-high-quality-audio-at-netflix-eaa0b6145f32 — Tier 4 (first-party engineering blog of the deployer).
  5. ETSI TS 102 366, Digital Audio Compression (AC-3, Enhanced AC-3) Standard (current revision). Controlling source for Dolby Digital and Dolby Digital Plus, the codecs in Netflix's and Apple's surround ladders. https://www.etsi.org/deliver/etsi_ts/102300_102399/102366/ — Tier 1 (official standard).
  6. ISO/IEC 14496-3, Information technology — Coding of audio-visual objects — Part 3: Audio (2019 edition). Controlling source for AAC, the stereo codec in every ladder above. https://www.iso.org/standard/76383.html — Tier 1 (official standard; paywalled, cited from catalogue summary).
  7. YouTube Music Help, "Change your audio quality" and the YouTube music-video encoding specifications (accessed 2026-06-05). Source for the 48/128/256 kbit/s tiers and the AAC-on-mobile / Opus-on-web split. https://support.google.com/youtubemusic/answer/9076559 — Tier 4 (first-party platform documentation).
  8. Y. Qin et al., "ABR Streaming with Separate Audio and Video Tracks" (academic paper, University of Connecticut, accessed 2026-06-05). Peer-reviewed source for the claim that audio rate adaptation matters once a high-quality audio track approaches the bitrate of the lowest video rungs. https://nlab.engr.uconn.edu/papers/Qin19-demuxed.pdf — Tier 5 (peer-reviewed academic publication).