Published: 2026-06-06 · Reading time: 18 min read · Author: Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you stream live sport, auctions, betting, game shows, trading floors, or any event where viewers compare notes in real time — over a phone, a browser, or a smart TV — latency is a product feature, and your audio either keeps up with the picture or it does not. This article is for a product manager, founder, or operations lead with no streaming background who needs to understand why their "low-latency" stream still drifts out of sync, why the audio sometimes clicks or drops at the start of a buffer, and how to talk to engineers about the fix. By the end you will know what a part and a chunk are, why audio frame sizes fight with sub-second segmenting, and which packaging settings decide whether your sound arrives on time. Every mechanism traces back to the controlling specification — the HLS draft RFC, ISO/IEC 23009-1 for DASH, ISO/IEC 23000-19 for CMAF, and the codec standards — not to a vendor blog.
The one idea: small pieces, delivered early
Start with the mental model the rest of the article hangs on. Ordinary internet streaming has high latency for a boring reason: the player will not start a media segment until the segment is fully written on the server. If your segments are six seconds long, the newest thing a viewer can see is already at least six seconds old before the download even begins, and a safety buffer of two or three segments pushes the real number to twenty or thirty seconds. That is fine for a movie. It is unacceptable for a goal you can hear your neighbours cheering through the wall before your screen shows it.
Low-latency streaming fixes this with one move: cut each segment into much smaller pieces and let the player fetch a piece the instant the encoder produces it, while the rest of the segment is still being written. Apple's Low-Latency HLS calls these pieces parts; chunked CMAF, the format that powers Low-Latency DASH, calls them chunks. Both are roughly 200 to 500 milliseconds long. Instead of waiting six seconds for a whole segment, the player gets fresh media a couple of times a second, and glass-to-glass latency drops from the twenty-to-thirty-second floor of plain HLS or DASH to a two-to-five-second range competitive with cable television.
That single idea — small pieces, delivered early — is the whole game. Everything else in this article is a consequence of trying to make audio fit into pieces that small.
Figure 1. Standard streaming waits for the whole segment; low-latency streaming delivers sub-second parts as the encoder makes them. The audio track is split on the same clock as the video.
Why audio is the awkward passenger
Video and audio are not encoded the same way, and the difference is exactly what makes low-latency audio tricky.
Video encoders are flexible about where a frame lands. The encoder can place an independent frame — a full picture the player can decode without any earlier frames — wherever the packager asks, and parts are cut at those points. Audio has no such freedom. Audio codecs encode in fixed-size frames, and a frame is the smallest unit you can decode. You cannot cut an audio frame in half; you must ship whole frames.
The number that matters is the frame's duration, and it is awkward by design. The most common streaming codec, AAC-LC, encodes 1,024 audio samples per frame. At 48,000 samples per second, the audio sample rate that video production standardises on, that is:
1,024 samples ÷ 48,000 samples per second = 0.021333… seconds
= 21.33 milliseconds per AAC frame
So one AAC frame is about 21.3 milliseconds long. Opus, the codec WebRTC uses and an increasingly common streaming choice, defaults to 20-millisecond frames. The AAC figure does not divide evenly into a 200-, 300-, or 500-millisecond part (the Opus figure happens to, which makes its grid easier to live with). A 300-millisecond part holds 300 ÷ 21.33 ≈ 14.06 AAC frames, which is not a whole number. The packager cannot ship 14.06 frames, so it ships either 14 frames (about 298.7 ms of audio) or 15 frames (320 ms), and the audio part is now slightly out of step with the video part it is supposed to accompany.
This is the root of low-latency audio's reputation for being fiddly. The video timeline is cut on neat sub-second boundaries; the audio timeline is cut on a 21.33-millisecond grid that never lines up with those boundaries. Somebody — the packager — has to reconcile the two, and how it does so determines whether your audio stays in sync.
Figure 2. AAC frames are 21.33 ms each; a 300 ms part holds 14.06 of them. The packager rounds to whole frames, so audio and video part boundaries drift unless the segment duration is chosen to make audio frames land evenly.
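The rounding described above can be checked with a few lines of Python. This is a minimal sketch, not a packager: the function names are invented for illustration, and the defaults assume AAC-LC at 48 kHz as in the text.

```python
from fractions import Fraction
from math import ceil, floor


def frames_in_part(part_ms, frame_samples=1024, sample_rate=48_000) -> Fraction:
    """Exact (possibly fractional) number of audio frames in a part."""
    frame_ms = Fraction(frame_samples * 1000, sample_rate)  # 64/3 ms for AAC-LC
    return Fraction(part_ms) / frame_ms


def whole_frame_durations_ms(part_ms, frame_samples=1024, sample_rate=48_000):
    """Audio durations (ms) if the packager rounds down or up to whole frames."""
    frame_ms = Fraction(frame_samples * 1000, sample_rate)
    exact = frames_in_part(part_ms, frame_samples, sample_rate)
    return float(floor(exact) * frame_ms), float(ceil(exact) * frame_ms)
```

For a 300 ms part, `frames_in_part(300)` is 14.0625 frames, and the two whole-frame options come out at roughly 298.67 ms and exactly 320 ms, matching the figures in the text.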
How LL-HLS carries audio: parts and preload hints
Apple announced Low-Latency HLS in 2019 and folded it into the HLS specification in 2020; it is now tracked in the IETF draft draft-pantos-hls-rfc8216bis. The extension adds a handful of tags to the media playlist, and the same tags apply to audio renditions as to video.
The core new tag is EXT-X-PART, which identifies a partial segment. Where ordinary HLS lists whole segments with EXTINF, LL-HLS additionally lists the parts that make up the segment currently being produced, each with its own DURATION and URI. An audio rendition playlist gets its own EXT-X-PART entries, on the same cadence as the video, so the player can fetch audio parts as quickly as video parts.
Two more tags make the cadence work. EXT-X-PRELOAD-HINT sits at the live end of the playlist, after the newest listed part, and tells the player the URL of the next part before that part exists, so the player can issue the request and have it waiting; the server holds the request open and releases the bytes "at line speed" the instant the part is ready. EXT-X-SERVER-CONTROL advertises that the server supports these blocking behaviours and declares the PART-HOLD-BACK, the distance from the live edge at which the player should stay, typically three times the part duration. Together they replace the old, wasteful pattern where the player repeatedly polled the playlist hoping new media had appeared.
For audio specifically, one attribute earns its keep: INDEPENDENT. On a video part, INDEPENDENT=YES means the part begins with an independent frame the player can start decoding from. Audio frames are essentially all independently decodable, but the attribute still matters because the player uses it to decide where it can join a stream or switch renditions cleanly. When you switch audio language or bitrate mid-stream, the player looks for a safe entry point, and correct INDEPENDENT tagging keeps that switch from producing a gap or a glitch.
Here is the shape of a low-latency audio rendition playlist (trimmed):
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:4
#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.0,CAN-SKIP-UNTIL=24.0
#EXT-X-PART-INF:PART-TARGET=0.33334
#EXT-X-MEDIA-SEQUENCE:200
# ...full segments above...
#EXTINF:4.00000,
seg200.mp4
# parts of the segment currently being produced:
#EXT-X-PART:DURATION=0.33334,URI="seg201.0.mp4",INDEPENDENT=YES
#EXT-X-PART:DURATION=0.33334,URI="seg201.1.mp4"
#EXT-X-PART:DURATION=0.33334,URI="seg201.2.mp4"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg201.3.mp4"
Notice the part target is 0.33334 seconds — 333 milliseconds. In AAC frames that is 333.34 ÷ 21.33 ≈ 15.6 frames, which is still not whole. Apple's own guidance is to choose a part duration that contains a whole number of audio frames, or to accept that part durations will vary slightly so each part holds an integer count of frames. That variation is legal — the DURATION on each EXT-X-PART is exactly what the part holds — but it means your audio parts and video parts will not share identical durations, and your tooling has to be comfortable with that.
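To make the playlist above concrete, here is a rough sketch of how tooling might read the part entries and the preload hint from an audio rendition playlist. It is not a full HLS parser: `parse_attrs` and `summarize_parts` are hypothetical helper names, and the attribute regex only handles the simple quoted and unquoted values seen in the sample.

```python
import re

# NAME=VALUE pairs; a quoted value may contain commas, an unquoted one may not.
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attrs(line: str) -> dict[str, str]:
    """Split the attribute list that follows the tag name and its colon."""
    _, _, attr_text = line.partition(":")
    return {name: value.strip('"') for name, value in ATTR_RE.findall(attr_text)}


def summarize_parts(playlist: str) -> dict:
    """Collect (uri, duration, independent) per part, plus the hinted next part."""
    parts, hint = [], None
    for raw in playlist.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-PART:"):  # does not match EXT-X-PART-INF
            attrs = parse_attrs(line)
            parts.append((attrs["URI"], float(attrs["DURATION"]),
                          attrs.get("INDEPENDENT") == "YES"))
        elif line.startswith("#EXT-X-PRELOAD-HINT:"):
            hint = parse_attrs(line).get("URI")
    return {
        "parts": parts,
        "next_part_hint": hint,
        "listed_seconds": sum(duration for _, duration, _ in parts),
    }
```

Because each part carries its own DURATION, summing the listed values rather than multiplying the part target by the part count is what keeps tooling correct when audio part durations vary by a frame.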
How LL-DASH carries audio: chunked CMAF
MPEG-DASH reaches low latency by a different route that arrives at the same place. Instead of inventing a new "part" concept in the manifest, Low-Latency DASH keeps the ordinary segment but changes how the segment is built and delivered. The segment is encoded as a sequence of CMAF chunks — each chunk is a moof box (the metadata header) plus an mdat box (the media payload) carrying a couple of hundred milliseconds of media — and the server publishes each chunk over HTTP/1.1 chunked transfer encoding the instant the encoder produces it. The player requests the segment with an open-ended request and receives the chunks as a stream, playing each one as it arrives rather than waiting for the segment to close.
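The moof-plus-mdat structure can be illustrated with a small box walker. This is a sketch under stated assumptions: it relies on the ISO base media file format box header (a 32-bit big-endian size followed by a four-character type, with size 1 meaning a 64-bit size follows and size 0 meaning the box runs to the end of the data). The helper names are hypothetical, and the test data is synthetic rather than real media.

```python
import struct


def iter_boxes(data: bytes):
    """Yield (type, payload) for each complete top-level box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:  # 64-bit "largesize" follows the type field
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:  # box extends to the end of the data
            size = len(data) - offset
        if size < header or offset + size > len(data):
            break  # incomplete box: more bytes are still in flight
        yield box_type.decode("ascii"), data[offset + header:offset + size]
        offset += size


def split_chunks(data: bytes):
    """Pair each moof with the mdat that follows it; other boxes are skipped."""
    chunks, pending_moof = [], None
    for box_type, payload in iter_boxes(data):
        if box_type == "moof":
            pending_moof = payload
        elif box_type == "mdat" and pending_moof is not None:
            chunks.append((pending_moof, payload))
            pending_moof = None
    return chunks


def make_box(box_type: str, payload: bytes) -> bytes:
    """Build a synthetic box for testing (not real media)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload
```

A player-side reader built this way can hand each completed chunk to the decoder while the trailing, still-incomplete box waits for more bytes, which is the behaviour the paragraph above describes.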
The manifest's job is to tell the player that the chunks are available early. The attribute that does this is availabilityTimeOffset (ATO), a decimal number of seconds that says how far before a segment's nominal availability time its first bytes can already be fetched. If a segment nominally becomes available at the four-second mark but its first chunk is ready 3.6 seconds early, you set availabilityTimeOffset="3.6", and a low-latency player knows to start pulling the segment almost immediately.
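The arithmetic behind that example can be sketched as follows. The rule used here, offset equals segment duration minus chunk duration, is an assumption that matches the 4-second, 3.6-second example above (which implies a 0.4-second chunk); the function names are illustrative and not part of any DASH library.

```python
def availability_time_offset(segment_seconds: float, chunk_seconds: float) -> float:
    """Assumed rule: the first chunk is ready one chunk duration after the
    segment starts, so the segment is fetchable this many seconds early."""
    return round(segment_seconds - chunk_seconds, 3)


def earliest_request_time(nominal_availability: float, ato: float) -> float:
    """Time (seconds on the same clock) at which a low-latency player may
    start requesting the segment."""
    return nominal_availability - ato
```

With a 4-second segment and 0.4-second chunks, `availability_time_offset(4.0, 0.4)` gives 3.6, and a segment nominally available at the 4-second mark can be requested from about the 0.4-second mark.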
Audio gets its own treatment here, and this is where LL-DASH is honest about audio being the awkward passenger. The DASH-IF low-latency guidance explicitly allows the audio AdaptationSet to run with a different availability schedule from video — for example, video may set availabilityTimeOffset aggressively while audio sets it more conservatively or not at all. The reason is the frame-grid problem from earlier: audio cannot always produce a clean chunk on the same sub-second boundary as video, so forcing audio onto the video's aggressive chunk schedule risks shipping incomplete or misaligned audio. Running audio on a slightly looser schedule trades a few tens of milliseconds of audio latency for reliable, gap-free sound — a trade almost always worth making, because viewers forgive audio that is a hair behind far more readily than audio that clicks or drops.
Figure 3. In LL-DASH the segment is built from CMAF chunks (moof + mdat) and streamed as each chunk lands. Audio can run on its own availabilityTimeOffset so its coarser frame grid does not force gaps.
CMAF-LL is the shared floor under both
The reason LL-HLS and LL-DASH can be discussed together is that both sit on top of the same media format: CMAF, the Common Media Application Format, standardised as ISO/IEC 23000-19. CMAF defines the fragmented-MP4 boxes that hold the media, and the "low-latency" addition is the CMAF chunk — a moof+mdat pair smaller than a full fragment. One physical library of CMAF chunks can be addressed by an LL-HLS playlist (as parts, via byte-range requests) and by an LL-DASH manifest (as chunked segments) at the same time.
This convergence got much stronger when Apple removed the original LL-HLS requirement for HTTP/2 PUSH and replaced it with the preload-hint-plus-blocking pattern. After that change, an LL-HLS client can issue an open-ended byte-range request against the start of a segment — "give me byte 0 onward and keep sending as more arrives" — which behaves almost exactly like the chunked-transfer stream an LL-DASH client expects. The practical result for audio is that you encode and package your audio once, as CMAF chunks, and both delivery formats can serve it. You do not maintain two separate audio pipelines.
The one rule that makes this work is alignment: the audio chunks must use boundaries that both formats can address. Choose a segment duration that holds a whole number of audio frames, keep the chunk duration consistent, and the same audio chunks serve LL-HLS parts and LL-DASH segments without re-packaging.
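One way to check the alignment rule is to compute how many audio frames a candidate segment duration holds, using exact fractions so rounding cannot hide a misfit. This is a minimal sketch with hypothetical function names; the defaults assume AAC-LC (1,024 samples per frame) at 48 kHz.

```python
from fractions import Fraction
from math import gcd


def frames_per_segment(segment_seconds, frame_samples=1024, sample_rate=48_000) -> Fraction:
    """Exact frame count for a segment; pass an int or Fraction, not a float."""
    return Fraction(segment_seconds) * sample_rate / frame_samples


def is_frame_aligned(segment_seconds, frame_samples=1024, sample_rate=48_000) -> bool:
    """True when the segment holds a whole number of audio frames."""
    return frames_per_segment(segment_seconds, frame_samples, sample_rate).denominator == 1


def smallest_aligned_whole_seconds(frame_samples=1024, sample_rate=48_000) -> int:
    """Shortest whole-second segment duration that holds whole frames."""
    return frame_samples // gcd(frame_samples, sample_rate)
```

For AAC-LC at 48 kHz the shortest whole-second duration that aligns is 8 seconds (375 frames); 4 and 6 seconds hold 187.5 and 281.25 frames and do not align. Non-integer durations built from whole frames, such as `Fraction(1024 * 15, 48_000)` seconds, also pass the check.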
Audio is the latency floor nobody measures
Here is the part most low-latency projects discover the hard way. Teams obsess over video latency — part duration, chunk size, CDN behaviour — and treat audio as a solved afterthought. Then the stream ships and the audio is reliably a beat behind the picture, and nobody can say why. The answer is that audio carries delay the video pipeline does not, and that delay sets a floor you cannot tune away with smaller parts.
The biggest culprit is encoder delay, also called priming. To start cleanly, an AAC encoder prepends silent "priming" samples to the front of the stream — historically 2,112 samples, chosen because that was the common encoder delay across shipping AAC implementations. A correct decoder discards those 2,112 samples on playback, but the delay they represent — about 44 milliseconds at 48 kHz — is real, and it sits in the audio path before a single part is cut. Different codecs carry different amounts: Opus has a low algorithmic delay of about 26.5 milliseconds at its default 20-millisecond frame size, which is part of why it is attractive when latency is the priority. The low-delay AAC variant, AAC-ELD, was designed expressly to cut this figure for conferencing.
The second culprit is the frame grid we have already met. Because audio ships in whole 21.33-millisecond frames, the audio part is rounded to a frame boundary, adding up to one frame of slack versus the video part. The third is the playout buffer: the player holds a small audio buffer to smooth out arrival jitter, and at the very low latencies LL targets, that buffer is the dominant tuning knob for stability versus delay.
Add them up and audio's structural floor — encoder priming plus frame-grid rounding plus a minimal playout buffer — is commonly in the 60-to-120-millisecond range before any network delay. That is below the threshold where most viewers notice lip-sync error (roughly 45 ms of audio-ahead to 125 ms of audio-behind, per the broadcast tolerance windows), but only if the player aligns presentation timestamps correctly. When audio runs late and the player does not compensate, you fall outside the window and viewers feel that the dub is "off", even though every individual component is behaving.
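The tolerance window quoted above can be turned into a simple check for monitoring. This is a sketch: the limits are the approximate figures from the paragraph, the sign convention (positive means audio is presented later than the matching video) is an assumption made here, and the function name is hypothetical.

```python
AUDIO_AHEAD_LIMIT_MS = 45.0    # audio earlier than video, approximate figure from the text
AUDIO_BEHIND_LIMIT_MS = 125.0  # audio later than video, approximate figure from the text


def lip_sync_ok(audio_lag_ms: float) -> bool:
    """audio_lag_ms > 0: audio is behind the picture; < 0: audio is ahead."""
    return -AUDIO_AHEAD_LIMIT_MS <= audio_lag_ms <= AUDIO_BEHIND_LIMIT_MS
```

A measured lag of 100 ms passes and 130 ms does not, which shows how little headroom remains once the audio floor has consumed most of the window.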
Common mistake — chasing video latency while ignoring the audio floor. A team drops part duration from 500 ms to 200 ms to win a second of video latency, then files a bug that audio now lags. The audio floor (priming + frame rounding + buffer) did not shrink with the parts; it just became a larger fraction of the total, so it is suddenly visible. The fix is not smaller audio parts — you cannot subdivide a 21.33 ms frame — but aligning presentation timestamps and accepting that audio runs on a coarser grid than video.
A worked latency budget
Put numbers on it. Suppose you target a three-second glass-to-glass latency with chunked CMAF, 333-millisecond parts, and AAC-LC audio at 48 kHz. A rough audio budget looks like this:
| Stage | Delay | Note |
|---|---|---|
| AAC encoder priming | ~44 ms | 2,112 samples ÷ 48,000, discarded on decode but present in the path |
| Frame-grid rounding | up to ~21 ms | one AAC frame of slack at the part boundary |
| Encode + packaging | ~30–60 ms | encoder lookahead + chunk assembly |
| Network + CDN | ~100–300 ms | depends on edge proximity and chunked-transfer support |
| Player playout buffer | ~150–400 ms | the main stability-vs-latency knob |
| Audio path subtotal | ~350–800 ms | before any deliberate live-edge hold-back |
The arithmetic that matters: the encoder priming and frame-grid terms — roughly 65 milliseconds combined — are fixed by the codec, not by your part size. You can shrink parts and buffers, but you cannot shrink those. That is why "audio is the latency floor": the irreducible audio terms become the limit you converge toward as you tune everything else down.
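The budget can be summed mechanically, which makes it easy to swap in your own measurements. The sketch below encodes the table rows as (low, high) millisecond ranges; treating "up to ~21 ms" as 0 to 21 is an assumption, and the names are illustrative.

```python
SAMPLE_RATE = 48_000
PRIMING_SAMPLES = 2_112
FRAME_SAMPLES = 1_024

priming_ms = PRIMING_SAMPLES / SAMPLE_RATE * 1000       # 44.0
frame_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000           # about 21.33

# (low, high) in milliseconds, taken from the table rows.
BUDGET_MS = {
    "AAC encoder priming": (44, 44),
    "Frame-grid rounding": (0, 21),
    "Encode + packaging": (30, 60),
    "Network + CDN": (100, 300),
    "Player playout buffer": (150, 400),
}


def subtotal(budget: dict) -> tuple[int, int]:
    """Sum the low and high ends of every stage."""
    return (sum(low for low, _ in budget.values()),
            sum(high for _, high in budget.values()))


# Terms fixed by the codec rather than by part size or buffer tuning.
codec_floor_ms = BUDGET_MS["AAC encoder priming"][1] + BUDGET_MS["Frame-grid rounding"][1]
```

The exact sums are 324 ms and 825 ms, which the table rounds to roughly 350 to 800 ms, and the codec-fixed terms add up to 65 ms.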
Codec choice changes the trade
The codec you pick moves the floor. AAC-LC is universal and hardware-decoded everywhere, but it carries the 2,112-sample priming delay and the 21.33-millisecond frame grid. Opus, increasingly supported in CMAF and DASH, offers smaller frame sizes (down to 2.5 ms) and a lower algorithmic delay, so it lets you cut the audio floor when your target devices decode it. For broadcast-grade immersive audio, AC-4 and MPEG-H carry their own delay characteristics and are usually reserved for premium tiers rather than the lowest-latency path.
The practical decision tree is short. If you must reach Apple devices through native HLS, AAC is the safe default and you tune around its floor. If you control the player or target the open web with Opus support, Opus buys you a lower floor and smaller frames. If you need ultra-low latency below what chunked streaming gives — sub-second, conversational — you have left streaming territory entirely and should be looking at WebRTC, where the audio pipeline is built for it.
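The decision tree reads naturally as a small function. This is only a restatement of the paragraph above, with two assumptions made explicit: the sub-second requirement is checked first because it rules out chunked streaming altogether, and AAC-LC is used as the fallback when neither of the other conditions applies. The function and parameter names are hypothetical.

```python
def pick_audio_path(native_apple_hls: bool, opus_decodable: bool,
                    needs_sub_second: bool) -> str:
    """Return the audio approach suggested by the decision tree."""
    if needs_sub_second:
        return "WebRTC"   # conversational latency is outside chunked streaming
    if native_apple_hls:
        return "AAC-LC"   # safe default; tune around its floor
    if opus_decodable:
        return "Opus"     # lower floor and smaller frames
    return "AAC-LC"       # assumed fallback: universal decode support
```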
Where Fora Soft fits in
We build live and interactive video where latency is the product, not a nice-to-have — sports and betting platforms, auction and trading apps, e-learning with live participation, and telemedicine where a clinician and patient need to feel present. In those projects the audio path is where the sync budget is won or lost, and we have spent real engineering hours reconciling the audio frame grid with sub-second video parts, tuning playout buffers for stability at two seconds, and choosing between AAC and Opus per device target. When a client says "the low-latency stream drifts", the answer is almost always in how audio was packaged into the parts, not in the network.
What to read next
- Audio in HLS, DASH, CMAF: how a streaming player picks an audio track
- Audio adaptive bitrate ladders: do you actually need them?
- Frames, packets, granules: why audio is chunked
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your low-latency audio streaming plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Low-Latency Audio — Part-Sizing Cheat Sheet — One-page reference with the per-codec audio frame math (AAC-LC 1024 samples / 21.33 ms, Opus 20 ms), the integer-segment alignment hint (8 s = 375 AAC frames at 48 kHz), the irreducible audio latency floor (priming + frame rounding +….
References
- HTTP Live Streaming 2nd Edition, draft-pantos-hls-rfc8216bis (IETF, active revision of RFC 8216). EXT-X-PART, EXT-X-PRELOAD-HINT, EXT-X-SERVER-CONTROL (CAN-BLOCK-RELOAD, PART-HOLD-BACK), EXT-X-PART-INF (PART-TARGET), and the INDEPENDENT attribute on partial segments. Tier 1. https://datatracker.ietf.org/doc/html/draft-pantos-hls-rfc8216bis
- RFC 8216, HTTP Live Streaming (IETF, Pantos & May, August 2017). The stable published HLS baseline that the low-latency draft extends. Tier 1. https://www.rfc-editor.org/rfc/rfc8216.html
- ISO/IEC 23009-1:2022, Dynamic Adaptive Streaming over HTTP (DASH) - Part 1 (ISO/IEC). AdaptationSet, Representation, availabilityTimeOffset, and the low-latency live profile. Paywalled; paraphrased from the catalogue/abstract and the DASH-IF community guidance. Tier 1. https://www.iso.org/standard/83314.html
- DASH-IF Interoperability: Low-Latency Live Streaming (DASH Industry Forum, Community Review). Chunked CMAF over HTTP/1.1 chunked transfer; availabilityTimeOffset mechanics; the allowance for audio AdaptationSets to run on a different availability schedule from video. Tier 3 (industry consensus guidance over the ISO base). https://dashif.org/docs/DASH-IF-IOP-CR-Low-Latency-Live-Community-Review.pdf
- ISO/IEC 23000-19, Common Media Application Format (CMAF) (ISO/IEC). The fragmented-MP4 chunk (moof+mdat) that both LL-HLS parts and LL-DASH segments address. Paywalled; paraphrased from the catalogue. Tier 1. https://www.iso.org/standard/85623.html
- ISO/IEC 14496-3, Information technology - Coding of audio-visual objects - Part 3: Audio (ISO/IEC). AAC-LC frame length of 1,024 samples; AAC-ELD low-delay variant. Paywalled; the 1,024-sample frame is corroborated by the Apple and Fraunhofer sources below. Tier 1. https://www.iso.org/standard/76383.html
- IETF RFC 6716, Definition of the Opus Audio Codec (IETF, September 2012). Frame sizes of 2.5–60 ms; default 20 ms; ~26.5 ms algorithmic delay. Updated by RFC 8251. Tier 1. https://www.rfc-editor.org/rfc/rfc6716.html
- Technical Note TN2258 / "Audio Priming - Handling Encoder Delay in AAC" (Apple Developer). The 2,112-sample AAC priming/encoder delay and its discard on decode. Tier 3 (vendor doc corroborating the spec-level encoder-delay concept). https://developer.apple.com/documentation/quicktime-file-format/appendix_g_audio_priming_handling_encoder_delay_in_aac
- "An update on Low Latency HLS live streaming" (Mux, Phil Cluff; updated 2025). The removal of HTTP/2 PUSH, the preload-hint-plus-blocking model, and LL-HLS / LL-DASH interoperability via open byte-range requests. Tier 4 (production-deployer blog; used for deployment context, never overriding the HLS draft). https://www.mux.com/blog/low-latency-hls-part-2
- "A guideline to audio codec delay" (Fraunhofer IIS, AES 116th Convention). Algorithmic delay of AAC core and SBR; AAC-ELD delay figures. Tier 5 (peer-reviewed convention paper, primary for the algorithmic-delay numbers). https://www.iis.fraunhofer.de/content/dam/iis/de/doc/ame/conference/AES-116-Convention_guideline-to-audio-codec-delay_AES116.pdf


