Why this matters
If you run an OTT service, a webinar replay platform, an e-learning library, or any video-on-demand product, "the sound is slightly behind the picture" is one of the most damaging complaints you can get, because it makes professionally produced content feel cheap. The frustrating part is that the same file often plays perfectly in one player and drifts in another, which sends teams chasing the player when the real fault is in how the content was packaged. A product manager needs to understand why a stream that passed QA on a desktop browser falls out of sync the moment an ad rolls, and an engineer needs to know which knob — the edit list, the period offset, the discontinuity sequence — actually controls the timing. This article gives both of you the mental model to find the fault fast and tell the packaging team what to change.
Figure 1. Audio and video are encoded and packaged as separate tracks, then re-aligned at the player. Each hand-off is a place the timing can slip — and most failures are introduced before the segments ever leave the packager.
First, what "in sync" actually requires
In real-time calling, a receiver lines up two live streams against a shared wall clock. Adaptive streaming is different: there is no live capture clock to anchor to, and the player is not receiving a single muxed file. It is downloading a list of small segments — typically two to ten seconds each — for the video track, and a separate list of segments for each audio track, and it has to lay them onto one internal timeline so the right sound comes out when the right frame appears.
The number that defines "right" is the same one broadcast uses. The standard ITU-R BT.1359-1 sets the limit of acceptability at audio leading the picture by about 90 milliseconds, or lagging it by about 185 milliseconds. Inside that window almost nobody notices. Outside it, the speaker looks dubbed and the product looks broken. So the player's job is not perfect zero-offset alignment; it is keeping the offset inside that window for the whole playback, including across quality switches and ad breaks.
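As a minimal sketch, the window can be written as a simple range check. The sign convention (positive means audio arrives late) and the function name are assumptions made for illustration; the limits are the approximate figures quoted above.

```python
# Approximate acceptability limits from ITU-R BT.1359-1, as quoted in the text.
MAX_AUDIO_LEAD_MS = 90.0   # audio ahead of the picture
MAX_AUDIO_LAG_MS = 185.0   # audio behind the picture


def within_lipsync_window(audio_lag_ms: float) -> bool:
    """Return True if the offset is inside the acceptability window.

    Assumed convention: positive values mean audio lags the picture,
    negative values mean audio leads it.
    """
    return -MAX_AUDIO_LEAD_MS <= audio_lag_ms <= MAX_AUDIO_LAG_MS


print(within_lipsync_window(44.0))    # a small lag
print(within_lipsync_window(-120.0))  # audio leads by 120 ms
```

The asymmetry is the point: the same absolute offset can be acceptable as a lag and unacceptable as a lead.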
To do that, the player relies entirely on the timestamps written into each segment. There is no separate "sync track". The presentation timestamp, called the PTS, tells the player at what moment on the media timeline each audio sample and each video frame is meant to be presented. If the PTS values on the audio and video segments describe the same timeline correctly, sync is automatic. If anything along the chain shifts one track's PTS relative to the other, the player faithfully reproduces the error.
The two container worlds: MPEG-TS and fragmented MP4
Almost every sync bug in HLS and DASH traces back to which container carries the media, so it is worth being precise about the two.
Historically, HLS used MPEG transport stream segments, the .ts files. MPEG-TS came from broadcast, where streams are continuous and there is no concept of a file with a defined start and end. The modern world uses fragmented MP4 — the .m4s or .mp4 segments built on the ISO base media file format, which is the basis of the CMAF packaging that HLS and DASH now share. DASH has always used fragmented MP4. HLS has supported it since 2016, and the Common Media Application Format, standardized as ISO/IEC 23000-19, now lets a single set of fragmented-MP4 segments serve both protocols, with only the manifest differing — .m3u8 for HLS, .mpd for DASH.
This difference is not academic. Fragmented MP4 has a feature called the edit list — a small box inside the file, named elst, that says "trim this many samples off the front before presenting." MPEG-TS has no equivalent. As you will see in a moment, that single missing feature is the root cause of the most common silent sync error in HLS.
Where it breaks #1: audio priming
This is the failure almost nobody expects, because it is built into how audio compression works.
When an audio encoder — AAC is the usual one in streaming — starts compressing a signal, its filters need a running start. They cannot produce a correct first output sample until they have processed some input, so the encoder prepends a short run of silent samples called priming, also called encoder delay. The decoder reproduces this silence at the start, which means the decoded audio is shifted later than it should be by exactly the priming length. The amount is encoder-specific and precise: Apple's AAC encoder uses 2112 samples by default, the widely used FDK-AAC encoder typically produces 2048, and FFmpeg's native AAC encoder produces 1024. At a 48,000 Hz sample rate, walk the arithmetic out:
priming delay (seconds) = priming samples ÷ sample rate
2112 samples ÷ 48,000 Hz = 0.044 seconds = 44 milliseconds
Forty-four milliseconds of audio lag, baked in at the encoder, before the content has gone anywhere. That is roughly a quarter of the 185-millisecond ITU-R lag tolerance, consumed by a problem you did not know you had.
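The arithmetic above can be repeated for each encoder mentioned. This is a small sketch; the dictionary keys are illustrative labels rather than real encoder identifiers, and the sample counts are the defaults quoted in the text.

```python
# Priming sample counts quoted in the text; the keys are illustrative labels.
PRIMING_SAMPLES = {
    "apple_aac": 2112,
    "fdk_aac": 2048,
    "ffmpeg_native_aac": 1024,
}


def priming_delay_ms(priming_samples: int, sample_rate_hz: int) -> float:
    """Convert an encoder's priming length into milliseconds of delay."""
    return priming_samples / sample_rate_hz * 1000.0


for name, samples in PRIMING_SAMPLES.items():
    print(f"{name}: {priming_delay_ms(samples, 48_000):.1f} ms at 48 kHz")
```

Running the same function at 44,100 Hz gives slightly larger delays, because the same number of samples covers more time at a lower sample rate.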
In a fragmented-MP4 world this is handled cleanly. The packager writes an edit list — the elst box with a media_time equal to the priming sample count — and a compliant player reads it and trims the priming before presenting, so the audio lands where it should. This is the same mechanism that delivers gapless playback in music files. The trouble is that MPEG-TS cannot carry an edit list. When HLS packages audio into .ts segments, the priming silence has nowhere to be described, so it plays out as real audio. The result is a small, fixed audio delay that the player has no instruction to remove.
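To make the trimming concrete, here is a simplified sketch of what honouring an edit list amounts to. It assumes a single edit whose media_time is expressed in the track's timescale, and an audio track whose timescale equals its sample rate; the function name is hypothetical, and real players handle more cases than this.

```python
from typing import Optional


def presentation_time_s(
    sample_media_time: int, elst_media_time: int, timescale: int
) -> Optional[float]:
    """Map a media-timeline position to presentation time under one edit.

    Positions before the edit's media_time are trimmed (returns None).
    Passing elst_media_time=0 models a pipeline with no edit list.
    """
    shifted = sample_media_time - elst_media_time
    if shifted < 0:
        return None  # priming region: not presented
    return shifted / timescale


TIMESCALE = 48_000  # assumed: audio timescale equals the sample rate
PRIMING = 2112      # priming length used in the example above

# With the edit list, the first real sample is presented at time zero.
print(presentation_time_s(PRIMING, PRIMING, TIMESCALE))
# Without it, the same sample is presented 44 ms late.
print(presentation_time_s(PRIMING, 0, TIMESCALE))
```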
Pitfall — the "it's fine in Safari" trap. A real, documented case in the hls.js project showed a source MP4 whose video track carried an edit-list media_time of 1024 and whose audio track carried 2048. Safari, which understands the edit, rendered it one way; browser players using Media Source Extensions buffered the video with a start gap equal to the first frame's composition offset, about 83 milliseconds on a 24 fps clip, and the renditions no longer lined up with each other. Same content, same manifest, different player behaviour — which is exactly why teams misdiagnose this as a player bug instead of a packaging choice.
The fix lives entirely in packaging. Either move to fragmented-MP4 / CMAF segments so the edit list survives, or make sure the encoder applies a consistent, declared priming across every audio rendition so the offset is at least uniform. The failure mode you must avoid is different priming on different renditions: when the player switches audio quality and the priming changes, the audio jumps relative to the video mid-stream.
Where it breaks #2: renditions that don't share the same edit
Adaptive streaming offers several quality levels — several video bitrates, sometimes several audio bitrates — and the player switches between them as bandwidth changes. For sync to survive a switch, every rendition must describe the same timeline. If the highest-quality video has a first-frame composition offset of 83 milliseconds and the lowest-quality video was encoded without it, the moment the player drops to the low rendition the video shifts relative to the audio.
This is the rendition-alignment problem, and it is insidious because it only appears under network stress, which is the one condition QA on a fast office connection never reproduces. The player aligns segments using the first frame's presentation time, and if some renditions carry an edit or a composition offset that others don't, the alignment is inconsistent and the buffer develops gaps at switch points. The hls.js maintainers put the rule plainly: media engineers should provide mezzanines with consistent edits across all qualities. Sync is not only about audio-versus-video; it is also about video-versus-video across the bitrate ladder.
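A packaging pipeline can check this before publishing. The sketch below assumes you have already extracted each rendition's first presentation time in seconds with whatever probing tool you use; the rendition names, the values, and the one-millisecond tolerance are illustrative assumptions.

```python
def rendition_start_spread_s(first_pts_s: dict[str, float]) -> float:
    """Difference between the latest and earliest first presentation time."""
    values = list(first_pts_s.values())
    return max(values) - min(values)


def ladder_is_aligned(first_pts_s: dict[str, float], tolerance_s: float = 0.001) -> bool:
    """True if every rendition starts at (nearly) the same presentation time."""
    return rendition_start_spread_s(first_pts_s) <= tolerance_s


# Hypothetical ladder: the low rendition was encoded without the 83 ms offset.
ladder = {"1080p": 0.083, "720p": 0.083, "360p": 0.0}
print(ladder_is_aligned(ladder))
print(f"spread: {rendition_start_spread_s(ladder) * 1000:.0f} ms")
```

The same check applies to audio renditions, where a non-zero spread usually points at different priming across bitrates.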
Where it breaks #3: ad insertion
If audio priming is the most common quiet error, ad insertion is the most common loud one — the break where sync visibly falls apart.
The reason is that an ad is different content, encoded separately, by a different system, often with a different audio encoder and therefore a different priming, spliced into the main timeline at runtime. HLS and DASH mark this splice in different ways, and each has its own failure mode.
In HLS, the splice is marked with an EXT-X-DISCONTINUITY tag. A discontinuity tells the player "the timeline you have been following stops here; the next segment starts a new timing context with its own timestamps." The player resets and re-bases its clock on the new segment's PTS. This is correct and necessary — the ad's timestamps have no relationship to the main content's — but it means every assumption the player made about audio-video alignment is rebuilt from scratch at the boundary. If the ad's audio carries a different priming, or its segments are not internally aligned, the sync that was fine through the main content can be wrong through the ad and stay wrong after it.
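To see where a playlist asks the player to re-base its clock, it helps to label every segment with the timing context it belongs to. The sketch below is a deliberately minimal reader, not a full HLS parser: it only looks at the EXT-X-DISCONTINUITY and EXT-X-DISCONTINUITY-SEQUENCE tags, and the sample playlist and segment names are made up for illustration.

```python
def segments_by_discontinuity(playlist_text: str) -> list[tuple[int, str]]:
    """Return (discontinuity index, segment URI) for each segment in order."""
    index = 0
    result = []
    for raw in playlist_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#EXT-X-DISCONTINUITY-SEQUENCE:"):
            # Starting value for the counter (defaults to 0 when absent).
            index = int(line.split(":", 1)[1])
        elif line == "#EXT-X-DISCONTINUITY":
            # The next segment starts a new timing context.
            index += 1
        elif not line.startswith("#"):
            result.append((index, line))
    return result


SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:4
#EXTINF:4.0,
main_0.ts
#EXTINF:4.0,
main_1.ts
#EXT-X-DISCONTINUITY
#EXTINF:4.0,
ad_0.ts
#EXT-X-DISCONTINUITY
#EXTINF:4.0,
main_2.ts
"""

for index, uri in segments_by_discontinuity(SAMPLE_PLAYLIST):
    print(index, uri)
```

Each change in the index is a boundary where audio-video alignment is rebuilt, so those are the segments whose audio and video timestamps are worth inspecting first.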
In DASH, the equivalent is a new Period. The MPD describes the main content as one Period and the ad as another, and each segment's presentation time is relative to the start of its own Period, not to the start of the whole presentation. Here is where a specific number, the presentationTimeOffset, decides whether sync survives — and getting it wrong is the single most common DASH ad-insertion bug.
The DASH presentationTimeOffset, worked out
The player places each segment in its internal buffer using the segment's own earliest presentation time plus an offset it computes from the manifest. The W3C Media Source Extensions API exposes this as timestampOffset. The relationship the player uses is:
buffer position = MSE.timestampOffset + segment earliest presentation time
MSE.timestampOffset = Period@start − Period@presentationTimeOffset
Take a concrete midroll: 8 seconds of main content, then a 4-second ad, then more main content. The third Period — the main content resuming — does not restart its segment timeline at zero; its first segment carries an earliest presentation time of 8 (where it left off). If the packager sets only the Period start and leaves presentationTimeOffset at zero, the player computes:
Period@start = 12, presentationTimeOffset = 0
buffer position = (12 − 0) + 8 = 20 seconds
The segment that should sit at 12 seconds on the timeline lands at 20. The buffer now has an 8-second hole, the player stalls or jumps the gap, and audio and video can end up mapped to different positions — the classic "sync gets worse after every ad break" report. Set the offset correctly, to the earliest presentation time of that Period's first segment:
Period@start = 12, presentationTimeOffset = 8
buffer position = (12 − 8) + 8 = 12 seconds
Now the segment lands exactly where it belongs and the timeline is continuous. The Fraunhofer dash.js team, who maintain the reference DASH player, identify this exact offset error as one of the most frequent causes of broken multi-period playback. The rule from their guidance: presentationTimeOffset must equal the earliest presentation time of the Period's first segment, expressed in the track's own timescale. Forget the timescale and the number is off by orders of magnitude.
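The same midroll can be worked through with the timescale made explicit. This is a sketch of the relationship given above, not the code of any player; the 90,000 ticks-per-second timescale and the function names are assumptions chosen for illustration.

```python
def mse_timestamp_offset_s(period_start_s: float, pto_ticks: int, timescale: int) -> float:
    """Period start minus presentationTimeOffset, with the offset converted to seconds."""
    return period_start_s - pto_ticks / timescale


def buffer_position_s(
    period_start_s: float, pto_ticks: int, timescale: int, segment_ept_ticks: int
) -> float:
    """Where a segment lands in the buffer, in seconds."""
    offset = mse_timestamp_offset_s(period_start_s, pto_ticks, timescale)
    return offset + segment_ept_ticks / timescale


TIMESCALE = 90_000             # assumed track timescale
FIRST_EPT = 8 * TIMESCALE      # resumed content starts at 8 s of media time

print(buffer_position_s(12.0, 0, TIMESCALE, FIRST_EPT))          # offset left at zero
print(buffer_position_s(12.0, FIRST_EPT, TIMESCALE, FIRST_EPT))  # offset set correctly
print(buffer_position_s(12.0, 8, TIMESCALE, FIRST_EPT))          # seconds written where ticks were expected
```

The third call shows the timescale mistake: writing 8 where 720,000 was needed leaves the segment almost exactly as misplaced as having no offset at all.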
Figure 2. The same midroll, two manifests. On the left, presentationTimeOffset is left at zero and the resumed content lands eight seconds too late. On the right, the offset equals the first segment's earliest presentation time and the timeline stays continuous.
Where it breaks #4: gaps in the buffer
Both of the above produce the same downstream symptom — a discontinuity the player's media buffer cannot bridge. Browser-based players built on Media Source Extensions are strict: most cannot tolerate a gap in the buffered timeline and will stall the instant playback reaches one. The Fraunhofer team lists two main causes of these gaps: periods or segments that do not align at their boundaries, and segments whose summed sample duration is shorter than the duration the manifest advertises.
That second cause is subtle and worth naming. If a manifest says a segment is exactly 2.000 seconds long but the audio samples inside add up to 1.998 seconds — because the sample count and the sample rate do not divide evenly into the segment duration — every segment leaves a 2-millisecond hole. One hole is nothing. Six hundred segments into a long video-on-demand asset, the accumulated audio shift is over a second, and the audio drifts steadily behind the video. This is mechanically identical to the drift you correct with resampling or frame drops in real-time pipelines, except here it is authored into the segments and no amount of player cleverness fully removes it.
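The accumulation is easy to reproduce. The sketch below uses exact fractions to avoid floating-point noise. The second helper assumes 1024 samples per AAC frame, which is not stated in the text above, and it shows one way a shortfall can arise, namely a segment that can only hold whole audio frames; the function names are hypothetical.

```python
from fractions import Fraction


def accumulated_shortfall_s(
    advertised_s: Fraction, actual_s: Fraction, segment_count: int
) -> Fraction:
    """Total hole left after segment_count segments that each run short."""
    return (advertised_s - actual_s) * segment_count


def whole_frame_duration_s(
    target_s: Fraction, sample_rate_hz: int, frame_samples: int = 1024
) -> Fraction:
    """Longest duration not exceeding target_s that holds only whole audio frames.

    frame_samples=1024 is an assumed AAC frame size.
    """
    whole_frames = (target_s * sample_rate_hz) // frame_samples
    return Fraction(whole_frames * frame_samples, sample_rate_hz)


# The example from the text: 2 ms short per segment, 600 segments.
drift = accumulated_shortfall_s(Fraction("2.000"), Fraction("1.998"), 600)
print(float(drift), "seconds")

# Whole frames that fit in a 2-second segment at 48 kHz.
print(float(whole_frame_duration_s(Fraction(2), 48_000)), "seconds of audio")
```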
Players defend against gaps with gap-jumping logic — dash.js and hls.js both have it — but a jump is a patch over a packaging fault, not a fix. It papers over small holes at the cost of a tiny visible skip, and it cannot recover the sync relationship that the hole destroyed.
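For diagnosis, it can help to list the holes in a set of buffered ranges yourself. This is a sketch of the idea only and is not how dash.js or hls.js implement gap jumping; the half-second threshold is a made-up value for illustration.

```python
def find_gaps(ranges: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the uncovered (start, end) holes between buffered ranges."""
    ordered = sorted(ranges)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps


def jumpable(gaps: list[tuple[float, float]], max_jump_s: float = 0.5) -> list[tuple[float, float]]:
    """Gaps small enough to skip under the hypothetical threshold."""
    return [(start, end) for start, end in gaps if end - start <= max_jump_s]


# The mis-offset midroll from the DASH example: content ends at 12 s, resumes at 20 s.
buffered = [(0.0, 12.0), (20.0, 24.0)]
print(find_gaps(buffered))
print(jumpable(find_gaps(buffered)))
```

An eight-second hole is far beyond anything a small jump can hide, which is why that manifest error shows up as a stall rather than a skip.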
The single highest-leverage decision: encode once with CMAF
Most of these failure modes share a root cause: audio and video, or the HLS and DASH variants of the same content, were produced by separate processes that did not agree on a timeline. The structural fix is to stop producing parallel packagings.
A single CMAF encode produces one set of fragmented-MP4 segments, with one consistent set of timestamps and edit lists, that both an HLS manifest and a DASH manifest point at. The audio and video are aligned once, at encode time, and that alignment is identical no matter which protocol the player speaks. Edit lists survive because fragmented MP4 carries them. The bitrate ladder shares a timeline because it was cut from one mezzanine. You remove an entire class of "fine in DASH, broken in HLS" bugs by never letting the two formats diverge in the first place. This is why the industry consensus in 2026 is to deliver both protocols from a single CMAF source rather than maintaining two packaging pipelines.
CMAF does not magically fix ad insertion — a spliced ad is still separate content and still needs correct discontinuity and period handling — but it removes the container-mismatch and rendition-mismatch errors that cause most of the rest.
How to diagnose it: a short decision path
When a stream is out of sync, the symptom tells you where to look. A small, constant audio lag from the very first second, identical on every player that doesn't honour edit lists, is almost always audio priming in an MPEG-TS or edit-list-stripped pipeline. Sync that is fine in the main content but breaks at the first ad and gets worse with each subsequent break is a discontinuity or period-offset error. A steady, slow drift that is imperceptible at the start and obvious twenty minutes in is segment-duration rounding accumulating. Sync that is fine on a fast connection and breaks only when the network forces a quality switch is a rendition-alignment problem — inconsistent edits across the ladder.
Notice what is not on that list: the codec, the CDN, and the player are rarely the actual fault, even though they are usually blamed first. The fault is almost always a number written, or not written, at the packager.
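The decision path can be captured as a small lookup for a triage script or a support runbook. The parameter names and the order in which symptoms are checked are assumptions; the causes are the ones listed above.

```python
def likely_cause(
    constant_lag_from_start: bool = False,
    breaks_at_ad_boundaries: bool = False,
    slow_drift_over_time: bool = False,
    only_after_quality_switch: bool = False,
) -> str:
    """Map an observed symptom to the packaging fault it most often indicates."""
    if constant_lag_from_start:
        return "audio priming exposed (MPEG-TS or edit list stripped)"
    if breaks_at_ad_boundaries:
        return "discontinuity or period-offset error"
    if slow_drift_over_time:
        return "segment-duration rounding accumulating"
    if only_after_quality_switch:
        return "rendition alignment: inconsistent edits across the ladder"
    return "no match: inspect the packager's timestamps directly"


print(likely_cause(breaks_at_ad_boundaries=True))
```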
A comparison table: HLS vs DASH sync mechanics
| Aspect | HLS | DASH |
|---|---|---|
| Original container | MPEG-TS (.ts), now fMP4/CMAF | Fragmented MP4 from the start |
| Edit list / priming trim | Lost in MPEG-TS, preserved in fMP4 | Preserved (fMP4) |
| Ad-break marker | EXT-X-DISCONTINUITY | New Period |
| Timeline reset control | Discontinuity sequence | Period@start + presentationTimeOffset |
| Most common sync bug | Priming exposed in .ts; rendition edit mismatch | Wrong presentationTimeOffset across periods |
| Shared modern fix | Single CMAF encode (ISO/IEC 23000-19) | Single CMAF encode (ISO/IEC 23000-19) |
Where Fora Soft fits in
We build streaming and OTT products where this matters in production, not in theory — video-on-demand libraries for e-learning, live and recorded streaming for media platforms, telemedicine systems where a clinician reviews recorded consultations, and video surveillance playback. In each of these, lip-sync that holds across quality switches and ad or chapter boundaries is the difference between a product that feels finished and one that feels broken. Most of the sync work we do is not in the player at all; it is making sure the packaging pipeline writes consistent edit lists, shares one CMAF encode across HLS and DASH, and handles period and discontinuity boundaries correctly. That is the layer where streaming lip-sync is actually won or lost.
What to read next
- Audio in HLS, DASH, CMAF: how a streaming player picks an audio track
- What "in sync" means: ITU-R BT.1359, lip-sync windows, perceptibility
- PTS, DTS, and the elementary stream timestamp
Call to action
- Talk to an audio engineer — book a 30-minute scoping call to talk through your HLS/DASH lip-sync plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the HLS / DASH Lip-Sync Debugging Checklist — One page: the four failure modes (priming, rendition edits, ad-insertion period/discontinuity reset, segment-duration drift), the DASH presentationTimeOffset formula, the symptom-to-cause decision path, and why the fix is at the packager.
References
- ITU-R BT.1359-1, Relative Timing of Sound and Vision for Broadcasting (1998) — the lip-sync tolerance window (audio lead ≈ 90 ms, lag ≈ 185 ms). Official ITU-R Recommendation. https://www.itu.int/rec/R-REC-BT.1359
- ISO/IEC 23009-1:2022, Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats — Period structure, @presentationTimeOffset, segment timeline. Official ISO/IEC standard. https://standards.iso.org/ittf/PubliclyAvailableStandards/
- ISO/IEC 23000-19:2024, Common media application format (CMAF) for segmented media — fragmented-MP4 packaging shared by HLS and DASH. Official ISO/IEC standard. https://www.iso.org/standard/85623.html
- Apple Developer Documentation, Audio Priming — Handling Encoder Delay in AAC (QuickTime File Format, Appendix G) — 2112-sample default priming, edit-list trimming. First-party vendor specification. https://developer.apple.com/documentation/quicktime-file-format/appendix_g_audio_priming_handling_encoder_delay_in_aac
- Apple Developer, HTTP Live Streaming (HLS) Authoring Specification for Apple Devices — rendition alignment, EXT-X-DISCONTINUITY, fMP4 support. First-party vendor specification. https://developer.apple.com/documentation/http-live-streaming/hls-authoring-specification-for-apple-devices
- IETF, HTTP Live Streaming 2nd Edition (draft-pantos-hls-rfc8216bis) — the EXT-X-DISCONTINUITY tag and discontinuity sequence semantics. IETF Internet-Draft (work in progress; cited by draft date). https://datatracker.ietf.org/doc/html/draft-pantos-hls-rfc8216bis
- W3C, Media Source Extensions™ Recommendation — SourceBuffer.timestampOffset, buffer positioning, gap handling. W3C Recommendation. https://www.w3.org/TR/media-source/
- Daniel Silhavy & Stefan Pham (Fraunhofer FOKUS, dash.js maintainers), Common pitfalls in MPEG-DASH streaming (2020) — multi-period presentationTimeOffset arithmetic and buffer-gap causes. First-party engineering source for the reference DASH player. The article follows ISO/IEC 23009-1 where the blog simplifies. https://websites.fraunhofer.de/video-dev/common-pitfalls-in-mpeg-dash-streaming/
- video-dev/hls.js Issue #2468, Align segments on first DTS rather than PTS — documented edit-list / composition-offset rendition-misalignment case. Production engineering source. https://github.com/video-dev/hls.js/issues/2468
- CTA-5005-A, DASH-HLS Interoperability Specification — single-encode interoperability between HLS and DASH. Industry standard (CTA WAVE). https://cdn.cta.tech/cta/media/media/resources/standards/cta-5005-a-final.pdf
- DASH-IF, Guidelines: Restricted Timing Model — the normative DASH timing model used to interpret period and segment timing. Industry guideline. https://dashif.org/Guidelines-TimingModel/


