Why this matters
If you build a streaming service, an OTT app, a recording feature, or anything that muxes audio and video into a file, the timestamp is the contract that keeps the two in sync. When a customer reports "the audio drifts" or "the sound is half a second behind in the recording," the answer is almost always a timestamp that was generated, copied, or interpreted incorrectly somewhere in the pipeline. A product manager needs to know this exists so they can ask the right question; an engineer needs to know exactly how it works so they can read a stream dump and find the bad number. This article gives both readers one mental model: the timestamp is a clock reading attached to every chunk of media, and synchronization is just arithmetic on those readings.
The one idea behind every timestamp
Start with the problem timestamps solve. A media file or stream is not one thing; it is at least two independent things — an audio track and a video track — that were captured by different hardware, compressed by different encoders, and now have to be played back as if they were one. The only way a player can recombine them correctly is if every piece carries a label saying exactly when, on a shared clock, this piece belongs. That label is the timestamp.
The label answers a single question: at what moment, measured against an agreed clock, should this chunk of media reach the viewer? A video frame labelled "1.000 seconds" and an audio chunk labelled "1.000 seconds" are meant to hit the eyes and ears together. The player's whole job, once the data has arrived, is to read those labels and release each chunk at the moment its label names. Synchronization, at this layer, is not magic — it is reading numbers and obeying them.
Before we can talk about the two kinds of timestamp, we need the unit they are measured in. The chunk of media a timestamp is attached to has a formal name: the access unit. An access unit is the smallest independently presentable piece of a stream — one compressed video frame, or one compressed block of audio samples. Timestamps are attached to access units, not to individual bytes. We cover how streams get chopped into these units in Frames, Packets, Granules: Why Audio Is Chunked; here we only need to know that each access unit is the thing a timestamp points at.
The 90 kHz clock: the heartbeat of MPEG timing
Every timestamp is a reading on a clock, so the first question is: what clock, ticking how fast? In the MPEG-2 systems world — the world of transport streams (used in broadcast, HLS, and .ts files) and program streams (used in DVDs) — the answer is fixed by the standard. The clock ticks 90,000 times per second. A timestamp is simply a count of how many of those ticks have elapsed.
Why 90,000 and not a round million? Because 90 kHz divides cleanly into the frame rates that mattered when MPEG-2 was designed. At 25 frames per second, one frame lasts 90,000 ÷ 25 = 3,600 ticks exactly. At 30 frames per second, one frame is 90,000 ÷ 30 = 3,000 ticks exactly. At 24 frames per second, 90,000 ÷ 24 = 3,750 ticks exactly. A clock that lands on a whole number of ticks for every common frame rate avoids rounding error accumulating frame after frame. That is the entire reason for the odd-looking number.
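The division above is easy to check for yourself. The sketch below is a minimal illustration, not part of any standard: the function name `ticks_per_frame` is our own, and the 30,000/1,001 (29.97 fps) entry is an extra rate we added to the three in the text. Exact fractions are used so that no floating-point rounding hides a non-integer result.

```python
from fractions import Fraction

TIMESTAMP_CLOCK_HZ = 90_000  # MPEG-2 PTS/DTS clock rate, from the text above


def ticks_per_frame(frame_rate):
    """Return the duration of one frame in 90 kHz ticks, as an exact fraction."""
    return Fraction(TIMESTAMP_CLOCK_HZ) / Fraction(frame_rate)


# 24, 25 and 30 fps come from the article; 30000/1001 is an added illustration.
for fps in (Fraction(24), Fraction(25), Fraction(30), Fraction(30_000, 1_001)):
    ticks = ticks_per_frame(fps)
    is_whole = ticks.denominator == 1
    print(f"{float(fps):.3f} fps -> {ticks} ticks per frame (whole number: {is_whole})")
```

Any rate whose result has a denominator other than 1 would need the muxer to spread the remainder across frames, which is exactly the rounding problem the 90 kHz choice avoids for these rates.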
There is a deeper clock underneath. The master system clock in MPEG-2 actually runs at 27 MHz — 27 million ticks per second — and the 90 kHz timestamp clock is that master divided by exactly 300 (27,000,000 ÷ 300 = 90,000). The 27 MHz clock shows up in the program clock reference, the value broadcasters use to keep the decoder's clock locked to the encoder's; we cover that in PCR in MPEG-TS: The Program Clock That Runs the Broadcast. For the timestamps on individual frames, the 90 kHz number is the one you read and write.
The timestamp itself is a 33-bit number. Thirty-three bits can count up to 2³³ − 1 = 8,589,934,591 ticks. Divide by 90,000 ticks per second and you get the wraparound period:
maximum count = 2^33 − 1 = 8,589,934,591 ticks
clock rate = 90,000 ticks per second
wraparound time = 8,589,934,591 ÷ 90,000 ≈ 95,443 seconds ≈ 26.5 hours
So a 33-bit PTS counts cleanly for about 26.5 hours before it rolls over to zero and starts again. For a two-hour film this never matters; for a 24/7 broadcast channel it matters a great deal, and decoders are built to handle the wrap. Ignoring the wrap is one of the common mistakes we return to below.
Figure 2. The 90 kHz clock and its arithmetic. The master clock divided by 300 gives whole-number ticks per frame at every common rate — the reason the standard chose 90,000.
PTS: when to show the frame
The first of the two timestamps is the presentation timestamp, written PTS. It answers the simplest possible question: at what clock reading should this access unit be presented to the viewer — drawn on screen, or pushed to the speakers? A video frame with a PTS of 90,000 should appear exactly one second after the clock's zero point, because 90,000 ticks ÷ 90,000 ticks per second = 1 second.
For audio, PTS is almost the whole story. Audio is decoded in the same order it is played — the first block of samples is also the first block you hear — so for an audio track the presentation order and the decoding order are identical. An audio access unit needs to say "play me at this moment," and that is all. This is why pure-audio streams rarely carry a separate decoding timestamp: there is nothing to reorder.
A worked example makes PTS concrete. Suppose a 48 kHz audio stream is encoded by AAC, which packs 1,024 samples into each access unit (a frame, in AAC terms). How long does one AAC frame last, and what is the PTS spacing between consecutive frames?
frame duration in seconds = samples per frame ÷ sample rate
= 1,024 ÷ 48,000
= 0.021333… seconds (about 21.3 ms)
PTS increment per frame = frame duration × 90,000 ticks per second
= 0.021333… × 90,000
= 1,920 ticks exactly
So a stream of AAC frames at 48 kHz has PTS values 0, 1920, 3840, 5760, and so on — each frame's label is 1,920 ticks higher than the one before. If you ever dump the PTS values of a clean AAC stream and they step by exactly 1,920, the audio timing is healthy. If they step by something else, or jump unevenly, you have found a bug.
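The health check described above can be sketched in a few lines. This is an illustration under stated assumptions: the function names `aac_pts_step` and `find_irregular_steps` are hypothetical, the PTS list is assumed to have been extracted already (for example from a stream dump), and the 33-bit wraparound is deliberately not handled here.

```python
from fractions import Fraction


def aac_pts_step(sample_rate_hz, samples_per_frame=1024, clock_hz=90_000):
    """Expected PTS increment between consecutive AAC frames, in clock ticks."""
    return Fraction(samples_per_frame * clock_hz, sample_rate_hz)


def find_irregular_steps(pts_values, expected_step):
    """Return (index, actual_step) for every frame whose gap from the
    previous frame differs from the expected step."""
    pairs = zip(pts_values, pts_values[1:])
    return [
        (index, later - earlier)
        for index, (earlier, later) in enumerate(pairs, start=1)
        if later - earlier != expected_step
    ]


step = aac_pts_step(48_000)                      # 1,920 ticks at 48 kHz
clean = [0, 1920, 3840, 5760]
damaged = [0, 1920, 3840, 7680]                  # one frame's worth of PTS is missing
print(find_irregular_steps(clean, step))         # no irregular gaps
print(find_irregular_steps(damaged, step))       # gap before frame 3 is doubled
```

At sample rates where the step is not a whole number of ticks, a real stream alternates between two adjacent integer steps, so a stricter or looser comparison would be needed there.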
DTS: when to decode the frame — and why it differs
If presentation order and decoding order were always the same, one timestamp would be enough. They are not, and the reason is the way modern video compression works.
To compress video, encoders exploit the fact that consecutive frames look almost identical. Rather than store every frame in full, an encoder stores a few complete frames and describes the rest as differences from their neighbours. There are three frame types, and the names matter:
- An I-frame (intra-coded) is a complete picture, decodable on its own, referencing nothing.
- A P-frame (predicted) is described as the difference from a previous frame.
- A B-frame (bi-directionally predicted) is described as the difference from both a previous frame and a later one.
That last type is the whole problem. A B-frame depends on a frame that comes after it in display order. The decoder cannot reconstruct the B-frame until it has already decoded the future frame the B-frame refers to. So the encoder reorders the stream: it places the future reference frame earlier in the file than the B-frame that needs it, even though that reference frame is displayed later. The data arrives in decoding order; the viewer needs presentation order. The two orders differ.
This is exactly why two timestamps exist. The decoding timestamp, written DTS, says when the decoder should process an access unit. The presentation timestamp, PTS, says when the result should be shown. In a stream with no B-frames, nothing waits on a future frame, so DTS and PTS are equal. Once B-frames appear, the reference frames they depend on (I-frames and P-frames) get a DTS earlier than their PTS, because each must be decoded ahead of its display moment to be ready when the B-frames need it. The B-frames themselves, in this simple case, are decoded and shown at the same moment.
Walk through a tiny group of pictures. Suppose the display order is I B B P: an I-frame, two B-frames, then a P-frame the B-frames depend on. Because the B-frames need the P-frame, the encoder must put the P-frame into the stream before the B-frames. The stored order on disk becomes I P B B. Now look at the two timestamps for each frame (in frame units for clarity; multiply by the per-frame tick count for real values). Because no frame can be shown before it is decoded, the whole presentation timeline starts one frame after the decoding timeline; that one-frame delay is the room the reorder needs:
| Frame (stream order) | Decode time (DTS) | Presentation time (PTS) | DTS = PTS? |
|---|---|---|---|
| I | 0 | 1 | no, held one frame for the reorder |
| P | 1 | 4 | no, decoded early and shown last |
| B | 2 | 2 | yes |
| B | 3 | 3 | yes |
Table 1. A four-frame group with display order I B B P. The decoder receives them in DTS order (I, P, B, B) and reorders them to PTS order (I, B, B, P) before display. The P-frame is the giveaway: it is decoded second but shown last.
Read the table slowly. The decoder pulls frames off the stream in DTS order — I, then P, then the two Bs. It holds the P-frame in a buffer, decodes the Bs using it, then releases all four to the screen in PTS order — I, B, B, P. The DTS keeps the decoder fed in the right sequence; the PTS keeps the picture correct on screen. Without DTS the decoder would try to build a B-frame before its reference existed, and fail.
Figure 1. The same four frames in decoding order (top, DTS sequence) and presentation order (bottom, PTS sequence). The crossing arrows are the reorder the decoder performs; the P-frame moving from second to last is why DTS and PTS must differ.
A clean rule to carry away: DTS ≤ PTS, always. A frame is never shown before it is decoded. For audio, and for video without B-frames, the two are equal. The gap between them only opens when B-frames force out-of-order decoding.
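The rule and the reorder can both be expressed as a few lines of code. The sketch below is illustrative only: the `AccessUnit` type and the timestamp values are our own assumptions, chosen in frame units with a one-frame reorder delay so that DTS ≤ PTS holds for every frame of an I B B P group.

```python
from typing import NamedTuple


class AccessUnit(NamedTuple):
    kind: str  # "I", "P" or "B"
    dts: int   # decode time, in frame units
    pts: int   # presentation time, in frame units


# Stream (decode) order for display order I B B P. Values are assumed for
# illustration: presentation starts one frame after decoding begins.
stream = [
    AccessUnit("I", dts=0, pts=1),
    AccessUnit("P", dts=1, pts=4),
    AccessUnit("B", dts=2, pts=2),
    AccessUnit("B", dts=3, pts=3),
]


def dts_never_after_pts(units):
    """Check the rule DTS <= PTS for every access unit."""
    return all(unit.dts <= unit.pts for unit in units)


def presentation_order(units):
    """Reorder decoded units into the order the viewer should see them."""
    return sorted(units, key=lambda unit: unit.pts)


decode_sequence = "".join(u.kind for u in sorted(stream, key=lambda u: u.dts))
display_sequence = "".join(u.kind for u in presentation_order(stream))
print(decode_sequence, "->", display_sequence)
print("DTS <= PTS everywhere:", dts_never_after_pts(stream))
```

A real decoder does not sort a whole list; it holds a small number of decoded frames in a buffer and releases each one when its PTS comes due. Sorting is used here only because it makes the two orders easy to compare.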
Where the timestamps physically live
The two-timestamp idea is universal, but the place the numbers are stored differs by container. The two worlds you will meet are MPEG-2 systems (transport and program streams) and the ISO base media file format (MP4 and its relatives).
In MPEG-2 streams: the PES header
In a transport stream or program stream, elementary streams are first wrapped in packetized elementary stream packets — PES packets. Each PES packet has a header, and the timestamps live in that header. A 2-bit field called PTS_DTS_flags tells the decoder what is present: the value 10 means only a PTS follows; the value 11 means both a PTS and a DTS follow. The value 00 means neither — common for the continuation of a large frame split across packets.
The 33-bit timestamp is not stored as a clean 33-bit field. The MPEG-2 systems standard spreads it across 5 bytes (40 bits): a 4-bit prefix at the front, then the 33 data bits split into three groups (3 bits, 15 bits, 15 bits), each group followed by a single marker bit that is always set to 1. That accounts for 4 + (3 + 1) + (15 + 1) + (15 + 1) = 40 bits. The prefix is 0010 when only PTS is present, 0011 for the PTS half of a PTS+DTS pair, and 0001 for the DTS half. The marker bits exist so that a decoder scanning a damaged stream can sanity-check that it is reading a real timestamp and not random bytes. You almost never decode this by hand, because tools like ffprobe do it for you, but knowing the layout explains why a "33-bit" value occupies 5 bytes on the wire.
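For readers who do want to see the bit layout in action, here is a minimal sketch of packing and unpacking the 5-byte field. The function names are hypothetical, and the code covers only the timestamp field itself (4-bit prefix, then 3 + 15 + 15 data bits, each group followed by a marker bit); locating that field inside a full PES header is out of scope.

```python
def encode_pes_timestamp(value, prefix):
    """Pack a 33-bit timestamp and a 4-bit prefix into the 5-byte PES layout."""
    if not 0 <= value < (1 << 33):
        raise ValueError("timestamp must fit in 33 bits")
    return bytes([
        (prefix << 4) | (((value >> 30) & 0x07) << 1) | 1,  # prefix, bits 32..30, marker
        (value >> 22) & 0xFF,                               # bits 29..22
        (((value >> 15) & 0x7F) << 1) | 1,                  # bits 21..15, marker
        (value >> 7) & 0xFF,                                # bits 14..7
        ((value & 0x7F) << 1) | 1,                          # bits 6..0, marker
    ])


def decode_pes_timestamp(field):
    """Unpack a 5-byte PES timestamp field into (prefix, 33-bit value)."""
    if len(field) != 5:
        raise ValueError("a PES timestamp field is exactly 5 bytes")
    if not (field[0] & 1 and field[2] & 1 and field[4] & 1):
        raise ValueError("marker bit not set; this is probably not a timestamp")
    prefix = field[0] >> 4
    value = ((field[0] >> 1) & 0x07) << 30
    value |= field[1] << 22
    value |= (field[2] >> 1) << 15
    value |= field[3] << 7
    value |= field[4] >> 1
    return prefix, value


one_second = encode_pes_timestamp(90_000, prefix=0b0010)  # PTS only
print(one_second.hex(), decode_pes_timestamp(one_second))
```

The marker check is the sanity test mentioned above: five random bytes have only a one-in-eight chance of passing it, so a failure is a cheap early signal that the parser has lost its place in the stream.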
In MP4 / ISOBMFF: tables, not headers
MP4 takes a completely different approach, and this trips up engineers who learned timestamps on transport streams first. An MP4 file does not stamp each frame with an absolute PTS and DTS. Instead it stores tables in the sample table box, and the player computes the timestamps by walking the tables.
Two boxes carry the timing. The stts box ("decoding time to sample") lists the duration of each sample. The decoder starts at time zero and adds durations to compute each sample's DTS — so DTS is implicit, derived by accumulation rather than stored directly. The ctts box ("composition time to sample") stores, for each sample, the offset between its decoding time and its composition (presentation) time. That offset is exactly PTS − DTS. A file with no B-frames has no ctts box at all, because every offset would be zero. A file with B-frames carries a ctts box whose non-zero entries are the same reorder gap we saw in Table 1, expressed per sample.
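Walking the tables is simple accumulation, as the following sketch shows. It assumes the box contents have already been parsed into Python lists; the function names are our own. One detail not spelled out above: both boxes store run-length pairs (a sample count and a value) rather than one entry per sample, so the sketch expands the runs first.

```python
from itertools import accumulate


def expand_runs(entries):
    """Expand (sample_count, value) run-length pairs into one value per sample."""
    return [value for count, value in entries for _ in range(count)]


def sample_times(stts_entries, ctts_entries=None):
    """Return a (dts, pts) pair per sample, in the track's timescale ticks.

    stts_entries: (sample_count, sample_duration) pairs
    ctts_entries: (sample_count, composition_offset) pairs, or None if absent
    """
    durations = expand_runs(stts_entries)
    # DTS of each sample is the sum of the durations of all samples before it.
    dts_values = [0] + list(accumulate(durations))[:-1]
    if ctts_entries:
        offsets = expand_runs(ctts_entries)
    else:
        offsets = [0] * len(durations)  # no ctts box: PTS equals DTS
    return [(dts, dts + offset) for dts, offset in zip(dts_values, offsets)]


# Assumed example: four 24 fps frames at timescale 90,000 in stream order I P B B.
stts = [(4, 3750)]
ctts = [(1, 3750), (1, 11250), (2, 0)]
for dts, pts in sample_times(stts, ctts):
    print(f"DTS {dts:>6}  PTS {pts:>6}")
```

The result is still in ticks. Turning it into seconds needs the track's timescale, and the edit list can shift it again on top of that.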
There is one more box that quietly rewrites timing: the edit list, the elst box. It maps the media's internal timeline onto the presentation timeline, and it is how an MP4 expresses things like "start playback 40 milliseconds into the audio track" or "delay the video by two frames." A surprising share of real-world sync bugs come from an edit list that one tool wrote and another tool ignored. We flag this as a pitfall below.
| Container | Timestamps stored as | DTS source | PTS source | B-frame marker |
|---|---|---|---|---|
| MPEG-2 TS / PS | Explicit values in the PES header | Read directly (if PTS_DTS_flags = 11) | Read directly | Both PTS and DTS present |
| MP4 / ISOBMFF | Tables in the sample table box | Accumulated from stts durations | DTS + ctts offset | ctts box present with non-zero offsets |
Table 2. Where PTS and DTS live in the two container worlds. The numbers carry the same meaning; only the storage and the retrieval method differ. The carriage details for each container are covered in Audio in Containers: How MP4, MKV, fMP4, MPEG-TS Carry Audio.
Figure 3. The same two timestamps stored two ways: explicit values in the MPEG-2 PES header, computed tables in MP4. Reading MP4 timing means walking the stts, ctts, and edit-list boxes against the track's timescale.
The timescale: MP4's flexible clock
The 90 kHz clock is fixed in MPEG-2, but MP4 lets each track declare its own clock rate, called the timescale. The timescale is the number of ticks per second for that track, written into the track's media header. An audio track sampled at 48 kHz commonly uses a timescale of 48,000, so one tick equals one audio sample and durations are whole numbers. A video track might use 30,000, typically with sample durations of 1,001, which expresses the broadcast 29.97 fps rate (30,000 ÷ 1,001) exactly.
The timescale is why you cannot read an MP4 timestamp without first reading the track's clock. A stts duration of 1,024 means 1,024 ticks, and what that is in seconds depends entirely on the timescale. At a timescale of 48,000, 1,024 ticks is the 21.3-millisecond AAC frame from earlier; at a timescale of 90,000 the same number would be only about 11.4 milliseconds. Always divide by the timescale to get seconds:
time in seconds = tick value ÷ timescale
example: stts duration 1,024 at timescale 48,000
= 1,024 ÷ 48,000
= 0.021333… seconds (the 21.3 ms AAC frame)
This per-track flexibility is powerful and is the source of a second common mistake: assuming the video and audio tracks share a timescale. They usually do not, and any code that compares a raw video tick to a raw audio tick without first converting both to seconds will compute nonsense.
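A short sketch makes the difference between raw ticks and seconds visible. The function names and the sample values are assumptions for illustration: a video track at timescale 30,000 and an audio track at timescale 48,000, both sampled at the ten-second mark of a file that is perfectly in sync.

```python
from fractions import Fraction


def ticks_to_seconds(ticks, timescale):
    """Convert a tick count to seconds, exactly, using the track's timescale."""
    return Fraction(ticks, timescale)


def av_offset_seconds(video_ticks, video_timescale, audio_ticks, audio_timescale):
    """Video time minus audio time, in seconds. Positive means video is later."""
    video_s = ticks_to_seconds(video_ticks, video_timescale)
    audio_s = ticks_to_seconds(audio_ticks, audio_timescale)
    return video_s - audio_s


# Both tracks at the 10-second mark (assumed values).
video_ticks, audio_ticks = 300_000, 480_000

wrong = video_ticks - audio_ticks  # raw tick comparison: meaningless
right = av_offset_seconds(video_ticks, 30_000, audio_ticks, 48_000)
print("raw tick difference:", wrong)
print("offset in seconds:", float(right))
```

Exact fractions are used instead of floats so that two timestamps that should be equal compare as equal, with no rounding residue to explain away.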
Common mistakes that cause real sync bugs
Timestamps are simple arithmetic, but the failure modes are specific and recur across almost every team that touches media. Four are worth naming.
Treating PTS as wall-clock time. The most common conceptual error. A PTS of 90,000 does not mean "9 a.m." or "90 seconds after the user pressed play." It is a count on the stream's own clock, whose zero point is arbitrary — many encoders start the first PTS at some non-zero value. PTS tells you the relationship between frames, not the time of day. To map it to wall-clock time you need an external anchor (the program clock reference in broadcast, or the RTCP sender report in WebRTC), which is a separate mechanism covered in RTP Timestamps, RTCP Sender Reports, and NTP Synchronization.
Ignoring the 26.5-hour wraparound. Because the 33-bit clock rolls over after about 26.5 hours, a naive comparison if (pts_a > pts_b) breaks the instant the counter wraps: a frame just after the wrap has a tiny PTS, yet it is later than a frame just before with a huge PTS. Long-running channels and recordings must detect the wrap and account for it, or they will reorder or drop frames around the rollover point.
Dropping the ctts box or the edit list when remuxing. Tools that copy a stream from one container to another sometimes preserve the stts durations but discard the ctts composition offsets or the elst edit list. The result is video whose B-frames are now presented in decoding order — every other frame visibly out of place — or audio that is offset by the exact amount the edit list was meant to correct. When a remux introduces a sync error that was not in the source, a discarded ctts or elst is the first suspect.
Comparing ticks across mismatched clocks. As noted above, an audio track at timescale 48,000 and a video track at timescale 30,000 cannot have their raw tick values compared. Convert both to seconds first. Skipping the conversion produces a sync offset that scales with playback position — small at the start, large near the end — which is the signature of a timescale mismatch rather than a fixed delay.
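Of the four mistakes, the wraparound is the one most easily fixed in code. The sketch below shows one common approach, modular comparison, and is not taken from any particular decoder. It rests on a stated assumption: the two timestamps being compared are truly less than half the wrap period (about 13 hours) apart. The function names are hypothetical.

```python
PTS_MODULUS = 1 << 33          # 33-bit counter: values run 0 .. 2**33 - 1
HALF_RANGE = PTS_MODULUS // 2


def pts_diff(a, b):
    """Signed difference a - b in ticks, correct across one wraparound.

    Assumes the true gap between a and b is less than half the wrap period.
    """
    diff = (a - b) % PTS_MODULUS
    return diff - PTS_MODULUS if diff >= HALF_RANGE else diff


def pts_is_later(a, b):
    """True if timestamp a comes after timestamp b on the wrapping clock."""
    return pts_diff(a, b) > 0


just_before_wrap = PTS_MODULUS - 1_000
just_after_wrap = 500

print(just_after_wrap > just_before_wrap)               # naive comparison
print(pts_is_later(just_after_wrap, just_before_wrap))  # wrap-aware comparison
print(pts_diff(just_after_wrap, just_before_wrap))      # gap in ticks
```

Pipelines that need a timeline longer than the wrap period usually go one step further and keep an unwrapped 64-bit counter, adding 2^33 each time a wrap is detected.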
A worked example: checking sync in a real file
Put the pieces together. Suppose ffprobe reports a video track at timescale 90,000 and an audio track at timescale 48,000, and you want to know whether the first audio frame and the first video frame are meant to play together.
Video: first frame DTS = 0, PTS = 3,750 ticks (timescale 90,000)
Audio: first frame PTS = 0, no DTS (timescale 48,000)
Video first-frame PTS in seconds = 3,750 ÷ 90,000 = 0.041667 s (≈ 41.7 ms)
Audio first-frame PTS in seconds = 0 ÷ 48,000 = 0.000000 s
The video's first presented frame is 41.7 milliseconds after the audio's. The DTS of 0 with a PTS of 3,750 tells you this video has B-frames: the first frame in decoding order is a reference frame shown one frame later. The gap of 3,750 ticks is exactly one frame at 24 fps (90,000 ÷ 24 = 3,750), so this is most likely 24 fps content. An edit list might intentionally trim that 41.7 ms to bring the two to zero, which is why you check the elst box before concluding the file is out of sync. The offset here is inside the lip-sync tolerance window anyway; see What "In Sync" Means: ITU-R BT.1359, Lip-Sync Windows, Perceptibility for why 41.7 ms of video lag is imperceptible.
The point of the exercise is the method, not the numbers: read each track's timescale, convert every tick to seconds, account for DTS-versus-PTS reordering, then check the edit list before you blame the encoder. That sequence finds almost every recorded-media sync bug.
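The method can be sketched with the numbers from this example. This is a simplified illustration, not a full edit-list implementation: it assumes each track has at most one edit entry, that the entry's media time is expressed in the track's own timescale, and that its only effect is to shift the start of the track. The function name and the edit value of 3,750 are hypothetical.

```python
from fractions import Fraction


def first_presented_time(first_pts_ticks, timescale, edit_media_time_ticks=0):
    """Presentation time of a track's first frame, in seconds.

    edit_media_time_ticks models a single edit-list entry that starts
    playback that many ticks into the media (0 means no trim).
    """
    return Fraction(first_pts_ticks - edit_media_time_ticks, timescale)


# Values from the worked example: video PTS 3,750 at 90,000; audio PTS 0 at 48,000.
audio_start = first_presented_time(0, 48_000)
video_start = first_presented_time(3_750, 90_000)
offset_ms = float((video_start - audio_start) * 1000)
print(f"without edit list: video starts {offset_ms:.1f} ms after audio")

# Hypothetical edit list that trims the one-frame reorder delay from the video.
video_start_trimmed = first_presented_time(3_750, 90_000, edit_media_time_ticks=3_750)
trimmed_ms = float((video_start_trimmed - audio_start) * 1000)
print(f"with edit list:    video starts {trimmed_ms:.1f} ms after audio")
```

The same raw timestamps give two different answers depending on whether the edit list is honoured, which is why that check belongs before any conclusion about the encoder.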
Where Fora Soft fits in
We have built recording, transcoding, and streaming features across video conferencing, OTT, e-learning, telemedicine, and surveillance since 2005, and timestamp handling is where recorded-media sync is won or lost. The recurring real-world problem is not exotic: a recording pipeline that muxes a WebRTC call into MP4 has to translate RTP timestamps into stts durations and ctts offsets correctly, and a single dropped edit list or mismatched timescale shows up later as audio drifting away from video over a long recording. When a client reports that a saved session is out of sync while the live call was fine, the timestamp translation step is the first place we look, because that is the seam where two clock systems meet.
What to read next
- PCR in MPEG-TS: The Program Clock That Runs the Broadcast
- RTP Timestamps, RTCP Sender Reports, and NTP Synchronization
- Audio in Containers: How MP4, MKV, fMP4, MPEG-TS Carry Audio
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your PTS/DTS timing plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the PTS / DTS Reference Card — One page: the PTS-vs-DTS sign rule, the 90 kHz tick math for 24/25/30 fps and AAC, and the MPEG-2 PES vs MP4 stts/ctts/elst storage map.
References
1. ISO/IEC 13818-1:2022, "Information technology — Generic coding of moving pictures and associated audio information — Part 1: Systems" (MPEG-2 Systems). Defines the PES packet, the PTS_DTS_flags field, the 33-bit PTS and DTS at 90 kHz, the 5-byte timestamp encoding with marker bits, and the 27 MHz system clock. The controlling standard for transport-stream and program-stream timing. https://www.iso.org/standard/83239.html — Tier 1 (official ISO/IEC standard).
2. ISO/IEC 14496-12:2022, "Information technology — Coding of audio-visual objects — Part 12: ISO base media file format" (ISOBMFF). Defines the stts (decoding time to sample), ctts (composition time to sample), and elst (edit list) boxes, the per-track timescale in the media header, and the rule that composition time = decoding time + composition offset. The controlling standard for MP4 timing. https://www.iso.org/standard/83102.html — Tier 1 (official ISO/IEC standard).
3. ITU-T H.222.0 (2021), "Generic coding of moving pictures and associated audio information: Systems." The ITU-T twin of ISO/IEC 13818-1; identical normative content for PES headers, PTS/DTS encoding, and the 90 kHz/27 MHz clocks. https://www.itu.int/rec/T-REC-H.222.0 — Tier 1 (official ITU-T Recommendation; twin text to Ref. 1).
4. ITU-T H.264 / ISO/IEC 14496-10, "Advanced video coding for generic audiovisual services." Defines I-, P-, and B-frame (bi-predictive) coding and the reference-frame dependencies that force decoding order to differ from display order — the root cause of DTS ≠ PTS. https://www.itu.int/rec/T-REC-H.264 — Tier 1 (official ITU-T Recommendation).
5. ISO/IEC 14496-3:2019, "Coding of audio-visual objects — Part 3: Audio" (AAC). Source for the 1,024-samples-per-AAC-frame fact used in the PTS-increment worked example. https://www.iso.org/standard/76383.html — Tier 1 (official ISO/IEC standard).
6. FFmpeg documentation — ffprobe and timestamps. Reference implementation behaviour for reading PTS, DTS, and timescale from real files; the practical tool the article assumes the reader uses. https://ffmpeg.org/ffprobe.html — Tier 2 (reference-implementation documentation; subordinate to the standards above for any normative claim).
7. MPEG / Leonardo Chiariglione, "An Overview of the ISO Base Media File Format." First-party MPEG explainer of the box hierarchy and the timing model, used for orientation and confirmed against the normative text of Ref. 2. https://mpeg.chiariglione.org/standards/mpeg-4/iso-base-media-file-format — Tier 3 (first-party explainer; subordinate to Ref. 2).
8. Apple, "QuickTime File Format Specification" — Edit Lists and Sample Tables. Practitioner-grade description of stts, ctts, and edit-list behaviour as implemented in the format MP4 derives from; used for the edit-list pitfall, cross-checked against Ref. 2. https://developer.apple.com/documentation/quicktime-file-format — Tier 4 (vendor documentation; the article defers to ISO/IEC 14496-12 where they differ).
Note on a corrected oversimplification: many online explainers state that "PTS is the time the frame plays" and stop there, implying PTS is wall-clock time. This article follows ISO/IEC 13818-1, in which PTS is a count on the stream's own 90 kHz clock with an arbitrary zero point; mapping it to wall-clock time requires a separate reference clock (PCR or RTCP), and the article flags the wall-clock framing as the single most common conceptual error.


