Why this matters

If you build a streaming service, an OTT app, a recording feature, or anything that muxes audio and video into a file, the timestamp is the contract that keeps the two in sync. When a customer reports "the audio drifts" or "the sound is half a second behind in the recording," the answer is almost always a timestamp that was generated, copied, or interpreted incorrectly somewhere in the pipeline. A product manager needs to know this exists so they can ask the right question; an engineer needs to know exactly how it works so they can read a stream dump and find the bad number. This article gives both readers one mental model: the timestamp is a clock reading attached to every chunk of media, and synchronization is just arithmetic on those readings.


The one idea behind every timestamp

Start with the problem timestamps solve. A media file or stream is not one thing; it is at least two independent things — an audio track and a video track — that were captured by different hardware, compressed by different encoders, and now have to be played back as if they were one. The only way a player can recombine them correctly is if every piece carries a label saying exactly when, on a shared clock, this piece belongs. That label is the timestamp.

The label answers a single question: at what moment, measured against an agreed clock, should this chunk of media reach the viewer? A video frame labelled "1.000 seconds" and an audio chunk labelled "1.000 seconds" are meant to hit the eyes and ears together. The player's whole job, once the data has arrived, is to read those labels and release each chunk at the moment its label names. Synchronization, at this layer, is not magic — it is reading numbers and obeying them.

Before we can talk about the two kinds of timestamp, we need the unit they are measured in. The chunk of media a timestamp is attached to has a formal name: the access unit. An access unit is the smallest independently presentable piece of a stream — one compressed video frame, or one compressed block of audio samples. Timestamps are attached to access units, not to individual bytes. We cover how streams get chopped into these units in Frames, Packets, Granules: Why Audio Is Chunked; here we only need to know that each access unit is the thing a timestamp points at.


The 90 kHz clock: the heartbeat of MPEG timing

Every timestamp is a reading on a clock, so the first question is: what clock, ticking how fast? In the MPEG-2 systems world — the world of transport streams (used in broadcast, HLS, and .ts files) and program streams (used in DVDs) — the answer is fixed by the standard. The clock ticks 90,000 times per second. A timestamp is simply a count of how many of those ticks have elapsed.

Why 90,000 and not a round million? Because 90 kHz divides cleanly into the frame rates that mattered when MPEG-2 was designed. At 25 frames per second, one frame lasts 90,000 ÷ 25 = 3,600 ticks exactly. At 30 frames per second, one frame is 90,000 ÷ 30 = 3,000 ticks exactly. At 24 frames per second, 90,000 ÷ 24 = 3,750 ticks exactly. A clock that lands on a whole number of ticks for every common frame rate avoids rounding error accumulating frame after frame. That is the entire reason for the odd-looking number.
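The divisions above are easy to reproduce. The following is a minimal sketch, not part of any standard or library: the helper name `ticks_per_frame` is made up for illustration, and the 30000/1001 entry (the 29.97 fps broadcast rate) is added here as an extra check. Exact fractions are used so that no floating-point rounding hides a non-integer result.

```python
from fractions import Fraction

MPEG_TICKS_PER_SECOND = 90_000  # the 90 kHz PTS/DTS clock


def ticks_per_frame(frame_rate: Fraction) -> Fraction:
    """Duration of one frame in 90 kHz ticks, kept exact as a fraction."""
    return Fraction(MPEG_TICKS_PER_SECOND) / frame_rate


rates = [
    ("24 fps", Fraction(24)),
    ("25 fps", Fraction(25)),
    ("30 fps", Fraction(30)),
    ("29.97 fps (30000/1001)", Fraction(30_000, 1_001)),
]

for label, rate in rates:
    ticks = ticks_per_frame(rate)
    # A denominator of 1 means the frame lasts a whole number of ticks.
    print(f"{label}: {ticks} ticks, whole number: {ticks.denominator == 1}")
```

Each of these rates lands on a whole number of ticks, so a muxer can step the timestamp by a fixed integer per frame without accumulating rounding error.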

There is a deeper clock underneath. The master system clock in MPEG-2 actually runs at 27 MHz — 27 million ticks per second — and the 90 kHz timestamp clock is that master divided by exactly 300 (27,000,000 ÷ 300 = 90,000). The 27 MHz clock shows up in the program clock reference, the value broadcasters use to keep the decoder's clock locked to the encoder's; we cover that in PCR in MPEG-TS: The Program Clock That Runs the Broadcast. For the timestamps on individual frames, the 90 kHz number is the one you read and write.

The timestamp itself is a 33-bit number. Thirty-three bits can count up to 2³³ − 1 = 8,589,934,591 ticks. Divide by 90,000 ticks per second and you get the wraparound period:

maximum count   = 2^33 − 1      = 8,589,934,591 ticks
clock rate       = 90,000 ticks per second
wraparound time  = 8,589,934,591 ÷ 90,000 ≈ 95,443 seconds ≈ 26.5 hours

So a 33-bit PTS counts cleanly for about 26.5 hours before it rolls over to zero and starts again. For a two-hour film this never matters; for a 24/7 broadcast channel it matters a great deal, and decoders are built to handle the wrap. This is the first "common mistake" trap, which we return to below.

[Figure: two-panel diagram of the MPEG timing clock. Left panel: a 27 MHz master system clock, divided by 300, gives the 90 kHz timestamp clock; a PTS or DTS is a 33-bit count of those ticks, and 2^33 ticks ÷ 90,000 is about 26.5 hours before the counter rolls to zero. Right panel: one frame in ticks (24 fps = 3,750; 25 fps = 3,600; 30 fps = 3,000), plus an AAC example, 1,024 ÷ 48,000 × 90,000 = 1,920 ticks per frame.]

Figure 2. The 90 kHz clock and its arithmetic. The master clock divided by 300 gives whole-number ticks per frame at every common rate, which is the reason the standard chose 90,000.


PTS: when to show the frame

The first of the two timestamps is the presentation timestamp, written PTS. It answers the simplest possible question: at what clock reading should this access unit be presented to the viewer — drawn on screen, or pushed to the speakers? A video frame with a PTS of 90,000 should appear exactly one second after the clock's zero point, because 90,000 ticks ÷ 90,000 ticks per second = 1 second.

For audio, PTS is almost the whole story. Audio is decoded in the same order it is played — the first block of samples is also the first block you hear — so for an audio track the presentation order and the decoding order are identical. An audio access unit needs to say "play me at this moment," and that is all. This is why pure-audio streams rarely carry a separate decoding timestamp: there is nothing to reorder.

A worked example makes PTS concrete. Suppose a 48 kHz audio stream is encoded by AAC, which packs 1,024 samples into each access unit (a frame, in AAC terms). How long does one AAC frame last, and what is the PTS spacing between consecutive frames?

frame duration in seconds = samples per frame ÷ sample rate
                          = 1,024 ÷ 48,000
                          = 0.021333… seconds   (about 21.3 ms)

PTS increment per frame   = frame duration × 90,000 ticks per second
                          = 0.021333… × 90,000
                          = 1,920 ticks exactly

So a stream of AAC frames at 48 kHz has PTS values 0, 1920, 3840, 5760, and so on: each frame's label is 1,920 ticks higher than the one before. If you ever dump the PTS values of a clean 48 kHz AAC stream and they step by exactly 1,920, the audio timing is healthy. If they step by something else, or jump unevenly, you have found a bug. The 1,920 figure is specific to 48 kHz; at other sample rates, redo the arithmetic above to find the expected step.
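The same check can be automated. This is a rough sketch under stated assumptions: the PTS lists are invented sample data, and `pts_step` and `irregular_steps` are hypothetical helper names, not functions from FFmpeg or any other tool.

```python
from fractions import Fraction

MPEG_TICKS_PER_SECOND = 90_000
AAC_SAMPLES_PER_FRAME = 1_024


def pts_step(sample_rate: int, samples_per_frame: int = AAC_SAMPLES_PER_FRAME) -> Fraction:
    """Expected PTS increment per audio frame, in 90 kHz ticks (exact)."""
    return Fraction(samples_per_frame * MPEG_TICKS_PER_SECOND, sample_rate)


def irregular_steps(pts_values, expected_step):
    """Return (index, actual_step) for every gap that differs from expected_step."""
    pairs = zip(pts_values, pts_values[1:])
    return [
        (index, later - earlier)
        for index, (earlier, later) in enumerate(pairs, start=1)
        if later - earlier != expected_step
    ]


step = pts_step(48_000)
print(step)  # 1920

clean = [0, 1920, 3840, 5760]    # invented example data
broken = [0, 1920, 3850, 5760]   # third frame stamped 10 ticks late

print(irregular_steps(clean, step))   # []
print(irregular_steps(broken, step))  # [(2, 1930), (3, 1910)]
```

A single late timestamp shows up as two bad gaps, one too long and one too short, which is a handy signature when scanning a dump by eye.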


DTS: when to decode the frame — and why it differs

If presentation order and decoding order were always the same, one timestamp would be enough. They are not, and the reason is the way modern video compression works.

To compress video, encoders exploit the fact that consecutive frames look almost identical. Rather than store every frame in full, an encoder stores a few complete frames and describes the rest as differences from their neighbours. There are three frame types, and the names matter:

  • An I-frame (intra-coded) is a complete picture, decodable on its own, referencing nothing.
  • A P-frame (predicted) is described as the difference from a previous frame.
  • A B-frame (bi-directionally predicted) is described as the difference from both a previous frame and a later one.

That last type is the whole problem. A B-frame depends on a frame that comes after it in display order. The decoder cannot reconstruct the B-frame until it has already decoded the future frame the B-frame refers to. So the encoder reorders the stream: it places the future reference frame earlier in the file than the B-frame that needs it, even though that reference frame is displayed later. The data arrives in decoding order; the viewer needs presentation order. The two orders differ.

This is exactly why two timestamps exist. The decoding timestamp, written DTS, says when the decoder should process an access unit. The presentation timestamp, PTS, says when the result should be shown. In a stream with no B-frames, nothing is reordered and DTS equals PTS for every frame. Once B-frames are present, the reference frames they depend on (I-frames and P-frames) carry a DTS earlier than their PTS, because each must be decoded ahead of its display moment to be ready when the B-frames need it. The B-frames themselves are typically decoded and shown almost immediately, so their DTS and PTS sit close together or are equal.

Walk through a tiny group of pictures. Suppose the display order is I B B P: an I-frame, two B-frames, then a P-frame the B-frames depend on. Because the B-frames need the P-frame, the encoder must put the P-frame into the stream before the B-frames. The stored order on disk becomes I P B B. Now look at where each frame sits in the two orders. The numbers below are ordinal positions in frame units, not timestamp values:

Frame  Display position (PTS order)  Decode position (DTS order)  Same position in both orders?
I      0                             0                            yes
P      3                             1                            no, decoded early
B      1                             2                            no, decoded after its reference
B      2                             3                            no, decoded after its reference

Table 1. A four-frame group with display order I B B P. The decoder receives them in DTS order (I, P, B, B) and reorders them to PTS order (I, B, B, P) before display. The P-frame is the giveaway: it is decoded at position 1 but shown at position 3. In a real stream every PTS is also shifted later by the reorder delay (one frame here), so that no frame is due on screen before it has been decoded: in frame units the (DTS, PTS) pairs become I (0, 1), P (1, 4), B (2, 2), B (3, 3).

Read the table slowly. The decoder pulls frames off the stream in DTS order — I, then P, then the two Bs. It holds the P-frame in a buffer, decodes the Bs using it, then releases all four to the screen in PTS order — I, B, B, P. The DTS keeps the decoder fed in the right sequence; the PTS keeps the picture correct on screen. Without DTS the decoder would try to build a B-frame before its reference existed, and fail.

[Figure: two horizontal rows of four frames each. Top row, decoding order: I, P, B, B, at decode positions 0, 1, 2, 3. Bottom row, presentation order: I, B, B, P, at presentation positions 0, 1, 2, 3. Curved arrows connect each frame in the top row to its place in the bottom row, crossing over to show the reordering. The P-frame is highlighted: decoded second, presented last.]

Figure 1. The same four frames in decoding order (top, DTS sequence) and presentation order (bottom, PTS sequence). The crossing arrows are the reorder the decoder performs; the P-frame moving from second to last is why DTS and PTS must differ.

A clean rule to carry away: DTS ≤ PTS, always. A frame is never shown before it is decoded. For audio, and for video without B-frames, the two are equal. The gap between them only opens when B-frames force out-of-order decoding.
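To see the rule hold on the four-frame group above (display order I B B P, stored as I P B B), here is a small sketch. It is an illustration, not an encoder algorithm from any standard: the `Frame` class and `assign_timestamps` function are hypothetical, and the assumption is that DTS simply follows stream order while PTS follows display order shifted later by the smallest delay that keeps DTS ≤ PTS.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str           # "I", "P" or "B"
    display_index: int  # position in presentation order


def assign_timestamps(decode_order, ticks_per_frame):
    """Return (frame, dts, pts) tuples in ticks for frames given in stream order."""
    # The reorder delay is the largest amount by which any frame's decode
    # position trails its display position.
    delay = max(
        decode_index - frame.display_index
        for decode_index, frame in enumerate(decode_order)
    )
    delay = max(delay, 0)
    return [
        (
            frame,
            decode_index * ticks_per_frame,
            (frame.display_index + delay) * ticks_per_frame,
        )
        for decode_index, frame in enumerate(decode_order)
    ]


stream = [Frame("I", 0), Frame("P", 3), Frame("B", 1), Frame("B", 2)]
stamped = assign_timestamps(stream, ticks_per_frame=3_750)  # 24 fps on a 90 kHz clock

for frame, dts, pts in stamped:
    print(frame.kind, "DTS", dts, "PTS", pts)

# The player sorts by PTS to recover display order.
presentation = sorted(stamped, key=lambda item: item[2])
print([frame.kind for frame, _, _ in presentation])  # ['I', 'B', 'B', 'P']
```

With a one-frame delay the I-frame gets DTS 0 and PTS 3,750, the P-frame is decoded second but presented last, and both B-frames end up with DTS equal to PTS.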


Where the timestamps physically live

The two-timestamp idea is universal, but the place the numbers are stored differs by container. The two worlds you will meet are MPEG-2 systems (transport and program streams) and the ISO base media file format (MP4 and its relatives).

In MPEG-2 streams: the PES header

In a transport stream or program stream, elementary streams are first wrapped in packetized elementary stream packets — PES packets. Each PES packet has a header, and the timestamps live in that header. A 2-bit field called PTS_DTS_flags tells the decoder what is present: the value 10 means only a PTS follows; the value 11 means both a PTS and a DTS follow. The value 00 means neither — common for the continuation of a large frame split across packets.

The 33-bit timestamp is not stored as a clean 33-bit field. The MPEG-2 systems standard spreads it across 5 bytes (40 bits): a 4-bit prefix at the front, then the 33 data bits split into three groups (3 bits, 15 bits, 15 bits), each followed by a single marker bit that is always 1. That adds up to 4 + (3 + 1) + (15 + 1) + (15 + 1) = 40 bits. The prefix is 0010 when only PTS is present, 0011 for the PTS half of a PTS+DTS pair, and 0001 for the DTS half. The marker bits exist so that a decoder scanning a damaged stream can sanity-check that it is reading a real timestamp and not random bytes. You almost never decode this by hand, because tools like ffprobe do it for you, but knowing the layout explains why a "33-bit" value occupies 5 bytes on the wire.
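For readers who do want to see the bit layout in action, the sketch below packs and unpacks the 5-byte field as described above. The function names are hypothetical and the code handles only the timestamp field itself, not the rest of the PES header; treat it as a reading aid rather than a parser to ship.

```python
def encode_timestamp(prefix: int, value: int) -> bytes:
    """Pack a 4-bit prefix and a 33-bit timestamp into the 5-byte PES layout."""
    value &= (1 << 33) - 1
    byte0 = (prefix << 4) | (((value >> 30) & 0x07) << 1) | 1  # prefix, bits 32..30, marker
    byte1 = (value >> 22) & 0xFF                               # bits 29..22
    byte2 = (((value >> 15) & 0x7F) << 1) | 1                  # bits 21..15, marker
    byte3 = (value >> 7) & 0xFF                                # bits 14..7
    byte4 = ((value & 0x7F) << 1) | 1                          # bits 6..0, marker
    return bytes([byte0, byte1, byte2, byte3, byte4])


def decode_timestamp(field: bytes):
    """Return (prefix, value) from a 5-byte field, checking the three marker bits."""
    if len(field) != 5:
        raise ValueError("a PES timestamp field is exactly 5 bytes")
    if not (field[0] & 1 and field[2] & 1 and field[4] & 1):
        raise ValueError("marker bit not set; this is probably not a timestamp")
    prefix = field[0] >> 4
    value = (
        (((field[0] >> 1) & 0x07) << 30)
        | (field[1] << 22)
        | ((field[2] >> 1) << 15)
        | (field[3] << 7)
        | (field[4] >> 1)
    )
    return prefix, value


packed = encode_timestamp(0b0010, 90_000)  # PTS-only prefix, one second
print(packed.hex())                        # 210005bf21
print(decode_timestamp(packed))            # (2, 90000)
```

The marker-bit check is what lets a decoder reject random bytes: a field with any of the three markers cleared fails before the value is even assembled.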

In MP4 / ISOBMFF: tables, not headers

MP4 takes a completely different approach, and this trips up engineers who learned timestamps on transport streams first. An MP4 file does not stamp each frame with an absolute PTS and DTS. Instead it stores tables in the sample table box, and the player computes the timestamps by walking the tables.

Two boxes carry the timing. The stts box ("decoding time to sample") lists the duration of each sample. The decoder starts at time zero and adds durations to compute each sample's DTS — so DTS is implicit, derived by accumulation rather than stored directly. The ctts box ("composition time to sample") stores, for each sample, the offset between its decoding time and its composition (presentation) time. That offset is exactly PTS − DTS. A file with no B-frames has no ctts box at all, because every offset would be zero. A file with B-frames carries a ctts box whose non-zero entries are the same reorder gap we saw in Table 1, expressed per sample.
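Walking the tables is a short loop. The sketch below assumes the box contents have already been parsed into Python lists; both boxes store their entries run-length encoded as (sample count, value) pairs, which is why the helper expands them first. The function names and the sample values are made up for illustration, and edit lists are deliberately not applied here.

```python
from itertools import accumulate


def expand_runs(runs):
    """Expand [(sample_count, value), ...] into one value per sample."""
    return [value for count, value in runs for _ in range(count)]


def sample_times(stts_runs, ctts_runs=None):
    """Return a (dts, pts) pair per sample, in the track's own timescale ticks."""
    durations = expand_runs(stts_runs)
    # DTS of sample n is the sum of the durations of all samples before it.
    dts_values = [0] + list(accumulate(durations))[:-1]
    # No ctts box means every composition offset is zero.
    offsets = expand_runs(ctts_runs) if ctts_runs else [0] * len(durations)
    return [(dts, dts + offset) for dts, offset in zip(dts_values, offsets)]


# Four video samples of 3,750 ticks each, stored in decode order I, P, B, B.
stts = [(4, 3_750)]
ctts = [(1, 3_750), (1, 11_250), (2, 0)]

for dts, pts in sample_times(stts, ctts):
    print("DTS", dts, "PTS", pts)

# An audio track has no ctts box, so PTS equals DTS for every sample.
print(sample_times([(3, 1_024)]))  # [(0, 0), (1024, 1024), (2048, 2048)]
```

The video output reproduces the reorder gap directly: the two reference frames carry non-zero offsets and the two B-frames carry zero.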

There is one more box that quietly rewrites timing: the edit list, the elst box. It maps the media's internal timeline onto the presentation timeline, and it is how an MP4 expresses things like "start playback 40 milliseconds into the audio track" or "delay the video by two frames." A surprising share of real-world sync bugs come from an edit list that one tool wrote and another tool ignored. We flag this as a pitfall below.

Container       Timestamps stored as               DTS source                             PTS source         B-frame marker
MPEG-2 TS / PS  Explicit values in the PES header  Read directly (if PTS_DTS_flags = 11)  Read directly      Both PTS and DTS present
MP4 / ISOBMFF   Tables in the sample table box     Accumulated from stts durations        DTS + ctts offset  ctts box present with non-zero offsets

Table 2. Where PTS and DTS live in the two container worlds. The numbers carry the same meaning; only the storage and the retrieval method differ. The carriage details for each container are covered in Audio in Containers: How MP4, MKV, fMP4, MPEG-TS Carry Audio.

[Figure: two-column comparison of where PTS and DTS are stored. Left column, MPEG-2 TS / PS: a PES header with a PTS_DTS_flags field (10 = PTS, 11 = PTS plus DTS), a 33-bit 90 kHz PTS spread across 5 bytes, and a 33-bit 90 kHz DTS present only with B-frames; marker bits split the 33 data bits into 3 + 15 + 15. Right column, MP4 / ISOBMFF: three sample-table boxes, stts (per-sample durations that accumulate into DTS), ctts (the composition offset that produces PTS, absent without B-frames), and elst (the edit list mapping the media timeline to the presentation timeline); each track has its own timescale. Centre caption: same meaning, different storage and retrieval.]

Figure 3. The same two timestamps stored two ways: explicit values in the MPEG-2 PES header, computed tables in MP4. Reading MP4 timing means walking the stts, ctts, and edit-list boxes against the track's timescale.


The timescale: MP4's flexible clock

The 90 kHz clock is fixed in MPEG-2, but MP4 lets each track declare its own clock rate, called the timescale. The timescale is the number of ticks per second for that track, written into the track's media header. An audio track sampled at 48 kHz commonly uses a timescale of 48,000, so one tick equals one audio sample and durations are whole numbers. A video track might use 30,000: a sample duration of 1,000 then expresses exactly 30 fps, and the broadcast-friendly sample duration of 1,001 expresses the 29.97 fps rate (30,000 ÷ 1,001) cleanly.

The timescale is why you cannot read an MP4 timestamp without first reading the track's clock. A stts duration of 1,024 means 1,024 ticks, and what that is in seconds depends entirely on the timescale. At a timescale of 48,000, 1,024 ticks is the 21.3-millisecond AAC frame from earlier; at a timescale of 90,000 the same number would be only about 11.4 milliseconds. Always divide by the timescale to get seconds:

time in seconds = tick value ÷ timescale

example: stts duration 1,024 at timescale 48,000
       = 1,024 ÷ 48,000
       = 0.021333… seconds  (the 21.3 ms AAC frame)

This per-track flexibility is powerful and is the source of a second common mistake: assuming the video and audio tracks share a timescale. They usually do not, and any code that compares a raw video tick to a raw audio tick without first converting both to seconds will compute nonsense.
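A minimal sketch of the conversion, assuming nothing beyond the formula above. The helper names `to_seconds` and `rescale` are hypothetical; exact fractions are used so that two tracks can be compared without floating-point noise.

```python
from fractions import Fraction


def to_seconds(ticks: int, timescale: int) -> Fraction:
    """Convert a tick count to exact seconds using the track's timescale."""
    return Fraction(ticks, timescale)


def rescale(ticks: int, from_timescale: int, to_timescale: int) -> int:
    """Express a tick count on another clock, rounded to the nearest tick."""
    return round(Fraction(ticks * to_timescale, from_timescale))


# The AAC frame from earlier: 1,024 ticks on a 48,000 timescale.
print(f"{float(to_seconds(1_024, 48_000)):.6f} s")  # 0.021333 s
print(rescale(1_024, 48_000, 90_000))               # 1920

# Raw ticks from different tracks are not comparable.
audio_ticks, audio_timescale = 48_000, 48_000
video_ticks, video_timescale = 30_000, 30_000
print(audio_ticks == video_ticks)  # False, yet both mean one second
print(to_seconds(audio_ticks, audio_timescale) == to_seconds(video_ticks, video_timescale))  # True
```

Rescaling 1,024 audio ticks to the 90 kHz clock gives 1,920, the same PTS increment derived in the AAC worked example.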


Common mistakes that cause real sync bugs

Timestamps are simple arithmetic, but the failure modes are specific and recur across almost every team that touches media. Four are worth naming.

Treating PTS as wall-clock time. The most common conceptual error. A PTS of 90,000 does not mean "9 a.m." or "90 seconds after the user pressed play." It is a count on the stream's own clock, whose zero point is arbitrary — many encoders start the first PTS at some non-zero value. PTS tells you the relationship between frames, not the time of day. To map it to wall-clock time you need an external anchor (the program clock reference in broadcast, or the RTCP sender report in WebRTC), which is a separate mechanism covered in RTP Timestamps, RTCP Sender Reports, and NTP Synchronization.

Ignoring the 26.5-hour wraparound. Because the 33-bit clock rolls over after about 26.5 hours, a naive comparison if (pts_a > pts_b) breaks the instant the counter wraps: a frame just after the wrap has a tiny PTS, yet it is later than a frame just before with a huge PTS. Long-running channels and recordings must detect the wrap and account for it, or they will reorder or drop frames around the rollover point.

Dropping the ctts box or the edit list when remuxing. Tools that copy a stream from one container to another sometimes preserve the stts durations but discard the ctts composition offsets or the elst edit list. The result is video whose B-frames are now presented in decoding order — every other frame visibly out of place — or audio that is offset by the exact amount the edit list was meant to correct. When a remux introduces a sync error that was not in the source, a discarded ctts or elst is the first suspect.

Comparing ticks across mismatched clocks. As noted above, an audio track at timescale 48,000 and a video track at timescale 30,000 cannot have their raw tick values compared. Convert both to seconds first. Skipping the conversion produces a sync offset that scales with playback position — small at the start, large near the end — which is the signature of a timescale mismatch rather than a fixed delay.
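The wraparound mistake in particular has a compact fix. One common approach, sketched here with hypothetical function names, is to compare timestamps modulo 2^33 and assume the two readings are really less than half a wrap period (about 13 hours) apart. That assumption is mine for the sake of the example; a production system would also cross-check against the PCR.

```python
PTS_MODULUS = 1 << 33          # a 33-bit counter wraps at 2^33
HALF_RANGE = PTS_MODULUS // 2


def pts_diff(later: int, earlier: int) -> int:
    """Signed tick difference later - earlier, correct across one wrap."""
    diff = (later - earlier) % PTS_MODULUS
    return diff - PTS_MODULUS if diff >= HALF_RANGE else diff


def unwrap(pts_values):
    """Extend wrapped 33-bit readings onto a timeline that keeps growing."""
    unwrapped = []
    for pts in pts_values:
        if not unwrapped:
            unwrapped.append(pts)
        else:
            previous = unwrapped[-1]
            unwrapped.append(previous + pts_diff(pts, previous % PTS_MODULUS))
    return unwrapped


before_wrap = PTS_MODULUS - 1_000  # just before the rollover
after_wrap = 2_000                 # just after it

print(after_wrap > before_wrap)           # False: the naive comparison is wrong
print(pts_diff(after_wrap, before_wrap))  # 3000: the frame is 3,000 ticks later
print(unwrap([before_wrap, after_wrap, 5_000]))
```

Once the values are unwrapped, ordinary comparisons and subtraction work again, so the wrap handling can live in one place at the edge of the pipeline.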


A worked example: checking sync in a real file

Put the pieces together. Suppose ffprobe reports a video track at timescale 90,000 and an audio track at timescale 48,000, and you want to know whether the first audio frame and the first video frame are meant to play together.

Video: first frame DTS = 0,     PTS = 3,750 ticks  (timescale 90,000)
Audio: first frame PTS = 0,     no DTS              (timescale 48,000)

Video first-frame PTS in seconds = 3,750 ÷ 90,000 = 0.041667 s  (≈ 41.7 ms)
Audio first-frame PTS in seconds = 0     ÷ 48,000 = 0.000000 s

The video's first presented frame is 41.7 milliseconds after the audio's. The DTS of 0 with a PTS of 3,750 tells you this video has B-frames: the first frame in decoding order is a reference frame shown one frame later (3,750 ticks = exactly one frame at 90,000 ÷ 24 = 3,750, so this is 24 fps content). An edit list might intentionally trim that 41.7 ms to bring the two to zero — which is why you check the elst box before concluding the file is out of sync. The offset here is well inside the lip-sync tolerance window anyway; see What "In Sync" Means: ITU-R BT.1359, Lip-Sync Windows, Perceptibility for why 41.7 ms of video lag is imperceptible.

The point of the exercise is the method, not the numbers: read each track's timescale, convert every tick to seconds, account for DTS-versus-PTS reordering, then check the edit list before you blame the encoder. That sequence finds almost every recorded-media sync bug.
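The method can be written down as a few lines of arithmetic. This sketch uses the numbers from the example above; the function name is hypothetical, and the edit list is simplified to a single edit whose start offset (`media_time`, in the track's own timescale) is subtracted from the track's PTS values. Real files can carry several edits per track, which this does not handle.

```python
from fractions import Fraction


def first_frame_offset_ms(video_pts, video_timescale, audio_pts, audio_timescale,
                          video_media_time=0, audio_media_time=0):
    """Video start minus audio start, in milliseconds, after a simple edit-list trim."""
    video_start = Fraction(video_pts - video_media_time, video_timescale)
    audio_start = Fraction(audio_pts - audio_media_time, audio_timescale)
    return float((video_start - audio_start) * 1_000)


# Without an edit list: video first-frame PTS 3,750 at 90,000, audio 0 at 48,000.
offset = first_frame_offset_ms(3_750, 90_000, 0, 48_000)
print(f"{offset:.1f} ms")  # 41.7 ms

# With an edit list that starts the video at media time 3,750.
trimmed = first_frame_offset_ms(3_750, 90_000, 0, 48_000, video_media_time=3_750)
print(f"{trimmed:.1f} ms")  # 0.0 ms
```

The second call shows why the edit list has to be read before drawing conclusions: the same sample tables describe a file that starts perfectly aligned.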


Where Fora Soft fits in

We have built recording, transcoding, and streaming features across video conferencing, OTT, e-learning, telemedicine, and surveillance since 2005, and timestamp handling is where recorded-media sync is won or lost. The recurring real-world problem is not exotic: a recording pipeline that muxes a WebRTC call into MP4 has to translate RTP timestamps into stts durations and ctts offsets correctly, and a single dropped edit list or mismatched timescale shows up later as audio drifting away from video over a long recording. When a client reports that a saved session is out of sync while the live call was fine, the timestamp translation step is the first place we look, because that is the seam where two clock systems meet.


References

  1. ISO/IEC 13818-1:2022, "Information technology — Generic coding of moving pictures and associated audio information — Part 1: Systems" (MPEG-2 Systems). Defines the PES packet, the PTS_DTS_flags field, the 33-bit PTS and DTS at 90 kHz, the 5-byte timestamp encoding with marker bits, and the 27 MHz system clock. The controlling standard for transport-stream and program-stream timing. https://www.iso.org/standard/83239.html Tier 1 (official ISO/IEC standard).
  2. ISO/IEC 14496-12:2022, "Information technology — Coding of audio-visual objects — Part 12: ISO base media file format" (ISOBMFF). Defines the stts (decoding time to sample), ctts (composition time to sample), and elst (edit list) boxes, the per-track timescale in the media header, and the rule that composition time = decoding time + composition offset. The controlling standard for MP4 timing. https://www.iso.org/standard/83102.html Tier 1 (official ISO/IEC standard).
  3. ITU-T H.222.0 (2021), "Generic coding of moving pictures and associated audio information: Systems." The ITU-T twin of ISO/IEC 13818-1; identical normative content for PES headers, PTS/DTS encoding, and the 90 kHz/27 MHz clocks. https://www.itu.int/rec/T-REC-H.222.0 Tier 1 (official ITU-T Recommendation; twin text to Ref. 1).
  4. ITU-T H.264 / ISO/IEC 14496-10, "Advanced video coding for generic audiovisual services." Defines I-, P-, and B-frame (bi-predictive) coding and the reference-frame dependencies that force decoding order to differ from display order, the root cause of DTS ≠ PTS. https://www.itu.int/rec/T-REC-H.264 Tier 1 (official ITU-T Recommendation).
  5. ISO/IEC 14496-3:2019, "Coding of audio-visual objects — Part 3: Audio" (AAC). Source for the 1,024-samples-per-AAC-frame fact used in the PTS-increment worked example. https://www.iso.org/standard/76383.html Tier 1 (official ISO/IEC standard).
  6. FFmpeg documentation: ffprobe and timestamps. Reference implementation behaviour for reading PTS, DTS, and timescale from real files; the practical tool the article assumes the reader uses. https://ffmpeg.org/ffprobe.html Tier 2 (reference-implementation documentation; subordinate to the standards above for any normative claim).
  7. MPEG / Leonardo Chiariglione, "An Overview of the ISO Base Media File Format." First-party MPEG explainer of the box hierarchy and the timing model, used for orientation and confirmed against the normative text of Ref. 2. https://mpeg.chiariglione.org/standards/mpeg-4/iso-base-media-file-format Tier 3 (first-party explainer; subordinate to Ref. 2).
  8. Apple, "QuickTime File Format Specification": Edit Lists and Sample Tables. Practitioner-grade description of stts, ctts, and edit-list behaviour as implemented in the format MP4 derives from; used for the edit-list pitfall, cross-checked against Ref. 2. https://developer.apple.com/documentation/quicktime-file-format Tier 4 (vendor documentation; the article defers to ISO/IEC 14496-12 where they differ).

Note on a corrected oversimplification: many online explainers state that "PTS is the time the frame plays" and stop there, implying PTS is wall-clock time. This article follows ISO/IEC 13818-1, in which PTS is a count on the stream's own 90 kHz clock with an arbitrary zero point; mapping it to wall-clock time requires a separate reference clock (PCR or RTCP), and the article flags the wall-clock framing as the single most common conceptual error.