Why this matters

If you build video products — conferencing, streaming, OTT, surveillance, telemedicine — the question "why is the audio out of sync?" lands on your desk eventually, and the honest answer is almost never "one bug". It is a clock somewhere in a chain of seven clocks that handed off a wrong or missing timestamp. This article is the map. A product manager can read it to understand what their engineers are arguing about; an engineer can use it to find which hop is dropping the ball. The whole point of Block 5 is that nobody else on the internet draws this diagram in full — so we did.

[Diagram: a horizontal end-to-end pipeline showing audio and video flowing left to right through seven labelled stages: capture, encoder, muxer or packetizer, network, demuxer or jitter buffer, decoder, and renderer. Each stage sits inside one of four colour-coded clock domains: the capture sample clock, the encoder media clock, the transport clock (27 MHz program clock for broadcast or NTP wall clock for WebRTC), and the renderer playback clock. Arrows mark where each timestamp is written and where drift accumulates between clock domains, with correction points highlighted at the jitter buffer and the renderer.]

Figure 1. The full capture-to-render pipeline. Every box lives in a clock domain; every arrow that crosses a domain boundary is a place sync can break.

The core idea: a timestamp is a tick count, not a time of day

Before any diagram makes sense, fix one idea in your head. A media timestamp does not tell you what time it is. It tells you how many ticks of some clock have passed since that clock's own starting point. The starting point is often a random number chosen at stream startup. The clock runs at a fixed rate — say 48,000 ticks per second for Opus audio, or 90,000 ticks per second for video.

So when a packet says its timestamp is 1,440,000, that is not 1:44 in the afternoon. For a 48 kHz audio clock it means "1,440,000 sampling instants have elapsed since this stream's zero", which is exactly 30 seconds of audio (1,440,000 ÷ 48,000 = 30). The number is meaningless until you know two things: the clock rate, and what real-world instant the stream's zero corresponds to.
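The conversion in the paragraph above can be sketched in a few lines of Python. The function name and the optional `start_ticks` parameter are illustrative assumptions, not part of any library; the point is only that the same tick count means different durations on different clocks.

```python
def ticks_to_seconds(ticks: int, clock_rate_hz: int, start_ticks: int = 0) -> float:
    """Elapsed seconds since the stream's zero, given a tick count and a clock rate.

    start_ticks is the stream's (often random) starting value; 0 is assumed here.
    """
    return (ticks - start_ticks) / clock_rate_hz


# The example from the text: 1,440,000 ticks on a 48 kHz audio clock.
print(ticks_to_seconds(1_440_000, 48_000))  # 30.0

# The same number on a 90 kHz video clock is a different duration.
print(ticks_to_seconds(1_440_000, 90_000))  # 16.0
```

Note that the result is still only "seconds since this stream's zero". Turning it into a real-world instant needs the anchor discussed next.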

That second fact — anchoring the tick count to a real instant — is the entire job of synchronization. Audio and video are two separate tick counters with two different rates and two different random starts. Lip-sync is the act of placing both counters on one shared real-world timeline so a sample and a frame that happened at the same instant get played at the same instant.

The seven stages, and the clock in each one

Walk the pipeline left to right. At every stage, ask one question: which clock is in charge here, and what timestamp does it write?

Stage 1 — Capture: the sample clock is born

A microphone produces a continuous voltage. An analog-to-digital converter measures that voltage at a fixed rate — 48,000 times per second is the standard for video work. That rate, called the sample rate, is governed by a tiny crystal oscillator on the capture hardware. This is the first clock in the chain, and it is the master timekeeper for the audio: every later stage inherits its sense of "how much audio time has passed" from how many samples this converter produced.

The camera has its own capture clock, ticking at the frame rate — 30 or 60 frames per second. Already, at stage one, you have two independent clocks: an audio sample clock and a video frame clock. They are not the same crystal, and they will not stay perfectly aligned. Hold that thought; it is the seed of all drift.

Stage 2 — Encoder: ticks become media timestamps

The encoder (Opus, AAC, H.264, AV1) takes raw samples or frames and compresses them into packets. As it does, it stamps each packet with a media timestamp: a tick count on the media clock. For audio the media clock usually equals the sample rate (48 kHz for Opus). For video, by long-standing convention, the media clock runs at 90,000 Hz regardless of frame rate, because every common frame rate divides 90,000 into a whole number of ticks per frame (90,000 ÷ 30 = 3,000 ticks per frame; ÷ 25 = 3,600; ÷ 24 = 3,750).
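As a quick check of the divisions above, here is a small Python sketch using exact fractions. The 60 fps and 30000/1001 (NTSC "29.97") rows are additions for illustration and are not taken from the text; the helper name is hypothetical.

```python
from fractions import Fraction

VIDEO_CLOCK_HZ = 90_000  # conventional video media clock


def ticks_per_frame(frame_rate: Fraction) -> Fraction:
    """Number of 90 kHz ticks between two consecutive frames."""
    return Fraction(VIDEO_CLOCK_HZ) / frame_rate


rates = {
    "24": Fraction(24),
    "25": Fraction(25),
    "30": Fraction(30),
    "60": Fraction(60),
    "30000/1001": Fraction(30000, 1001),
}
for name, rate in rates.items():
    print(f"{name} fps -> {ticks_per_frame(rate)} ticks per frame")
```

Every row comes out as a whole number (3,750, 3,600, 3,000, 1,500 and 3,003), which is why frame timestamps on a 90 kHz clock never need rounding for these rates.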

This is also where a subtle complication enters for video: the order packets are decoded is not always the order they are displayed. Modern video uses bidirectional frames (B-frames) that depend on a future frame, so the encoder writes two timestamps per video packet — a Decode Time Stamp (DTS) saying "decode me now" and a Presentation Time Stamp (PTS) saying "show me at this instant". Audio has no such reordering, so audio packets carry only a presentation timestamp. The full story of PTS and DTS lives in its own article; here, just note that the encoder is where both are born.
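To make the decode-order versus display-order split concrete, here is a toy Python sketch. The four-frame group, the one-frame reorder delay, and the specific timestamp values are invented for illustration (30 fps on a 90 kHz clock, so 3,000 ticks per frame); real encoders choose their own offsets.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    dts: int  # decode timestamp, 90 kHz ticks
    pts: int  # presentation timestamp, 90 kHz ticks


# Packets as they leave the encoder, in decode order.
# P3 is decoded before B1 and B2 because both B-frames reference it.
decode_order = [
    Packet("I0", dts=0, pts=3000),
    Packet("P3", dts=3000, pts=12000),
    Packet("B1", dts=6000, pts=6000),
    Packet("B2", dts=9000, pts=9000),
]


def presentation_order(packets):
    """Return the packets sorted into display order by PTS."""
    return sorted(packets, key=lambda p: p.pts)


print([p.name for p in decode_order])                      # ['I0', 'P3', 'B1', 'B2']
print([p.name for p in presentation_order(decode_order)])  # ['I0', 'B1', 'B2', 'P3']
```

In this sketch every packet has a PTS at or after its DTS, since a frame cannot be shown before it has been decoded.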

Stage 3 — Muxer or packetizer: the transport clock takes over

Now the streams must be combined for transport, and this is where the broadcast world and the real-time world split into two completely different designs.

In the broadcast and file world (MPEG-TS, MP4), a muxer interleaves audio and video packets into one container and references every PTS and DTS to a single master clock called the System Time Clock (STC). For MPEG-TS this clock runs at 27 MHz, and the muxer periodically inserts a sample of it — the Program Clock Reference (PCR) — into the stream so the receiver can rebuild the exact same clock. The 27 MHz STC is split internally: a 33-bit counter at 90 kHz (the same 90 kHz the PTS/DTS use) plus a 9-bit extension that counts the leftover ticks modulo 300, since 27,000,000 ÷ 90,000 = 300. One master clock governs the whole program.
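The base-plus-extension split described above is plain integer arithmetic. The sketch below uses hypothetical helper names and works on an abstract 27 MHz tick count; it does not parse real transport-stream packets.

```python
from typing import Tuple

PCR_BASE_BITS = 33   # width of the 90 kHz base counter
TICKS_PER_BASE = 300  # 27,000,000 / 90,000


def split_pcr(stc_ticks_27mhz: int) -> Tuple[int, int]:
    """Split a 27 MHz tick count into (90 kHz base, 0..299 extension)."""
    base = (stc_ticks_27mhz // TICKS_PER_BASE) % (1 << PCR_BASE_BITS)
    extension = stc_ticks_27mhz % TICKS_PER_BASE
    return base, extension


def join_pcr(base: int, extension: int) -> int:
    """Rebuild the 27 MHz tick count from base and extension."""
    return base * TICKS_PER_BASE + extension


# One second plus 123 leftover 27 MHz ticks.
base, ext = split_pcr(27_000_123)
print(base, ext)            # 90000 123
print(join_pcr(base, ext))  # 27000123
```

The base part is directly comparable with PTS and DTS values because all three count 90 kHz ticks.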

In the real-time world (WebRTC), there is no shared muxer clock. Audio and video are sent as two independent RTP streams, each with its own media timestamp and its own random start. The packetizer just wraps each media packet in an RTP header and sends it. The shared timeline is established separately, by the control protocol, which we reach in stage 5.

Stage 4 — Network: the stage with no clock at all

The network does not have a clock. It has delay, and the delay is not constant. One packet crosses in 20 milliseconds; the next takes 45; a third takes 18. This variation in arrival time is called jitter, and it is the reason the pipeline cannot simply play packets the moment they arrive. The network also reorders packets and loses some entirely.
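One common way to put a number on jitter is the running estimator from RFC 3550 (§6.4.1), which smooths the change in transit time between consecutive packets with a gain of 1/16. The sketch below applies it to the three delays from the paragraph above, assuming packets are sent every 20 ms. It works in milliseconds for readability, whereas RTP implementations compute it in timestamp units; the function name is hypothetical.

```python
def interarrival_jitter(send_ms, arrival_ms):
    """Running jitter estimate: J += (|D| - J) / 16 for each packet pair."""
    jitter = 0.0
    prev_transit = None
    for sent, arrived in zip(send_ms, arrival_ms):
        # Any constant offset between sender and receiver clocks cancels
        # out below, because only the change in transit time is used.
        transit = arrived - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


send = [0, 20, 40]     # packets sent every 20 ms
arrive = [20, 65, 58]  # transit times of 20, 45 and 18 ms
print(interarrival_jitter(send, arrive))
```

Note that the third packet arrives before the second one here, which is the reordering the paragraph mentions.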

Crucially, the network does not touch timestamps. The timestamp a packet carries out of the muxer is the same one it carries into the demuxer. That is the whole genius of timestamping: by writing the timing at the source and reading it at the sink, the unreliable middle is made irrelevant to timing. The network can shuffle and delay packets all it likes; the timestamps still say what happened when.

Stage 5 — Demuxer and jitter buffer: rebuild the clock, absorb the jitter

The receiver now has to undo the network's damage and recover the sender's clock. Two things happen here, and this is the first correction point in the pipeline.

First, clock recovery. In broadcast, the receiver feeds the incoming PCR samples into a phase-locked loop (PLL) — a circuit that steers a local 27 MHz oscillator until it matches the sender's. The PCR arrives roughly every 40 milliseconds or better, and the standard allows it to wobble by no more than ±500 nanoseconds of jitter (ISO/IEC 13818-1). The PLL averages out that wobble and produces a smooth, faithful copy of the sender's clock. In WebRTC, there is no PCR; instead the receiver waits for an RTCP Sender Report — a small control packet that carries a matched pair: one reading of the stream's RTP timestamp alongside one reading of the sender's NTP wall clock (RFC 3550, §6.4.1). That pair is the anchor. One pair per stream lets the receiver convert any RTP timestamp into wall-clock time, and once both audio and video are expressed in the same wall clock, they share a timeline. That is lip-sync, recovered.
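The Sender Report mapping is a one-line linear conversion. The sketch below is a simplified illustration: the `SenderReport` class, the field names, and all numeric values are invented, NTP time is represented as plain float seconds, and 32-bit RTP timestamp wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class SenderReport:
    ntp_seconds: float  # sender wall clock at the instant of the report
    rtp_timestamp: int  # the same instant on this stream's media clock
    clock_rate_hz: int  # media clock rate (48,000 audio, 90,000 video)


def rtp_to_wallclock(rtp_ts: int, sr: SenderReport) -> float:
    """Convert an RTP timestamp to sender wall-clock seconds using one SR pair."""
    return sr.ntp_seconds + (rtp_ts - sr.rtp_timestamp) / sr.clock_rate_hz


# Two independent streams with unrelated starting values.
audio_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=500_000, clock_rate_hz=48_000)
video_sr = SenderReport(ntp_seconds=1000.5, rtp_timestamp=9_000_000, clock_rate_hz=90_000)

audio_time = rtp_to_wallclock(548_000, audio_sr)    # 1.0 s after the audio SR
video_time = rtp_to_wallclock(9_045_000, video_sr)  # 0.5 s after the video SR

print(audio_time, video_time)   # 1001.0 1001.0
print(audio_time - video_time)  # 0.0, so this sample and frame belong together
```

The raw RTP numbers (548,000 and 9,045,000) have nothing in common; only after both are mapped to the wall clock does it become visible that they describe the same instant.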

Second, the jitter buffer. Recovered timestamps tell the receiver when each packet should play, but the packets arrived at uneven times. The jitter buffer is a short holding queue, the airport waiting area for packets. It deliberately delays playback by a small margin (typically 30–200 ms) so that late packets have a chance to arrive before their scheduled play time. The buffer trades a little latency for smooth, in-order, on-time playback. In WebRTC's reference implementation the audio jitter buffer is NetEq, and it is sophisticated enough to stretch or compress audio slightly to keep the buffer from emptying or overflowing.
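The holding-queue idea can be sketched as a fixed-delay buffer. This is a deliberately minimal toy, not how NetEQ works: the class, its method names, and the 60 ms delay are assumptions, there is no adaptation or time-stretching, and times are plain milliseconds.

```python
import heapq


class JitterBuffer:
    """Toy fixed-delay jitter buffer ordered by media timestamp."""

    def __init__(self, playout_delay_ms: int):
        self.playout_delay_ms = playout_delay_ms
        self._heap = []    # (media_ms, payload), smallest media time first
        self._base = None  # (arrival_ms, media_ms) of the first packet seen

    def play_time(self, media_ms: int) -> int:
        """Local time at which a packet with this media time should play."""
        arrival0, media0 = self._base
        return arrival0 + self.playout_delay_ms + (media_ms - media0)

    def push(self, media_ms: int, arrival_ms: int, payload: str) -> bool:
        """Queue a packet; return False if it arrived after its play time."""
        if self._base is None:
            self._base = (arrival_ms, media_ms)
        if arrival_ms > self.play_time(media_ms):
            return False  # too late, dropped
        heapq.heappush(self._heap, (media_ms, payload))
        return True

    def pop_due(self, now_ms: int) -> list:
        """Release every queued packet whose play time has been reached."""
        due = []
        while self._heap and self.play_time(self._heap[0][0]) <= now_ms:
            due.append(heapq.heappop(self._heap)[1])
        return due


buf = JitterBuffer(playout_delay_ms=60)
buf.push(media_ms=0, arrival_ms=20, payload="A")
buf.push(media_ms=40, arrival_ms=58, payload="C")  # overtook B on the network
buf.push(media_ms=20, arrival_ms=65, payload="B")
print(buf.pop_due(80), buf.pop_due(100), buf.pop_due(120))  # ['A'] ['B'] ['C']
```

Packets that arrived out of order and 45 ms apart leave the buffer in order and exactly 20 ms apart, at the cost of 60 ms of added latency.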

Stage 6 — Decoder: timestamps ride along untouched

The decoder turns compressed packets back into raw samples and frames. It honors the DTS (decode this now) and preserves the PTS (this sample belongs at this instant) so the renderer knows where to place the output. The decoder does not invent or alter timing; it carries the PTS through. If a video packet was decoded out of display order because of B-frames, the decoder reorders the output back into presentation order using the PTS. The timing contract written at the encoder is finally cashed in here.

Stage 7 — Renderer: the playback clock, and the second correction point

The renderer is the sound card's digital-to-analog converter and the display's refresh. It has its own clock, the playback sample clock, and here is the catch that causes most long-call drift: the renderer's clock is a different crystal from the capture clock at the far end. If the capture side runs at 48,000.5 Hz and the playback side runs at 47,999.5 Hz, that 1 Hz gap (about 21 parts per million) means the renderer slowly falls behind the incoming audio; with the signs reversed it would race ahead. Over a one-hour call, even a modest 50 parts-per-million mismatch accumulates to roughly 180 milliseconds, well past the point where lip-sync becomes visibly wrong.

So the renderer is the second correction point. A well-built renderer continuously measures how full its buffer is and applies tiny resampling — adding or removing a fraction of a sample here and there — to keep the playback clock locked to the recovered media clock. The correction is so small it is inaudible. A badly-built renderer skips this and instead drops or repeats whole frames when the buffer runs dry or overflows, which the listener hears as a click or a stutter. The difference between a product that sounds professional and one that stutters is almost entirely in how this last stage handles the clock mismatch.
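One simple way to turn "measure the buffer, nudge the playback rate" into code is a proportional controller. The sketch below is an assumption-laden illustration, not a production design: the function name, the gain of 5 ppm per millisecond of error, and the ±500 ppm clamp are all invented values, and a real renderer would feed the resulting ratio into an actual resampler.

```python
def resample_ratio(
    buffer_fill_ms: float,
    target_fill_ms: float,
    gain_ppm_per_ms: float = 5.0,  # hypothetical tuning value
    max_ppm: float = 500.0,        # hypothetical safety clamp
) -> float:
    """Playback-rate multiplier that steers the buffer toward its target level.

    A ratio above 1.0 consumes input slightly faster (buffer too full);
    a ratio below 1.0 consumes it slightly slower (buffer running low).
    """
    error_ms = buffer_fill_ms - target_fill_ms
    correction_ppm = max(-max_ppm, min(max_ppm, gain_ppm_per_ms * error_ms))
    return 1.0 + correction_ppm / 1_000_000


print(resample_ratio(60, 60))  # 1.0, buffer on target, no correction
print(resample_ratio(70, 60))  # slightly above 1.0, drain the surplus
print(resample_ratio(50, 60))  # slightly below 1.0, let the buffer refill
```

The clamp keeps each correction to a tiny fraction of a percent, which is what keeps this kind of adjustment small compared with dropping or repeating whole frames.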

The two worlds side by side: 27 MHz program clock vs NTP wall clock

The single most useful thing to internalize is that there are two distinct synchronization philosophies, and which one you are in changes where the anchor lives.

| Aspect | Broadcast / file (MPEG-TS, MP4) | Real-time (WebRTC, RTP) |
|---|---|---|
| Master clock | One System Time Clock at 27 MHz | None shared; each stream independent |
| Anchor carried in-stream | PCR (sample of the 27 MHz STC) | RTCP Sender Report (NTP + RTP pair) |
| Clock-recovery mechanism | Phase-locked loop on PCR | Linear map from one SR pair |
| Audio/video timestamps | PTS/DTS at 90 kHz, both off the STC | Separate RTP timestamps, separate rates |
| Anchor cadence | PCR every ≤40 ms | SR every few seconds |
| Lip-sync established by | Shared STC means PTS are directly comparable | Converting both RTP clocks to NTP wall time |
| Jitter spec | PCR jitter ≤ ±500 ns (ISO/IEC 13818-1) | No spec; jitter buffer absorbs it |

Read the table as one sentence per row. In broadcast, one clock rules everything, so two timestamps off that clock are comparable by subtraction. In WebRTC, two independent clocks must each be tied to a neutral third reference — the NTP wall clock — before they can be compared at all. Both designs solve the same problem; they just put the shared reference in a different place.

A worked example: how much does drift accumulate?

Drift is the failure mode people underestimate, so let us put real arithmetic on it. Suppose the capture device's audio clock and the playback device's audio clock differ by 50 parts per million — a realistic figure for two consumer devices with ordinary crystals.

Step one, what is 50 ppm as a fraction? 50 parts per million = 50 ÷ 1,000,000 = 0.00005.

Step two, how much time does that gain or lose per second? 0.00005 × 1 second = 0.00005 seconds = 0.05 milliseconds per second.

Step three, accumulate over a one-hour call: 0.05 ms/second × 3,600 seconds = 180 milliseconds.

Now compare 180 ms against the lip-sync tolerance the human eye actually enforces. ITU-R BT.1359-1 puts the threshold of acceptability at audio leading video by 90 ms or lagging by 185 ms; beyond that, sync is judged unacceptable. So by the end of a one-hour call, an uncorrected 50 ppm mismatch is already twice the limit if the audio ends up leading, and within 5 ms of the limit if it ends up lagging. That is why the renderer's continuous-resampling correction is not a nicety: it is the thing standing between your product and a help-desk ticket.
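The same arithmetic, written as a short Python sketch so other ppm values and call lengths can be tried. The function names and the sign convention (positive offset means audio leads) are assumptions made for illustration; the 90 ms and 185 ms limits are the ITU-R BT.1359-1 acceptability figures quoted above.

```python
AUDIO_LEAD_LIMIT_MS = 90   # audio ahead of video
AUDIO_LAG_LIMIT_MS = 185   # audio behind video


def drift_ms(ppm: float, seconds: float) -> float:
    """Accumulated drift in milliseconds for a given clock mismatch."""
    return ppm / 1_000_000 * seconds * 1000


def is_acceptable(offset_ms: float) -> bool:
    """True if the offset is inside the acceptability window (positive = audio leads)."""
    return -AUDIO_LAG_LIMIT_MS <= offset_ms <= AUDIO_LEAD_LIMIT_MS


def seconds_until(limit_ms: float, ppm: float) -> float:
    """How long an uncorrected mismatch takes to accumulate limit_ms of drift."""
    return limit_ms / (ppm / 1000)  # ppm / 1000 is the drift in ms per second


one_hour = drift_ms(50, 3600)
print(round(one_hour, 3))                         # 180.0
print(is_acceptable(one_hour))                    # False, audio leading by 180 ms
print(is_acceptable(-one_hour))                   # True, audio lagging by 180 ms
print(round(seconds_until(90, 50) / 60), "min")   # 30 min to reach the 90 ms limit
```

At 50 ppm the stricter of the two limits is reached about half an hour into the call, not at the end of it.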

The common mistakes that break sync

The same handful of errors account for most production sync bugs, and every one is a clock handed the wrong timestamp.

The "PTS is wall-clock time" mistake. Engineers new to media assume a presentation timestamp is a time of day they can compare to now(). It is not — it is a tick count on a clock whose zero is arbitrary. Comparing a PTS to system time without first establishing the stream's anchor produces nonsense.

The missing or stale anchor. In WebRTC, if the first RTCP Sender Report is lost and the receiver does not wait for the next one, both streams play smoothly on their own but never share a timeline: they are permanently, silently out of sync. The fix is to hold back synchronized playout until each stream has received at least one Sender Report, so that both are tied to the wall clock. RFC 6051 and the WebRTC abs-capture-time header extension address the slow-lock version of this by carrying the mapping in-band.

The regenerating middlebox. A relay or SFU that rewrites timestamps onto its own clock, rather than forwarding them untouched, destroys the source timing. A correctly built SFU forwards RTP and RTCP timestamps verbatim; it is a pipe, not an editor.

The ignored clock mismatch. A renderer that plays at its own clock rate and never resamples against the recovered media clock will drift exactly as the arithmetic above shows. The symptom is the giveaway: sync that is fine at the start of a call and visibly wrong twenty minutes in is almost always a drift problem, not a one-off bug.

Where Fora Soft fits in

We have shipped the audio and video timing path in conferencing platforms, OTT and Internet-TV services, e-learning systems, telemedicine apps, and video surveillance products since 2005. The timestamp pipeline in this diagram is the layer we debug when a client reports "the audio is off" — and the cause is almost always one identifiable hop, not a vague quality problem. In real-time products we trace the RTP and RTCP path; in streaming and OTT products we trace PTS, DTS, and PCR through the container. Drawing the whole pipeline first, the way this article does, is how we find the broken clock quickly instead of guessing.

References

  1. IETF RFC 3550, "RTP: A Transport Protocol for Real-Time Applications", §5.1 (RTP timestamp) and §6.4.1 (Sender Report sender info, NTP/RTP pair), July 2003. https://www.rfc-editor.org/rfc/rfc3550.html — primary source for the RTP media timestamp and the RTCP Sender Report NTP/RTP anchor pair.
  2. IETF RFC 3551, "RTP Profile for Audio and Video Conferences with Minimal Control", §6 (payload type clock rates), July 2003. https://www.rfc-editor.org/rfc/rfc3551.html — establishes the 90,000 Hz video clock and 8,000 Hz G.711 clock conventions.
  3. IETF RFC 7587, "RTP Payload Format for the Opus Speech and Audio Codec", §4.1, June 2015. https://www.rfc-editor.org/rfc/rfc7587.html — Opus uses a 48,000 Hz RTP clock regardless of internal sample rate.
  4. ISO/IEC 13818-1 (MPEG-2 Systems), clauses on the System Time Clock (27 MHz), Program Clock Reference (33-bit 90 kHz base + 9-bit modulo-300 extension), PTS/DTS, and the System Target Decoder model; PCR accuracy ±500 ns. Current edition. https://www.iso.org/standard/75928.html — controlling document for the broadcast 27 MHz clock chain. ISO text is paywalled; clock structure cross-checked against MPEG-2 Systems reference material.
  5. ITU-R BT.1359-1, "Relative Timing of Sound and Vision for Broadcasting", Table 1 (detectability +45/−125 ms; acceptability +90/−185 ms), 1998. https://www.itu.int/rec/R-REC-BT.1359 — the lip-sync tolerance window the drift example is measured against. ITU full text is paywalled; thresholds taken from the recommendation's normative summary.
  6. IETF RFC 6051, "Rapid Synchronisation of RTP Flows", §2–3, November 2010. https://www.rfc-editor.org/rfc/rfc6051.html — carries the NTP/RTP mapping in-band to remove the multi-second sync-lock delay.
  7. W3C WebRTC and the abs-capture-time RTP header extension (WebRTC-stats / libwebrtc documentation). https://www.w3.org/TR/webrtc/ — the modern in-band capture-time mechanism referenced for fast WebRTC sync.
  8. ITU-T G.711, "Pulse Code Modulation (PCM) of Voice Frequencies" (8 kHz narrowband sampling), 1988 (reaffirmed). https://www.itu.int/rec/T-REC-G.711 — basis for the 8,000 Hz speech-audio clock rate cited.
  9. AES recommendation and engineering references on sample-clock drift and parts-per-million crystal tolerance, used to support the 50 ppm worked example. Cross-checked against multiple oscillator-tolerance engineering sources.