Why this matters
If you build anything that sends live media over the internet — a video conferencing app, a telemedicine platform, a live-streaming SFU, a WebRTC game — audio-video sync does not happen by itself, and the machinery that makes it happen is the RTP timestamp plus the RTCP sender report. A product manager needs to know these exist so they understand why "the audio and video are both fine on their own but drift apart" is a real, common, and fixable class of bug, not a mystery. An engineer needs to know exactly what the RTP timestamp counts, what the sender report carries, and how the receiver turns two unrelated clocks into one timeline, so they can read a WebRTC stats dump and tell a sync fault from ordinary jitter. This article gives both readers one model: each stream has its own clock, the sender report is the note that ties that clock to real time, and lip-sync is the receiver doing arithmetic on those notes.
The problem: two streams, two clocks, no shared zero
Start with the fact that surprises people who first meet real-time media. When you join a video call, your microphone audio and your camera video do not travel together. They are encoded separately, packetized separately, and sent as two independent streams of small packets using the Real-time Transport Protocol, called RTP. Each packet is stamped with an RTP timestamp — a number that says when, on the sender's media clock, the data in this packet was sampled. So far this sounds like enough to keep them aligned. It is not, for two reasons that compound.
The first reason is that the two clocks count in different units. By long-standing convention an audio stream's clock and a video stream's clock run at different rates, so "RTP timestamp 96000" means one elapsed time on the audio stream and a completely different elapsed time on the video stream. The second reason is worse: RFC 3550, the standard that defines RTP, specifies that each stream's timestamp should start at a random value. This is a deliberate security measure (a predictable starting point would make certain attacks easier), but it means the audio clock's zero and the video clock's zero have no relationship to each other. Two clocks, two different speeds, two random starting points. You cannot compare them.
This is the gap the RTCP sender report fills. Alongside the media streams, each sender periodically emits a small control packet over the RTP Control Protocol, called RTCP. The most important of these is the sender report, the SR. It carries one crucial pair of numbers: a reading of that stream's RTP media clock, and the real-world wall-clock time at the same instant, written in a universal format. Collect one such pair for the audio stream and one for the video stream, and suddenly both random, differently-paced clocks are pinned to the same external timeline. The RTP timestamp tells the receiver when, relative to the stream's own start; the sender report tells it what that means in real time. Lip-sync is what you get when you do this for two streams at once.
Figure 1. Why sender reports exist. Each RTP stream has its own random-start media clock; the sender report ties that clock to a shared wall clock, and only then can two streams be aligned.
What the RTP timestamp actually counts
The single most common misunderstanding about RTP is that the timestamp is a wall-clock time, like "14:32:07." It is not. The RTP timestamp counts ticks of the media's own sampling clock, and RFC 3550 is specific about how it behaves: it increments monotonically and linearly in time, it must be derived from a clock that increments at a constant rate, and — critically — it counts media time, not packets and not real time.
A concrete audio example makes this clear. Suppose you send a stream of Opus audio. Each RTP packet typically carries 20 milliseconds of sound. The RTP payload format for Opus, RFC 7587, fixes the clock rate at 48,000 ticks per second for every Opus mode and every internal sampling rate. So the timestamp advances by a fixed amount per packet:
ticks per packet = clock rate × packet duration
= 48,000 ticks/s × 0.020 s
= 960 ticks per 20 ms packet
If one packet's timestamp is 96,000, the next is 96,960, the one after that 97,920, and so on — each packet 960 ticks later than the last, regardless of when it was actually sent on the network. That last clause is the heart of it. The timestamp records when the audio was sampled, not when the packet left. If two packets are sampled 20 ms apart but the network delays the second one, their timestamps are still exactly 960 apart. The timestamp is a property of the media, not of the transmission.
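The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than code from any RTP library: the function names are hypothetical, and the only detail added beyond the text is that the RTP timestamp field is 32 bits wide, so the sequence wraps modulo 2^32.

```python
# Sketch: RTP timestamps for consecutive audio packets.
# Function names are illustrative, not taken from a real RTP stack.
RTP_TS_MODULUS = 1 << 32  # the RTP timestamp field is 32 bits wide


def ticks_per_packet(clock_rate_hz: int, packet_ms: int) -> int:
    """Media-clock ticks covered by one packet of the given duration."""
    return clock_rate_hz * packet_ms // 1000


def packet_timestamps(first_ts: int, clock_rate_hz: int,
                      packet_ms: int, count: int) -> list[int]:
    """Timestamps of `count` back-to-back packets, wrapping at 32 bits."""
    step = ticks_per_packet(clock_rate_hz, packet_ms)
    return [(first_ts + i * step) % RTP_TS_MODULUS for i in range(count)]


print(ticks_per_packet(48_000, 20))              # 960
print(packet_timestamps(96_000, 48_000, 20, 3))  # [96000, 96960, 97920]
```

Nothing in the calculation depends on send or arrival time, which is the point: the sequence is fully determined by the sampling clock.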
Video works the same way but at a different rate. By convention, set in the RTP audio-video profile RFC 3551 and followed by essentially every video codec, video RTP streams use a 90,000-tick-per-second clock. Ninety kilohertz was chosen because it divides evenly by the common video frame rates — 90,000 ÷ 30 = 3,000 ticks per frame at 30 fps, 90,000 ÷ 25 = 3,600 at 25 fps, 90,000 ÷ 24 = 3,750 at 24 fps — so a frame interval is always a whole number of ticks. This is the same 90 kHz clock that runs through MPEG-2 transport streams; if you have read PTS, DTS, and the Elementary Stream Timestamp, it is the same clock domain, reused.
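A quick way to check the whole-number claim is to do the division with exact fractions. The sketch below uses Python's standard fractions module; the 30000/1001 (NTSC-style 29.97 fps) row is an extra added here for illustration and is not a rate discussed in the text.

```python
from fractions import Fraction

VIDEO_CLOCK_HZ = 90_000  # conventional RTP video clock rate


def ticks_per_frame(fps: Fraction) -> Fraction:
    """Exact number of 90 kHz ticks in one frame interval."""
    return Fraction(VIDEO_CLOCK_HZ) / fps


rates = {
    "24 fps": Fraction(24),
    "25 fps": Fraction(25),
    "30 fps": Fraction(30),
    "30000/1001 fps": Fraction(30000, 1001),  # illustrative extra row
}
for label, fps in rates.items():
    ticks = ticks_per_frame(fps)
    # A denominator of 1 means the frame interval is a whole number of ticks.
    print(label, ticks, "whole" if ticks.denominator == 1 else "fractional")
```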
Here is the trap that makes sync hard, stated plainly. The audio stream counts at 48,000 per second; the video stream counts at 90,000 per second; and each started at its own random number. There is no arithmetic you can do on the two RTP timestamps alone that tells you which audio sample belongs with which video frame. You need an external bridge. That bridge is the sender report.
| Media | Typical RTP clock rate | Ticks per unit | Source |
|---|---|---|---|
| Opus audio | 48,000 Hz | 960 per 20 ms packet | RFC 7587 |
| Telephone-band audio (G.711) | 8,000 Hz | 160 per 20 ms packet | RFC 3551 |
| Video (all common codecs) | 90,000 Hz | 3,000 per frame at 30 fps | RFC 3551 |
Table 1. RTP media-clock rates by media type. The rates differ on purpose, which is exactly why two raw RTP timestamps can never be compared without the sender report's wall-clock bridge.
The NTP timestamp: a universal wall clock
To tie a media clock to real time, you need a way to write real time that every device agrees on. RTP borrows the format used by the Network Time Protocol (NTP), the same protocol that keeps the clocks on computers and phones roughly correct. The NTP timestamp format is a 64-bit fixed-point number measuring seconds since a fixed origin: 0 hours UTC on 1 January 1900. The top 32 bits hold the whole seconds; the bottom 32 bits hold the fraction of a second.
Splitting 64 bits as 32 whole and 32 fractional gives enormous range and fine resolution at once. The 32 whole-second bits count up to about 136 years of seconds (which is why the format wraps in 2036, a known and managed event). The 32 fractional bits divide each second into about 4.3 billion parts, so the smallest representable step is about 233 picoseconds — far finer than any media timing needs. This is the format the sender report uses, and the same format the WebRTC "absolute capture time" header extension uses when it stamps the original capture instant directly onto a packet.
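To make the 32.32 layout concrete, here is a minimal sketch that packs a Unix time into the 64-bit NTP format and unpacks it again. The function names are hypothetical. The constant 2,208,988,800 is the number of seconds between the NTP origin (1900) and the Unix origin (1970). The sketch uses floating-point seconds for brevity, which limits precision to well under a microsecond at current dates, and it ignores the 2036 era wrap.

```python
NTP_UNIX_EPOCH_DELTA = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01 UTC
FRACTION_SCALE = 1 << 32              # the low 32 bits count 1/2^32 of a second


def unix_to_ntp64(unix_seconds: float) -> int:
    """Pack a Unix time into a 64-bit NTP timestamp (32.32 fixed point)."""
    ntp_seconds = unix_seconds + NTP_UNIX_EPOCH_DELTA
    whole = int(ntp_seconds)
    frac = int((ntp_seconds - whole) * FRACTION_SCALE)
    return ((whole & 0xFFFFFFFF) << 32) | (frac & 0xFFFFFFFF)


def ntp64_to_unix(ntp64: int) -> float:
    """Unpack a 64-bit NTP timestamp into Unix seconds (era 0 assumed)."""
    whole = ntp64 >> 32
    frac = ntp64 & 0xFFFFFFFF
    return whole - NTP_UNIX_EPOCH_DELTA + frac / FRACTION_SCALE


stamp = unix_to_ntp64(1_700_000_000.25)
print(hex(stamp >> 32), hex(stamp & 0xFFFFFFFF))
print(ntp64_to_unix(stamp))
print(1 / FRACTION_SCALE)  # smallest step in seconds, roughly 233 picoseconds
```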
One practical note that trips up engineers reading raw packets. The full NTP timestamp is 64 bits, but RTP's compact "middle bits" form, used in some receiver-side fields such as the receiver report's last-SR (LSR) field, takes the low 16 bits of the seconds and the high 16 bits of the fraction. The result is a 32-bit value with a resolution of 1/65,536 of a second that wraps every 65,536 seconds (about 18 hours), which is fine for short-term interval math. When you see a 32-bit NTP-ish value in a stats dump, it is almost always this compact form, not the full 64-bit one.
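The compact form is a shift and a mask. The sketch below, with hypothetical function names, extracts the middle 32 bits and measures an interval between two compact values. With 16 bits of seconds, the value wraps after 65,536 seconds, so the subtraction is done modulo 2^32.

```python
def ntp64_to_compact(ntp64: int) -> int:
    """Middle 32 bits: low 16 bits of seconds, high 16 bits of fraction."""
    return (ntp64 >> 16) & 0xFFFFFFFF


def compact_interval_seconds(later: int, earlier: int) -> float:
    """Elapsed seconds between two compact values, tolerating one wrap."""
    return ((later - earlier) & 0xFFFFFFFF) / 65536.0


full = (0x12345678 << 32) | 0x9ABCDEF0
print(hex(ntp64_to_compact(full)))                           # 0x56789abc
print(compact_interval_seconds(0x00018000, 0x00010000))      # 0.5
print(compact_interval_seconds(0x00008000, 0xFFFF8000))      # 1.0 (across a wrap)
```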
The RTCP sender report: the note that ties clock to time
Now the centrepiece. Every active sender in an RTP session periodically transmits an RTCP sender report. Inside it, the "sender info" block carries four numbers, and two of them do the synchronization work:
The NTP timestamp (64 bits) is the wall-clock time, in the NTP format above, at the instant this report was generated. The RTP timestamp (32 bits) is this stream's media-clock reading at that same instant — and RFC 3550 is explicit that it corresponds to the same time as the NTP timestamp, expressed in the same units and with the same random offset as the RTP timestamps on the data packets. The other two fields, the sender's packet count and octet count, are running totals used for statistics and loss estimation, not for sync.
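For readers who want to see those fields on the wire, the following is a minimal parsing sketch using Python's standard struct module. It assumes the RFC 3550 layout (a 4-byte RTCP header with packet type 200, the sender's SSRC, then the 20-byte sender info block) and ignores any reception report blocks that follow. The class and function names, and the sample packet values, are hypothetical.

```python
import struct
from dataclasses import dataclass

RTCP_PT_SENDER_REPORT = 200  # RTCP packet type for SR


@dataclass
class SenderInfo:
    ssrc: int
    ntp_timestamp: int  # full 64-bit NTP value
    rtp_timestamp: int  # media-clock reading for the same instant
    packet_count: int
    octet_count: int


def parse_sender_report(packet: bytes) -> SenderInfo:
    """Read the header, SSRC, and sender info from the first 28 bytes."""
    if len(packet) < 28:
        raise ValueError("too short for an RTCP sender report")
    (first, pt, _length, ssrc, ntp_msw, ntp_lsw,
     rtp_ts, packets, octets) = struct.unpack("!BBHIIIIII", packet[:28])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    if pt != RTCP_PT_SENDER_REPORT:
        raise ValueError("not a sender report")
    return SenderInfo(ssrc, (ntp_msw << 32) | ntp_lsw, rtp_ts, packets, octets)


# Made-up sample packet: version 2, no report blocks, length field 6 (words - 1).
SAMPLE = struct.pack("!BBHIIIIII", 0x80, 200, 6, 0x1234,
                     0xE9000000, 0x80000000, 720_000, 10, 1_600)
info = parse_sender_report(SAMPLE)
print(info)
```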
That pairing is the whole trick. The data packets give the receiver a stream of RTP timestamps with no real-world meaning. The sender report hands over one matched pair — "RTP timestamp R happened at NTP time T" — and that single anchor lets the receiver convert any RTP timestamp on that stream into wall-clock time, because the clock rate is known and constant:
For an audio stream with a 48,000 Hz clock and a sender report
saying RTP timestamp R0 = NTP time T0:
wall_clock(R) = T0 + (R − R0) ÷ 48,000 seconds
Do the same for the video stream with its 90,000 Hz clock and its own sender report, and now every audio sample and every video frame has a wall-clock time on the same timeline. Lip-sync becomes a lookup: when the playout clock reaches wall-clock time t, present the audio sample and the video frame whose computed wall-clock times are both t. The two streams were never comparable on their own; the sender reports made them comparable.
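The conversion formula above translates almost directly into code. The sketch below is one possible shape for it, with hypothetical names. The one addition beyond the formula is a signed, wrap-aware subtraction, an assumption made here so that packets slightly older than the anchor, or streams that cross the 32-bit timestamp wrap, still convert sensibly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncAnchor:
    rtp_timestamp: int   # R0: media-clock reading from the sender report
    ntp_seconds: float   # T0: wall-clock time from the same sender report
    clock_rate_hz: int   # 48,000 for Opus, 90,000 for video


def signed_rtp_delta(rtp_ts: int, anchor_ts: int) -> int:
    """Difference in ticks, interpreted as a signed 32-bit value."""
    delta = (rtp_ts - anchor_ts) & 0xFFFFFFFF
    return delta - (1 << 32) if delta >= (1 << 31) else delta


def wall_clock(anchor: SyncAnchor, rtp_ts: int) -> float:
    """wall_clock(R) = T0 + (R - R0) / clock_rate."""
    ticks = signed_rtp_delta(rtp_ts, anchor.rtp_timestamp)
    return anchor.ntp_seconds + ticks / anchor.clock_rate_hz


audio = SyncAnchor(rtp_timestamp=720_000, ntp_seconds=1000.0, clock_rate_hz=48_000)
print(wall_clock(audio, 768_000))  # 1001.0
```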
WebRTC binds the two streams together with one more piece: the CNAME, a canonical name carried in RTCP that says "these streams come from the same source and should be played in sync." A receiver uses the CNAME to know that this particular audio stream and that particular video stream are a synchronization pair in the first place — without it, the receiver would not know which audio goes with which video in a multi-party call. RFC 8834, which defines how RTP is used in WebRTC, requires all streams from one source to share a single CNAME for exactly this reason.
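As a rough illustration of how a receiver might use the CNAME, the sketch below groups incoming streams by their reported CNAME so that each group becomes a candidate synchronization set. The tuple layout, the function name, and the sample SSRC and CNAME values are all hypothetical; a real stack learns the SSRC-to-CNAME mapping from RTCP SDES packets.

```python
from collections import defaultdict


def group_by_cname(streams):
    """Group (ssrc, cname, kind) tuples into {cname: [(ssrc, kind), ...]}."""
    groups = defaultdict(list)
    for ssrc, cname, kind in streams:
        groups[cname].append((ssrc, kind))
    return dict(groups)


# Made-up streams from a three-stream call.
incoming = [
    (1111, "alice-cname", "audio"),
    (3333, "bob-cname", "audio"),
    (2222, "alice-cname", "video"),
]
sync_groups = group_by_cname(incoming)
print(sync_groups["alice-cname"])  # [(1111, 'audio'), (2222, 'video')]
```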
Figure 2. The sender report's matched NTP/RTP pair, and how the receiver uses it to convert any media-clock reading into wall-clock time for both streams.
A worked example: lining up audio and video
Make the math concrete with a single sync calculation. Suppose a receiver has collected one sender report from each stream of a call:
Audio stream (Opus, 48,000 Hz clock):
SR says: RTP timestamp 720,000 ↔ NTP time T_a
Video stream (90,000 Hz clock):
SR says: RTP timestamp 1,350,000 ↔ NTP time T_v (and T_v = T_a, same instant)
A data packet now arrives on the audio stream carrying RTP timestamp 768,000. How far past the audio anchor is it, in real time?
Δticks = 768,000 − 720,000 = 48,000 ticks
Δtime = 48,000 ÷ 48,000 Hz = 1.000 second after T_a
→ this audio was sampled at wall-clock time T_a + 1.000 s
A video frame arrives carrying RTP timestamp 1,395,000. How far past the video anchor is it?
Δticks = 1,395,000 − 1,350,000 = 45,000 ticks
Δtime = 45,000 ÷ 90,000 Hz = 0.500 second after T_v
→ this frame was sampled at wall-clock time T_v + 0.500 s
Because T_a and T_v are the same wall-clock instant, the audio packet (anchor + 1.000 s) and the video frame (anchor + 0.500 s) are 500 milliseconds apart in real time: the audio was captured half a second after that frame. The receiver now knows these two do not belong together. The frame must reach the screen 500 ms before this audio reaches the speaker, so the receiver delays whichever stream is running ahead until the right sample and the right frame come out together. Without the two sender reports, those two RTP timestamps (768,000 and 1,395,000) would have been meaningless to compare. With them, the offset is a one-line subtraction. That is the entire mechanism.
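The same worked example, expressed as a short Python sketch. The helper name is hypothetical, and the sketch assumes both packets were sampled after their stream's anchor, as they are in this example.

```python
def seconds_after_anchor(rtp_ts: int, anchor_rtp_ts: int, clock_rate_hz: int) -> float:
    """Elapsed media time between the anchor and this packet, in seconds."""
    return ((rtp_ts - anchor_rtp_ts) & 0xFFFFFFFF) / clock_rate_hz


# Values from the worked example; both anchors share the same wall-clock instant.
audio_s = seconds_after_anchor(768_000, 720_000, 48_000)
video_s = seconds_after_anchor(1_395_000, 1_350_000, 90_000)
gap_ms = (audio_s - video_s) * 1000

print(audio_s)  # 1.0
print(video_s)  # 0.5
print(gap_ms)   # 500.0 (this audio was captured 500 ms after this frame)
```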
Why sync takes a moment to lock — and how WebRTC speeds it up
There is a catch built into the design. Sender reports do not come with every packet; they ride on RTCP, and RTCP is rate-limited. RFC 3550 recommends keeping total RTCP traffic to about 5% of the session bandwidth and recommends a minimum of 5 seconds for the computed reporting interval (with an optional reduced minimum that scales inversely with session bandwidth). The consequence is direct: a receiver cannot synchronize two streams until it has received at least one sender report on each stream, so at the start of a call there can be a window of up to several seconds where audio and video play but are not yet locked together. You have probably seen this: the first second or two of a call where lips and voice are slightly off, then they snap into place.
For most conferencing this is invisible. For applications that cannot tolerate it — layered video, fast channel-switching, late joiners to a broadcast — the IETF defined RFC 6051, "Rapid Synchronisation of RTP Flows." It offers three mechanisms: tightened RTCP timing so the first reports arrive sooner, a feedback message that lets a switching server demand an immediate resync, and a new RTP header extension that carries the NTP/RTP mapping inside the media packets themselves, so the receiver never has to wait for an RTCP sender report at all.
WebRTC took the header-extension idea furthest with its absolute capture time extension, which stamps selected RTP packets with the NTP wall-clock time of the moment the very first frame in that packet was captured. This is more than a speed-up. A plain sender report's timestamp is generated when the report is sent, after capture, encoding, and buffering have already added delay; the absolute-capture-time extension carries the true capture instant, with known hardware delays subtracted, so the receiver's alignment is anchored to reality rather than to the report-generation moment. It is also what makes sync survive a mixer — an SFU or MCU that re-packetizes streams and would otherwise destroy the original sender's timing. The extension lets the original capture time ride through the mixer untouched.
This is the first common mistake worth naming, because it is the one we see most in production. When a server in the middle — an SFU forwarding streams, an MCU mixing them, a recording pipeline re-stamping packets — regenerates RTCP sender reports using its own clock instead of preserving the original mapping, every downstream receiver synchronizes to the wrong anchor. The individual audio and video look perfect; they simply drift apart, because the note that tied them to real time was rewritten by a box that never saw the original capture. When a client reports "sync was fine peer-to-peer but breaks the moment we route through the server," the sender-report handling in that middle box is the first thing to inspect.
What goes wrong: drift, jitter, and a missing anchor
Because the receiver trusts the sender report's NTP/RTP pair completely, three distinct failures degrade sync, each with its own signature.
Clock drift is the slow one. The sender's media clock and the receiver's playout clock are different physical crystals, and no two crystals run at exactly the same rate. A few parts per million of difference is normal and unavoidable. Left uncorrected, even a tiny rate difference accumulates: a 10-parts-per-million offset drifts the two streams by 36 milliseconds over an hour, enough to be visibly off by the end of a long call. Sender reports counter this by arriving repeatedly — each fresh report re-anchors the mapping, so the receiver's estimate of "where real time is" keeps getting corrected before drift can pile up. This is why sender reports must keep coming for the whole session, not just at the start.
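The drift figure is simple proportional arithmetic, sketched below with a hypothetical helper. The second call shows how little drift accumulates over a 5-second gap between reports, an interval chosen here purely for illustration.

```python
def drift_ms(ppm: float, elapsed_seconds: float) -> float:
    """Accumulated offset, in milliseconds, for a given rate error in ppm."""
    return ppm * 1e-6 * elapsed_seconds * 1000.0


print(round(drift_ms(10, 3600), 3))  # 36.0 ms over one hour with no correction
print(round(drift_ms(10, 5), 3))     # 0.05 ms between reports 5 s apart
```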
Network jitter is the noisy one. Packets do not arrive evenly spaced; routers and switches bunch them up and stretch them out. This does not corrupt the RTP timestamps — those are fixed at the sender and describe sampling time, not arrival time — but it does add noise to the receiver's measurement of when each report arrived, which slightly blurs the clock estimate. RTP defines an interarrival jitter statistic, reported back in receiver reports, precisely to quantify this: it is a smoothed average of the difference between the expected and actual spacing of packet arrivals, measured in the stream's own clock units. High interarrival jitter is the number that tells an engineer the network, not the encoder, is the problem.
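The interarrival jitter statistic can be sketched as a running estimate. The code below follows the formula in RFC 3550, where each new sample moves the estimate by one sixteenth of the difference, J = J + (|D| - J) / 16. The class name is hypothetical, arrival times are assumed to be already converted to the stream's clock units, and 32-bit wraparound is ignored for brevity.

```python
class JitterEstimator:
    """Running interarrival jitter, in RTP clock ticks."""

    def __init__(self) -> None:
        self.jitter = 0.0
        self._prev_transit = None  # arrival minus RTP timestamp of previous packet

    def update(self, arrival_ticks: int, rtp_ts: int) -> float:
        transit = arrival_ticks - rtp_ts
        if self._prev_transit is not None:
            # D: how much the packet spacing changed between sender and receiver.
            d = abs(transit - self._prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self._prev_transit = transit
        return self.jitter


est = JitterEstimator()
est.update(0, 96_000)
est.update(960, 96_960)
print(est.update(1_920, 97_920))  # 0.0, evenly spaced arrivals
print(est.update(3_040, 98_880))  # 10.0, one packet arrived 160 ticks late
```

Dividing the result by the clock rate converts it to seconds, which makes audio and video jitter values comparable.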
A missing or wrong anchor is the fatal one. If sender reports never arrive — blocked by a firewall, dropped by a middlebox, or simply never generated — the receiver has RTP timestamps it can play smoothly within each stream but no way to relate audio to video. Each stream sounds and looks fine alone; together they are unmoored. This is the cause behind the most baffling sync bugs, because every per-stream metric looks healthy. The fault is the absence of the cross-stream bridge, not a defect in either stream.
The visible symptoms map straight to the causes. Slow, monotonic lip-sync drift that worsens over a long call is clock drift outrunning the report cadence. Sync that is jittery and unstable but not trending in one direction is network jitter blurring the anchor. Sync that never establishes at all — audio and video each fine, permanently offset — is a missing or rewritten sender report. For what counts as a perceptible offset, and the tolerance windows a fault has to cross before a viewer notices, see What "In Sync" Means: ITU-R BT.1359, Lip-Sync Windows, Perceptibility.
| Failure | What it is | What the receiver sees | Typical symptom |
|---|---|---|---|
| Clock drift | Sender and receiver crystals run at slightly different rates | Anchor slowly becomes stale between reports | Lip-sync drifts steadily; worse the longer the call |
| Network jitter | Packets arrive bunched and stretched | Noise in the measured report-arrival time | Sync wobbles, unstable, but not one-directional |
| Missing / wrong SR | No sender report, or one rewritten by a middlebox | No valid NTP/RTP anchor for a stream | Streams each fine alone, permanently out of sync |
| Sync lock delay | RTCP rate limits the first reports | No anchor yet for one or both streams | First ~1–5 s of a call slightly off, then snaps in |
Table 2. The four ways RTP/RTCP synchronization fails, what the receiver experiences, and the symptom a user reports. Limits and statistics come from RFC 3550 (RTCP timing, interarrival jitter) and RFC 6051 (sync lock delay).
RTP/RTCP sync versus the broadcast PCR model
It is worth placing this beside the broadcast world's answer to the same problem, because the contrast clarifies both. A satellite or cable transport stream solves "the decoder has no clock" by sending a program clock reference, the PCR — a periodic snapshot of one master clock that the decoder rebuilds and then reads every PTS and DTS against. One stream, one clock, sent down a one-way channel. We cover it in full in PCR in MPEG-TS: The Program Clock That Runs the Broadcast.
RTP faces a harder version of the problem. There is no single master clock and no guaranteed one-way channel; there are two or more independent streams, possibly even from different hosts, arriving over a lossy network with a back-channel. So instead of one embedded clock reference, RTP uses a pair of self-describing notes — the sender reports — and relies on the receiver to stitch the streams onto a shared NTP timeline. The broadcast model assumes a controlled pipe and a single clock; the RTP model assumes chaos and reconstructs order at the edge. The PCR is a clock the decoder copies; the sender report is a translation table the receiver applies. Different channels, different designs, same underlying goal: give every timestamp a real-world meaning.
When the reference clocks genuinely need to be shared across devices — professional live production over IP, where dozens of sources must be sample-accurate — the industry goes one step further and locks every device to a common Precision Time Protocol (PTP, IEEE 1588) clock, signalled in the session description per RFC 7273 and mandated by SMPTE ST 2110 for IP studio infrastructure. There the NTP timestamps in the sender reports are not just internally consistent; they are anchored to one physical grandmaster clock across the whole facility. For the contribution and live-IP side of this, see Live Audio: Contribution Codecs, AES67, SMPTE 2110-30.
Where this sits in the WebRTC pipeline
It helps to locate the sender report inside the larger machine. A WebRTC endpoint captures audio and video, encodes each, packetizes them into RTP, and sends them — while a parallel RTCP channel carries sender reports out and receiver reports back. On the receiving side, each stream flows through its own jitter buffer, which absorbs network jitter and reconstructs even playout for that stream alone. The synchronization layer sits above the two jitter buffers: it takes the wall-clock times derived from each stream's sender reports and tells the two buffers how much to delay one stream relative to the other so that matching audio and video reach the output together. The jitter buffer handles smoothness within a stream; the sender-report sync handles alignment between streams. They are different jobs done by different components. For the buffer half of this, see Jitter Buffer: NetEQ, the Brain of WebRTC Audio, and for the whole chain, The WebRTC Audio Pipeline End-to-End.
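A heavily simplified sketch of what the synchronization layer decides. It assumes each stream's current capture-to-playout latency has already been derived from its sender reports and jitter-buffer delay, and it simply delays the faster stream to match the slower one. The function name and the sample latencies are hypothetical, and this is not how any particular WebRTC implementation computes or smooths its delays.

```python
def extra_delays_ms(audio_latency_ms: float, video_latency_ms: float) -> tuple[float, float]:
    """Extra delay for (audio, video) so both reach equal capture-to-playout latency."""
    target = max(audio_latency_ms, video_latency_ms)
    return target - audio_latency_ms, target - video_latency_ms


# Made-up example: audio is ready after 80 ms, video after 140 ms.
audio_extra, video_extra = extra_delays_ms(80.0, 140.0)
print(audio_extra, video_extra)  # 60.0 0.0 (hold audio back by 60 ms)
```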
Where Fora Soft fits in
We have built video conferencing, telemedicine, e-learning, and live-streaming products on WebRTC since well before it was a finished standard, and cross-stream synchronization is where real-time pipelines quietly pass or fail. The recurring real-world problem is rarely in the codecs: a call is perfectly in sync peer-to-peer, then a media server is added to scale the room, and the audio and video begin to drift — because the server regenerates RTCP sender reports on its own clock instead of preserving the original capture mapping or forwarding the absolute-capture-time extension. When a client reports "sync was fine in our prototype and broke when we added the SFU," the sender-report and header-extension handling in that server is the first seam we inspect, because that is exactly where the bridge between media clock and real time gets silently rewritten.
What to read next
- PCR in MPEG-TS: The Program Clock That Runs the Broadcast
- Jitter Buffer: NetEQ, the Brain of WebRTC Audio
- What "In Sync" Means: ITU-R BT.1359, Lip-Sync Windows, Perceptibility
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your RTCP sender report synchronization plan.
- See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the RTP / RTCP Sync Reference Card: a one-page summary of the RTP media-clock rates (Opus 48 kHz, G.711 8 kHz, video 90 kHz), the sender-report sender-info fields, the wall-clock conversion formula, and the four synchronization failure modes with their causes and symptoms.
References
- IETF RFC 3550 (July 2003), "RTP: A Transport Protocol for Real-Time Applications" (Internet Standard STD 64). The controlling standard. Defines the RTP timestamp (§5.1 — monotonic, linear, derived from a constant-rate clock, random initial value, counts media time not packets), the wallclock / NTP timestamp format (§4 — 64-bit fixed point, seconds since 0h UTC 1 January 1900, 32+32 split, and the compact "middle 32 bits" form), the RTCP sender report sender-info block (§6.4.1 — 64-bit NTP timestamp + 32-bit RTP timestamp corresponding to the same instant, plus packet/octet counts), the use of the NTP/RTP pair for inter-media synchronization, the interarrival jitter statistic (§6.4.1, Appendix A.8), and the RTCP transmission interval rules (§6.2 — 5% bandwidth, 5 s recommended minimum). Read directly from the RFC Editor HTML. https://www.rfc-editor.org/rfc/rfc3550.html — Tier 1 (official IETF Internet Standard).
- IETF RFC 3551 (July 2003), "RTP Profile for Audio and Video Conferences with Minimal Control" (STD 65). Establishes the per-payload clock-rate conventions — the 90,000 Hz video clock and the per-codec audio clock rates (e.g., 8,000 Hz for G.711). The source for Table 1's video and telephone-band rows. https://www.rfc-editor.org/rfc/rfc3551.html — Tier 1 (official IETF Internet Standard).
- IETF RFC 7587 (June 2015), "RTP Payload Format for the Opus Speech and Audio Codec." Fixes the Opus RTP timestamp clock rate at 48,000 Hz for all modes and all internal sampling rates. The source for the Opus row of Table 1 and the 960-ticks-per-20-ms worked example. https://www.rfc-editor.org/rfc/rfc7587.html — Tier 1 (official IETF RFC).
- IETF RFC 6051 (November 2010), "Rapid Synchronisation of RTP Flows." Defines the synchronization-delay problem (a receiver cannot sync until it has one SR per flow) and three mitigations: tightened RTCP timing, an AVPF feedback request for resync, and RTP header extensions carrying the NTP/RTP mapping in-band. The source for the "sync lock delay" row of Table 2 and the rapid-sync discussion. https://www.rfc-editor.org/rfc/rfc6051.html — Tier 1 (official IETF RFC).
- IETF RFC 8834 (January 2021), "Media Transport and Use of RTP in WebRTC." Specifies how RTP/RTCP are used in WebRTC, including the requirement that all RTP streams from one source share a single RTCP CNAME so receivers can identify synchronization pairs. The source for the CNAME discussion. https://www.rfc-editor.org/rfc/rfc8834.html — Tier 1 (official IETF RFC).
- IETF RFC 7273 (June 2014), "RTP Clock Source Signalling." Defines SDP signalling that identifies the reference clock (NTP, PTP/IEEE 1588, GPS) and media clock for a session. The source for the PTP / SMPTE 2110 reference-clock discussion. https://www.rfc-editor.org/rfc/rfc7273.html — Tier 1 (official IETF RFC).
- WebRTC project, "Absolute Capture Time" RTP header extension specification. First-party engineering specification from the libwebrtc maintainers. Defines the header extension that stamps a packet with the NTP capture time of its first frame (UQ32.32 NTP format, optional capture-clock-offset for mixers, interpolation rules). The source for the absolute-capture-time and mixer-survival discussion; subordinate to RFC 3550 for any normative timestamp definition. https://webrtc.googlesource.com/src/+/main/docs/native-code/rtp-hdrext/abs-capture-time/README.md — Tier 3 (first-party maintainer specification; experimental, not yet IETF-standardized).
- IEEE 1588-2019, "IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems" (PTP). The shared-reference-clock standard that professional IP production locks devices to, referenced by RFC 7273 and SMPTE ST 2110. Used for the PTP grandmaster-clock contrast in the broadcast-comparison section. https://standards.ieee.org/ieee/1588/6825/ — Tier 1 (official IEEE standard).
Note on a corrected oversimplification: a large share of online explainers describe the RTP timestamp as "the time the packet was sent" or treat it as a wall clock. This article follows RFC 3550 §5.1: the RTP timestamp counts ticks of the media sampling clock from a random starting value and reflects sampling time, not transmission time or wall time. The "RTP timestamp is wall time" framing is flagged as the single most common conceptual error in this topic — wall-clock meaning comes only from the NTP/RTP pair in the RTCP sender report, never from the RTP timestamp alone.


