Why this matters

If you are building a video conferencing product, a webinar platform, a telemedicine app, or any group-calling feature on WebRTC, "the audio is slightly ahead of the video" is a complaint that lands on your desk sooner or later — and it is almost never the thing people first blame. A product manager needs to understand why lip-sync that worked perfectly in a two-person demo falls apart in a fifty-person meeting, because that is a design decision, not a bug that appeared on its own. An engineer choosing between mediasoup, Pion, and LiveKit needs to know which one does the timestamp bookkeeping correctly and where each has known gaps. The whole point of lip-sync is that nobody notices it; the people who build the system are the only ones who ever have to think about it, and they have to think about it carefully.

[Image: a diagram contrasting two WebRTC topologies. On the left, a one-to-one peer-to-peer call: a sender box with a camera and microphone connects directly to a receiver box, with a single arrow labelled RTP plus RTCP Sender Reports carrying the original sender clock straight through. On the right, an SFU topology: a sender connects to an SFU box in the middle, which terminates the sender's RTCP and emits its own freshly generated Sender Reports on its own clock to the receiver. A red warning marker sits on the SFU, labelled "RTCP terminated and regenerated here", showing the point where lip-sync can break.]

Figure 1. In a one-to-one call the sender's timing reaches the receiver untouched. An SFU breaks that chain: it terminates the sender's RTCP and must regenerate Sender Reports on its own clock, and that is exactly where sync goes wrong.

First, what "in sync" actually requires

Before looking at any engine, it helps to be clear about what the receiver is trying to do. A speaker's audio and video are captured as two separate streams. They are encoded by two different codecs — Opus for the voice, something like VP8, VP9, AV1, or H.264 for the picture — and they travel as two independent flows of packets. Nothing in the network glues them together. The receiver has to reconstruct the original timing relationship: the instant of sound that went with the instant the lips were in a particular position has to come back out of the speakers at the same moment that frame appears on screen.

The tolerance for getting this wrong is well studied. The broadcast standard ITU-R BT.1359-1 puts the limit of acceptability at audio leading the video by about 90 milliseconds, or lagging it by about 185 milliseconds. Inside that window most viewers do not consciously notice. Outside it, the conversation starts to feel wrong — a dubbed-film effect that makes a speaker hard to follow and a product feel cheap. So the receiver's job is not to achieve perfect zero-offset alignment; it is to keep the offset inside that perceptual window, continuously, for the length of the call.
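As a minimal sketch of that acceptance test, the check below hard-codes the two thresholds quoted above. The function name and the sign convention (positive means audio leads) are our own assumptions for illustration, not part of any standard API.

```python
# Acceptability thresholds quoted in the text above (ITU-R BT.1359-1).
MAX_AUDIO_LEAD_MS = 90.0   # audio may play up to this much before its video
MAX_AUDIO_LAG_MS = 185.0   # audio may play up to this much after its video


def within_lip_sync_window(audio_lead_ms: float) -> bool:
    """Return True if the measured offset sits inside the acceptability window.

    audio_lead_ms > 0 means audio plays before its matching video frame;
    audio_lead_ms < 0 means audio plays after it.
    """
    return -MAX_AUDIO_LAG_MS <= audio_lead_ms <= MAX_AUDIO_LEAD_MS
```

The window is asymmetric on purpose: viewers tolerate late audio better than early audio, so a single absolute-value threshold would be either too strict or too loose.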

The two timestamps that make sync possible

WebRTC inherits its timing machinery from RTP and its control protocol RTCP, both defined in IETF RFC 3550. There are two timestamps in play, and the entire mechanism is about combining them.

The first is the RTP timestamp, carried in the header of every single media packet. It is a 32-bit counter that ticks at a clock rate fixed by the codec: 48,000 Hz for Opus audio, and a 90,000 Hz clock for video. This timestamp tells the receiver how much time passed between one packet and the next within a single stream — but it cannot tell you anything across streams. The audio clock and the video clock start at independent random offsets and tick at different rates, so an audio RTP timestamp of 480,000 and a video RTP timestamp of 900,000 have no direct relationship to each other. Think of the RTP timestamp as a stopwatch that started at a random number: useful for measuring intervals, useless for telling absolute time.
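The stopwatch behaviour can be sketched in a few lines. This is an illustration under two stated assumptions: both timestamps come from the same stream, and the second packet really is the later one. The helper name is hypothetical.

```python
RTP_TS_MODULUS = 1 << 32  # the RTP timestamp is a 32-bit counter


def rtp_elapsed_seconds(earlier_ts: int, later_ts: int, clock_rate_hz: int) -> float:
    """Seconds between two RTP timestamps of the same stream.

    The subtraction is done modulo 2**32, so an interval that spans a
    counter wrap still comes out positive. Assumes the real gap is
    shorter than 2**32 ticks.
    """
    ticks = (later_ts - earlier_ts) % RTP_TS_MODULUS
    return ticks / clock_rate_hz
```

For example, 960 ticks at 48,000 Hz is one 20 ms Opus frame, and 3,000 ticks at 90,000 Hz is one frame of 30 fps video. Comparing an audio timestamp with a video timestamp through this function would be meaningless, which is exactly the limitation described above.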

The second is the NTP timestamp, and it appears not on every packet but inside an RTCP Sender Report sent every few seconds. NTP time is wall-clock time — the actual time of day, expressed as a 64-bit value. The Sender Report is the crucial bridge: it pairs the two clocks together by saying, in effect, "at this wall-clock instant, my RTP timestamp for this stream was exactly this value." Once the receiver has that pair for the audio stream and a matching pair for the video stream, it can convert any RTP timestamp on either stream into a wall-clock time, and now the two streams live on one shared timeline. Lining them up becomes arithmetic.
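A rough sketch of that bridge follows. The 64-bit NTP format (32 bits of seconds, 32 bits of binary fraction) comes from RFC 3550; the class and method names are our own, chosen only for illustration.

```python
from dataclasses import dataclass

NTP_FRACTION = 1 << 32  # the low 32 bits of an NTP timestamp are a binary fraction


def ntp64_to_seconds(ntp64: int) -> float:
    """Convert a 64-bit NTP timestamp (32.32 fixed point) to seconds since 1900."""
    return (ntp64 >> 32) + (ntp64 & 0xFFFFFFFF) / NTP_FRACTION


@dataclass(frozen=True)
class SenderReportAnchor:
    ntp64: int          # wall-clock time carried in the Sender Report
    rtp_timestamp: int  # RTP timestamp that corresponds to that instant
    clock_rate_hz: int  # 48000 for Opus, 90000 for video

    def wall_clock_seconds(self, rtp_timestamp: int) -> float:
        """Map an RTP timestamp of this stream onto the NTP timeline."""
        # Signed 32-bit difference, so packets stamped slightly before the
        # report (or across a counter wrap) still map correctly.
        diff = (rtp_timestamp - self.rtp_timestamp) & 0xFFFFFFFF
        if diff >= 1 << 31:
            diff -= 1 << 32
        return ntp64_to_seconds(self.ntp64) + diff / self.clock_rate_hz
```

With one anchor per stream, any audio packet and any video frame can be placed on the same timeline by calling `wall_clock_seconds` on their respective anchors.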

One more piece holds it together: the CNAME, a canonical identifier carried in RTCP that is the same for every stream coming from one participant. The receiver uses the CNAME to know that this audio stream and that video stream belong to the same person and should be synchronized to each other. Without it, in a room full of people each sending audio and video, the receiver would have no reliable way to know which voice goes with which face.
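The grouping step might look like the sketch below. It assumes the receiver has already learned each stream's CNAME from RTCP; the `StreamInfo` type and the example values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class StreamInfo(NamedTuple):
    ssrc: int   # RTP synchronization source identifier of the stream
    kind: str   # "audio" or "video"
    cname: str  # canonical name learned from RTCP


def group_by_cname(streams: List[StreamInfo]) -> Dict[str, Dict[str, List[int]]]:
    """Group stream SSRCs by participant, keyed by CNAME."""
    groups: Dict[str, Dict[str, List[int]]] = defaultdict(
        lambda: {"audio": [], "video": []}
    )
    for stream in streams:
        groups[stream.cname][stream.kind].append(stream.ssrc)
    return dict(groups)
```

Only streams that end up in the same group are synchronized against each other; two streams with different CNAMEs are never lip-synced, even if their timestamps happen to look similar.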

[Image: a signal-flow diagram showing the timestamp mechanism. A sender emits two streams: an audio lane carrying RTP packets each stamped with a 48 kHz RTP timestamp, and a video lane carrying RTP packets each stamped with a 90 kHz RTP timestamp. Periodically, every few seconds, an RTCP Sender Report block is emitted for each lane, each block pairing an NTP wall-clock timestamp with the current RTP timestamp for that lane, and carrying a shared CNAME label. At the receiver, an arrow shows both lanes being converted onto a single shared NTP timeline where audio and video frames line up. Labels mark the 48 kHz audio clock, the 90 kHz video clock, the NTP-to-RTP pairing, and the shared CNAME.]

Figure 2. Each stream carries a fast RTP clock; the occasional Sender Report pairs that clock with NTP wall-clock time. The shared CNAME tells the receiver the two streams are one person, and the NTP pairing maps both onto one timeline.

A worked example: turning timestamps into a sync offset

Numbers make the mechanism concrete, so here is the arithmetic the receiver runs.

Suppose the receiver has an audio Sender Report saying that audio RTP timestamp 4,800,000 corresponds to NTP time T. Audio runs at 48,000 Hz, so a later audio packet with RTP timestamp 4,848,000 is exactly one second after T:

(4,848,000 − 4,800,000) ÷ 48,000 Hz = 48,000 ÷ 48,000 = 1.000 second after T

Now suppose the video Sender Report says that video RTP timestamp 900,000 also corresponds to NTP time T. Video runs at 90,000 Hz, so a video frame with RTP timestamp 990,000 is:

(990,000 − 900,000) ÷ 90,000 Hz = 90,000 ÷ 90,000 = 1.000 second after T

Both land at one second after the same wall-clock anchor T, so that audio packet and that video frame should be played at the same instant. The receiver delays whichever stream is ready early — usually it holds back the audio a little, because audio buffering is cheap and smooth — until the matching video frame is decoded and ready, then releases them together. If the receiver later computes that the audio is consistently landing 120 milliseconds before its matching video, it adds 120 milliseconds of buffering to the audio path to pull them back into the perceptual window. The entire scheme stands or falls on the NTP-to-RTP pairs in those Sender Reports being correct. Corrupt or omit them and the arithmetic produces a wrong answer with full confidence.
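The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, counter wraparound is ignored for brevity, and the "ready" times stand in for whatever the receiver measures as the moment each stream could be played.

```python
from typing import Tuple


def seconds_after_anchor(rtp_ts: int, sr_rtp_ts: int, clock_rate_hz: int) -> float:
    """Time of a packet relative to the wall-clock anchor T of its Sender Report."""
    return (rtp_ts - sr_rtp_ts) / clock_rate_hz


def extra_delay_ms(audio_ready_ms: float, video_ready_ms: float) -> Tuple[float, float]:
    """Buffering to add to (audio, video) so matching media plays together.

    Inputs are the times at which media for the same capture instant would
    be ready to play on each path. Only the earlier path is delayed.
    """
    audio_lead = video_ready_ms - audio_ready_ms  # > 0: audio is ready first
    return max(audio_lead, 0.0), max(-audio_lead, 0.0)
```

With the numbers from the example, `seconds_after_anchor(4_848_000, 4_800_000, 48_000)` and `seconds_after_anchor(990_000, 900_000, 90_000)` both evaluate to 1.0, and `extra_delay_ms(0.0, 120.0)` asks for 120 ms of additional audio buffering and none for video.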

Why the SFU is where lip-sync breaks

In a one-to-one call, the Sender Reports the receiver reads were written by the actual capture device, so the NTP times are honest readings of the moment the media was captured. This is why one-to-one WebRTC lip-sync is, in practice, a solved problem — you get it for free.

Group calls are different because they almost always run through a selective forwarding unit, an SFU, which receives every participant's streams and forwards the ones each person needs. Crucially, an SFU is an RTCP-terminating intermediary. It does not pass the sender's RTCP packets straight through. It consumes them and generates its own Sender Reports toward each receiver — because it is rewriting RTP timestamps as it forwards (to handle stream switching, simulcast layer changes, and re-packetization), so the original sender's NTP-to-RTP pairs would no longer match the packets actually going out.

This is the heart of the matter. The NTP-to-RTP mapping the receiver builds is based on the SFU's clock and the SFU's own processing, not the original sender's. Any asymmetry in how the SFU handles audio versus video internally — different queuing, different rewrite logic, a report sent for one stream but not the other — gets baked directly into the mapping the receiver relies on, and the result is audio and video that no longer line up. The SFU has taken on the job of being an honest clock, and if it does that job sloppily, every receiver inherits the error.
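A toy model makes the failure concrete. It assumes a deliberately simplified SFU that only adds a fixed per-stream offset to RTP timestamps and keeps the sender's wall-clock value; real SFUs do more than this, and every name here is hypothetical.

```python
from typing import Tuple

RTP_MOD = 1 << 32


def rtp_to_wall_clock(sr_ntp_s: float, sr_rtp: int, rtp_ts: int, clock_rate_hz: int) -> float:
    """Receiver side: map a packet's RTP timestamp to wall-clock seconds."""
    diff = (rtp_ts - sr_rtp) % RTP_MOD
    if diff >= RTP_MOD // 2:  # interpret as a small negative difference
        diff -= RTP_MOD
    return sr_ntp_s + diff / clock_rate_hz


def forward_packet_ts(rtp_ts: int, sfu_offset: int) -> int:
    """SFU side: rewrite a packet's RTP timestamp by a per-stream offset."""
    return (rtp_ts + sfu_offset) % RTP_MOD


def regenerate_sender_report(sr_ntp_s: float, sr_rtp: int, sfu_offset: int) -> Tuple[float, int]:
    """SFU side: the outgoing report must carry the same offset as the packets."""
    return sr_ntp_s, (sr_rtp + sfu_offset) % RTP_MOD
```

If the SFU shifts audio packets by 1,000 ticks but emits an audio report that still carries the unshifted RTP timestamp, the receiver places every audio packet 1000 ÷ 48,000 ≈ 20.8 ms late on the shared timeline. The video path is unaffected, so the whole error shows up as lip-sync offset.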

There is a modern fix for this, and it matters for all three engines below. The WebRTC project defines an RTP header extension called Absolute Capture Time whose entire purpose, in its own words, is "to provide a way to accomplish audio-to-video synchronization when RTCP-terminating intermediate systems (e.g. mixers) are involved." It stamps individual RTP packets with the NTP timestamp of when the frame was originally captured, based on the same clock the capture device uses for its Sender Reports. If the SFU forwards this extension faithfully, the receiver can recover the original capture timing even though the SFU rewrote everything else. An SFU that understands and preserves Absolute Capture Time can keep sync that a report-only SFU would lose. An intermediary may also rewrite the capture timestamps to its own clock if it wants to present itself as the capture system, but the better behaviour for a forwarding SFU is to pass the original through.
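As we read the specification in reference 4, the extension payload is either 8 bytes (the capture timestamp as unsigned 32.32 fixed-point NTP time) or 16 bytes (the same timestamp followed by a signed 32.32 estimated capture clock offset). The parser below is a sketch on that assumption; it handles only the payload bytes, not the surrounding RTP header-extension framing, and the function name is ours.

```python
import struct
from typing import Optional, Tuple

# URI used to negotiate the extension in SDP.
ABS_CAPTURE_TIME_URI = "http://www.webrtc.org/experiments/rtp-hdrext/abs-capture-time"


def parse_abs_capture_time(payload: bytes) -> Tuple[float, Optional[float]]:
    """Return (capture_time_ntp_seconds, estimated_clock_offset_seconds or None)."""
    if len(payload) == 8:
        (timestamp,) = struct.unpack("!Q", payload)
        return timestamp / 2**32, None
    if len(payload) == 16:
        timestamp, clock_offset = struct.unpack("!Qq", payload)
        return timestamp / 2**32, clock_offset / 2**32
    raise ValueError("abs-capture-time payload must be 8 or 16 bytes")
```

A receiver that gets this value on audio and video packets from the same participant can line the two up directly, without relying on the Sender Reports the SFU regenerated.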

How mediasoup does it

mediasoup is a low-level SFU library, and it handles the timestamp rewriting explicitly. When a consumer's source stream is switched — for example when a participant's simulcast layer changes or a new producer takes over a slot — mediasoup rewrites the outgoing RTP timestamps so they stay continuous, and it calculates the new timestamp from the relationship recovered from the producers' Sender Reports. In other words, it uses the incoming NTP-to-RTP pairing to compute a correct outgoing one rather than guessing. This is the right approach, and it is the reason mediasoup-based systems generally keep sync through layer switches that would otherwise jolt the stream.
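The idea behind that calculation can be sketched as follows. This is not mediasoup's code; it is an illustration that assumes both source streams come from the same sender clock and that a Sender Report has been received for each. All names are hypothetical.

```python
from typing import Tuple

RTP_MOD = 1 << 32


def switch_offset(sr_old: Tuple[float, int], sr_new: Tuple[float, int], clock_rate_hz: int) -> int:
    """Ticks to add to the new stream's RTP timestamps so they continue the old timeline.

    Each argument is an (ntp_seconds, rtp_timestamp) pair from a Sender Report.
    """
    ntp_old, rtp_old = sr_old
    ntp_new, rtp_new = sr_new
    # RTP timestamp the old stream would have shown at the instant of the new report.
    rtp_old_at_new = rtp_old + round((ntp_new - ntp_old) * clock_rate_hz)
    return (rtp_old_at_new - rtp_new) % RTP_MOD


def rewrite_timestamp(rtp_ts: int, offset: int) -> int:
    """Apply the switch offset to an outgoing packet's RTP timestamp."""
    return (rtp_ts + offset) % RTP_MOD
```

Because the offset is derived from wall-clock time rather than from "last timestamp sent plus one frame", the rewritten stream keeps the correct relationship to the audio stream after the switch, not just a continuous-looking counter.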

The catch with a library this low-level is that the application is responsible for the parts mediasoup does not impose. mediasoup will forward the timestamps and reports it is configured to handle, but the developer has to make sure audio and video timestamps are calculated the same way and that the relevant header extensions, including Absolute Capture Time where it helps, are negotiated and forwarded. mediasoup gives you correct primitives; it does not protect you from wiring them up inconsistently. In our experience the typical mediasoup sync bug is not in mediasoup itself but in application code that treats the audio and video paths slightly differently.

How Pion does it

Pion is a WebRTC implementation in Go, and it is the foundation under a large number of custom SFUs and media servers — including, historically, parts of the LiveKit stack. Pion exposes the timing machinery directly rather than hiding it. Its rtcp package defines the SenderReport structure with the NTPTime and RTPTime fields spelled out, and its interceptor framework includes a report interceptor that generates Sender and Receiver Reports from the RTP and RTCP flowing through. As the Pion maintainers put it, a Sender Report lets you "map two different RTP streams together by using RTPTime and NTPTime," so you "know at what absolute time something occurred," and then "in your playback application you can buffer/playout to ensure they are synced."

What this means in practice is that Pion is a toolkit, not a finished product. If you build an SFU on Pion, you own the synchronization logic. Pion hands you correct Sender Reports and the building blocks to generate your own as you forward, but the decision to rewrite timestamps coherently across audio and video, to forward Absolute Capture Time, and to emit Sender Reports for both streams on a regular cadence is yours. This is the great strength and the great trap of Pion: nothing is hidden, and nothing is done for you. Teams that understand the timestamp model build excellent sync on Pion; teams that treat it as a black box ship lip-sync bugs.
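For readers who want to see what those fields look like on the wire, here is a sketch of a Sender Report parser that follows the fixed layout in RFC 3550 §6.4.1. It is written in Python for consistency with the other examples and is not Pion's API; only the `ntp_time` and `rtp_time` names echo Pion's fields. Reception report blocks that may follow the sender info are ignored.

```python
import struct
from typing import NamedTuple

RTCP_PT_SENDER_REPORT = 200
SR_FIXED_LENGTH = 28  # 4-byte header + SSRC + 20 bytes of sender info


class SenderReport(NamedTuple):
    ssrc: int
    ntp_time: int      # 64-bit NTP timestamp (32.32 fixed point)
    rtp_time: int      # RTP timestamp for the same instant
    packet_count: int
    octet_count: int


def parse_sender_report(packet: bytes) -> SenderReport:
    """Parse the fixed part of an RTCP Sender Report."""
    if len(packet) < SR_FIXED_LENGTH:
        raise ValueError("packet too short for a Sender Report")
    first_byte, packet_type, _length_words = struct.unpack_from("!BBH", packet, 0)
    if first_byte >> 6 != 2:
        raise ValueError("unsupported RTCP version")
    if packet_type != RTCP_PT_SENDER_REPORT:
        raise ValueError("not a Sender Report")
    ssrc, ntp_time, rtp_time, packets, octets = struct.unpack_from("!IQIII", packet, 4)
    return SenderReport(ssrc, ntp_time, rtp_time, packets, octets)
```

The `ntp_time` and `rtp_time` pair is the anchor an SFU must recompute whenever it rewrites the RTP timestamps of the packets it forwards.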

How LiveKit does it

LiveKit is a full SFU, historically built on Pion, that is meant to be used as a product rather than a library, and it does the synchronization bookkeeping for you. Its server reads the NTP timestamps in the Sender Reports of the tracks it forwards and uses them to keep audio and video aligned for each participant. For most teams this is the right level of abstraction: you get working sync without writing the timestamp arithmetic yourself.

LiveKit's own issue tracker is the most honest illustration of how delicate this is, and it is worth knowing the failure modes because they generalize to every SFU. The project has documented a case where the server stopped sending RTCP Sender Reports for an audio track to a downstream consumer when the audio codec was RED — the redundancy encoding used to make Opus resilient to packet loss. No audio Sender Report means no NTP anchor for the audio stream, which means the receiver cannot map audio onto the shared timeline, which means lip-sync drifts. A separate recording-and-egress issue showed the same shape from the other direction: Sender Reports arriving for the video track but not the audio track, so the recording pipeline could not align the two. In both cases the bug was not in the sync arithmetic — it was a missing Sender Report, the input the arithmetic depends on. That is the lesson that matters more than any single engine: sync logic is only as good as the timing reports feeding it, and the most common production failure is a report that silently stopped being sent.
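Because the failure is silent, it is worth monitoring for explicitly. The watchdog below is a minimal sketch of that idea. The class, its method names, and the 10-second default are our assumptions; it is not part of LiveKit or any other SFU.

```python
from typing import Dict, List


class SenderReportWatchdog:
    """Tracks when each forwarded stream last had a Sender Report."""

    def __init__(self, max_gap_seconds: float = 10.0) -> None:
        self.max_gap_seconds = max_gap_seconds
        self._last_seen: Dict[int, float] = {}

    def expect(self, ssrc: int, now: float) -> None:
        """Register a stream that should be receiving Sender Reports."""
        self._last_seen.setdefault(ssrc, now)

    def on_sender_report(self, ssrc: int, now: float) -> None:
        """Record that a Sender Report was seen for this stream."""
        self._last_seen[ssrc] = now

    def stale_streams(self, now: float) -> List[int]:
        """SSRCs whose most recent Sender Report is older than the allowed gap."""
        return sorted(
            ssrc
            for ssrc, seen in self._last_seen.items()
            if now - seen > self.max_gap_seconds
        )
```

Registering every forwarded audio and video stream with `expect` and alerting on a non-empty `stale_streams` result would flag the missing-report pattern in both issues above before users report drifting lip-sync.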

The two strategies and three engines compared

Read this table one row at a time. The first two rows are the two timestamp mechanisms; the last three are the engines.

| Mechanism / engine | What it provides | Where it can fail |
| --- | --- | --- |
| RTCP Sender Report (RFC 3550) | NTP-to-RTP pair per stream; the classic anchor | SFU regenerates it on its own clock; a missing report kills sync for that stream |
| Absolute Capture Time extension | Original capture NTP time stamped on packets, survives RTCP-terminating SFUs | Only works if every hop negotiates and forwards the extension |
| mediasoup | Rewrites RTP timestamps from the SR NTP relationship on stream switch | App must wire audio/video paths identically and negotiate the right extensions |
| Pion | Correct SR primitives and a report interceptor; full control | You own all sync logic; nothing is done automatically |
| LiveKit | Server-side sync from track SR NTP timestamps; works out of the box | Edge cases (e.g. RED audio, egress) where an SR stops being sent break sync |

The honest summary: at the protocol level lip-sync is well-defined and the standard mechanisms work. At the system level the risk is always the same — an RTCP-terminating SFU that regenerates timing imperfectly, or simply fails to emit a Sender Report for one of the two streams. The Absolute Capture Time extension is the strongest defence because it lets the original capture timing survive the SFU, but only if it is negotiated end to end.

The common mistakes that turn sync into a bug

The same handful of errors account for most WebRTC lip-sync complaints, and each is a misunderstanding of where the timing lives.

The "split the media paths" mistake. The single most damaging architectural decision is sending audio through one server and video through another, for example by bolting a video SFU onto a legacy audio-mixing MCU. The two paths now have independent, unrelated timing, and there is no shared anchor to sync against. The blunt advice from BlogGeek.me (reference 10) is to go all-in on the SFU or all-in on the MCU, but never to split audio and video processing across separate media servers.

The "blame the network" mistake. When sync drifts, engineers reach for packet captures and jitter graphs. But network jitter scrambles arrival times symmetrically and shows up as stutter, not as a steady one-directional slide. A sync error that is fine at the start of a call and predictably worse over time is a clock or timestamp problem, not jitter — usually a Sender Report issue or accumulated clock drift.
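One way to tell the two apart is to log the measured audio/video offset over the call and fit a straight line to it. The sketch below uses an ordinary least-squares slope; the function name and the sample format are assumptions for illustration.

```python
from typing import List, Tuple


def offset_trend_ms_per_s(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of (time_s, av_offset_ms) samples.

    A slope near zero with noisy samples points to jitter; a steady
    non-zero slope points to a clock or timestamp problem.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_o = sum(o for _, o in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one instant")
    cov = sum((t - mean_t) * (o - mean_o) for t, o in samples)
    return cov / var_t
```

An offset that grows by 5 ms every 10 seconds gives a slope of 0.5 ms/s and will leave the perceptual window within a few minutes, whereas an offset that bounces around zero gives a slope close to zero however large the individual swings are.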

The missing Sender Report. As LiveKit's own bugs show, the most common concrete cause is an SFU that simply stops emitting a Sender Report for one stream — under a particular codec configuration, on a particular path. The audio loses its NTP anchor and floats. Always verify in chrome://webrtc-internals that Sender Reports are arriving for both the audio and the video stream.

The unforwarded capture-time extension. Teams enable Absolute Capture Time on the sender, assume it solves SFU sync, and never check that the SFU actually forwards it to the receiver. An extension that is dropped at the first hop does nothing. It has to be negotiated and preserved on every hop in the chain.
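A first sanity check is to confirm that the extension appears in the SDP on each leg of the call. The sketch below scans session descriptions for the relevant `a=extmap` line; the helper names are hypothetical, and the SDP strings are assumed to have been collected from each hop. Note that this only verifies negotiation. Whether the SFU actually copies the extension onto forwarded packets still has to be checked on the wire.

```python
import re
from typing import List

ABS_CAPTURE_TIME_URI = "http://www.webrtc.org/experiments/rtp-hdrext/abs-capture-time"

# Matches "a=extmap:<id>[/<direction>] <uri>" and captures the id and the URI.
_EXTMAP_LINE = re.compile(r"^a=extmap:(\d+)(?:/\w+)?\s+(\S+)")


def extmap_ids(sdp: str, uri: str) -> List[int]:
    """Return the extension ids under which `uri` is mapped in an SDP blob."""
    ids = []
    for line in sdp.splitlines():
        match = _EXTMAP_LINE.match(line.strip())
        if match and match.group(2) == uri:
            ids.append(int(match.group(1)))
    return ids


def negotiated_on_every_hop(sdps: List[str], uri: str = ABS_CAPTURE_TIME_URI) -> bool:
    """True only if every supplied SDP maps the extension at least once."""
    return all(extmap_ids(sdp, uri) for sdp in sdps)
```

Passing the sender-to-SFU answer and the SFU-to-receiver offer to `negotiated_on_every_hop` gives a quick yes or no on whether the chain is even capable of carrying capture time end to end.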

Where Fora Soft fits in

We have built the media path in video conferencing platforms, webinar and e-learning systems, telemedicine apps, and live-streaming products since 2005, including custom SFUs on mediasoup and Pion and deployments on LiveKit. Lip-sync is one of the quietest parts of that work and one of the most revealing: when a client reports "the audio is slightly ahead in group calls but fine one-to-one," we know to look first at how the SFU regenerates Sender Reports and whether it forwards Absolute Capture Time, not at the codec or the network. The fix is almost always restoring an honest timing anchor for both streams — making the SFU emit a Sender Report for the audio track it was quietly dropping, or wiring the capture-time extension through every hop.

References

  1. IETF RFC 3550, "RTP: A Transport Protocol for Real-Time Applications", §6.4.1 (Sender Report: the NTP/RTP timestamp pair used for inter-media synchronization) and §6.5.1 (CNAME), July 2003. https://www.rfc-editor.org/rfc/rfc3550.html — primary source for the Sender Report mechanism and the CNAME that groups a speaker's audio and video. Tier 1.
  2. IETF RFC 7587, "RTP Payload Format for the Opus Speech and Audio Codec", §4.1 (Opus uses a 48,000 Hz RTP clock), June 2015. https://www.rfc-editor.org/rfc/rfc7587.html — establishes the 48 kHz audio RTP clock used in the worked arithmetic. Tier 1.
  3. ITU-R BT.1359-1, "Relative Timing of Sound and Vision for Broadcasting", Table 1 (acceptability: audio +90 ms lead / −185 ms lag), 1998. https://www.itu.int/rec/R-REC-BT.1359 — the lip-sync tolerance window the receiver must stay inside. ITU full text is paywalled; thresholds taken from the recommendation's normative summary. Tier 1.
  4. WebRTC project, "Absolute Capture Time" RTP header extension specification, accessed 2026-06-07. https://webrtc.googlesource.com/src/+/main/docs/native-code/rtp-hdrext/abs-capture-time/README.md — first-party specification of the capture-time extension designed for RTCP-terminating intermediaries such as SFUs and mixers. Tier 2 (standards-body reference implementation / experimental spec).
  5. Pion project, rtcp package SenderReport (sender_report.go, NTPTime / RTPTime fields), accessed 2026-06-07. https://github.com/pion/rtcp/blob/main/sender_report.go — first-party source for how Pion models the Sender Report used for A/V sync. Tier 3 (maintainer source).
  6. Pion WebRTC Discussion #1825, "rtp to webrtc: How to handle audio/video synchronization?", accessed 2026-06-07. https://github.com/pion/webrtc/discussions/1825 — Pion maintainer explanation that the SR RTPTime/NTPTime pair lets an application map two RTP streams together and buffer for sync. Tier 3.
  7. versatica/mediasoup, CHANGELOG entry for v3.0.2 (RTP stream switching rewrites packet timestamps using the SenderReports' NTP relationship) and Issue #301 (consumer RTP timestamp offset), accessed 2026-06-07. https://github.com/versatica/mediasoup/issues/301 — first-party evidence that mediasoup computes outgoing timestamps from the incoming SR NTP relationship on stream switch. Tier 3.
  8. livekit/livekit Issue #3478, "When the audio codec is RED then RTCP sender reports not sent to downtrack", accessed 2026-06-07. https://github.com/livekit/livekit/issues/3478 — documented LiveKit failure mode where a missing audio Sender Report under RED breaks A/V sync. Tier 3.
  9. livekit/egress Issue #847, "Audio Video not in sync — sync logic in egress not working", accessed 2026-06-07. https://github.com/livekit/egress/issues/847 — second documented LiveKit case: Sender Reports received for video but not audio, preventing NTP-based alignment. Tier 3.
  10. Tsahi Levent-Levi, "Lip synchronization and WebRTC applications", BlogGeek.me, 2024 (updated 2025). https://bloggeek.me/lip-synchronization-webrtc/ — vendor-independent engineering explanation that 1:1 sync is solved while multiparty and split-path architectures break it; source for the "go all-in SFU or MCU" guidance. Tier 4.