Why this matters
Out-of-sync lips are the single defect a viewer notices fastest and complains about loudest — faster than a soft picture, faster than a quiet mix. If you build a video conferencing tool, a streaming service, an OTT app, or a telemedicine platform, "the audio is off" is a support ticket you will receive, and the first question your engineers must answer is whether the offset is actually outside the tolerance window or merely something one sensitive user can feel. Without the standard numbers, that argument never ends. This article gives the product manager, the QA lead, and the engineer one shared vocabulary and one shared table, so the conversation becomes "we are 70 milliseconds late, which is inside BT.1359's detectability threshold but past EBU R37's per-stage limit" instead of "it feels broken to me." Knowing the window also tells you where to spend effort: you do not need perfection, you need to stay inside a band that is wider than most people assume.
The thing nobody admits: perfect sync does not exist
Start with a fact that surprises most people. When you watch a person speak from across a room, the sound of their voice reaches you after you see their lips move. Light travels at about 300 million metres per second; sound travels at about 343 metres per second in air. Across a 10-metre room, the light arrives in roughly 33 nanoseconds — effectively instant — while the sound takes about 29 milliseconds to cover the same distance. So in ordinary life, the audio is always a little late, and your brain has spent your whole life learning that this is normal.
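The arithmetic is easy to reproduce for other distances. The following is a minimal sketch that assumes the same rounded speeds quoted above; the function name is ours and purely illustrative.

```python
# Rounded speeds, as quoted in the text (not precise physical constants).
SPEED_OF_LIGHT_M_S = 300_000_000.0
SPEED_OF_SOUND_M_S = 343.0


def natural_audio_lag_ms(distance_m: float) -> float:
    """Milliseconds by which sound trails light from a source distance_m away."""
    light_s = distance_m / SPEED_OF_LIGHT_M_S
    sound_s = distance_m / SPEED_OF_SOUND_M_S
    return (sound_s - light_s) * 1000.0


if __name__ == "__main__":
    for distance in (1, 3, 10, 30):
        lag = natural_audio_lag_ms(distance)
        print(f"{distance:>3} m: sound arrives {lag:.1f} ms after the light")
```

The light term is so small that it barely changes the result; the lag is essentially the distance divided by the speed of sound.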
That single fact explains almost everything about lip-sync tolerance. The brain is not measuring whether sound and picture line up to the microsecond. It is asking a looser question: does this pairing fall within the range of relationships I have learned to expect between a sight and the sound it makes? Because late sound is the natural case, the brain tolerates a lot of lateness and very little earliness. Sound arriving before the picture never happens in nature, so when it happens on a screen, it feels wrong much sooner.
The technical name for the small range of timing differences a viewer accepts is the synchronization tolerance window, often shortened to the lip-sync window. The job of every standard in this article is to put numbers on that window. The job of every engineer is to keep the delivered offset inside it.
A note on language before we go further, because half of all lip-sync confusion is vocabulary. Throughout this article, a positive number means the sound is ahead of the picture (audio leads video — the unnatural, sooner-detected case). A negative number means the sound is behind the picture (audio lags video — the natural, better-tolerated case). This is the sign convention used by ITU-R BT.1359, and we will hold to it everywhere below. Whenever a number could be ambiguous, we also spell it out in words.
Figure 1. The lip-sync tolerance window from ITU-R BT.1359-1. The window is not centred on zero — it extends much further to the left (sound late) than to the right (sound early), because human perception forgives late sound and punishes early sound.
The numbers everyone googles: ITU-R BT.1359-1
When a broadcast engineer needs to settle a lip-sync argument, they reach for one document: ITU-R BT.1359-1, "Relative timing of sound and vision for broadcasting," approved in November 1998 and still in force in 2026. It is short, it is a free download from the ITU, and it does one valuable thing: it converts decades of perceptual research into two pairs of numbers.
The recommendation defines two thresholds, each as a window with a right edge (sound ahead) and a left edge (sound behind).
The threshold of detectability is the point at which an average viewer begins to notice that something is off, even if they cannot say what. BT.1359-1 puts it at +45 milliseconds (sound ahead of picture) to −125 milliseconds (sound behind picture). Inside this window, the great majority of viewers perceive the audio and video as synchronized. This is your target band: if you stay inside +45 / −125, you have effectively won.
The threshold of acceptability is the wider point beyond which the mismatch is not just noticeable but bothersome — the viewer is now distracted and the experience is degraded. BT.1359-1 puts it at +90 milliseconds (sound ahead) to −185 milliseconds (sound behind). Between detectability and acceptability is a grey zone: sensitive viewers notice, but most will tolerate it for the length of a programme.
Read those two windows again and the asymmetry jumps out. On the "sound behind picture" side, you have 125 milliseconds of room before detection and 185 before it becomes objectionable. On the "sound ahead of picture" side, you have only 45 and 90. At the detectability threshold the left half of the window is roughly 2.8 times wider than the right half (125 ÷ 45); at the acceptability threshold it is still about twice as wide (185 ÷ 90). That is not an accident or a rounding artefact; it is the room-acoustics fact from the previous section, measured and written into a standard. Late sound is natural, so we tolerate two to three times as much of it.
Here is the BT.1359-1 window laid out as a single table — the table the open internet strangely never hosts cleanly in one place.
| Threshold | Sound ahead of picture (audio leads) | Sound behind picture (audio lags) | What it means for you |
|---|---|---|---|
| In sync (target) | up to +45 ms | up to −125 ms | Virtually all viewers perceive sync. Aim here. |
| Detectability | +45 ms | −125 ms | Average viewer begins to notice. The edge of "good". |
| Acceptability | +90 ms | −185 ms | Beyond here it is distracting and degrades quality. |
| Objectionable | beyond +90 ms | beyond −185 ms | A defect. File the bug. |
Table 1. ITU-R BT.1359-1 (11/98) lip-sync thresholds. Positive = sound ahead of picture; negative = sound behind picture.
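Table 1 translates directly into a small classifier. This is a sketch rather than a reference implementation: the band labels and the function name are our own, and treating an offset that lands exactly on an edge as still inside the band is an assumption, not something the table specifies.

```python
# Thresholds from ITU-R BT.1359-1 as quoted in Table 1.
# Sign convention: positive = sound ahead of picture, negative = sound behind.
DETECT_LEAD_MS, DETECT_LAG_MS = 45.0, -125.0
ACCEPT_LEAD_MS, ACCEPT_LAG_MS = 90.0, -185.0


def classify_offset(offset_ms: float) -> str:
    """Map a measured A/V offset to a band of Table 1 (labels are illustrative)."""
    if DETECT_LAG_MS <= offset_ms <= DETECT_LEAD_MS:
        return "in sync"
    if ACCEPT_LAG_MS <= offset_ms <= ACCEPT_LEAD_MS:
        return "detectable but acceptable"
    return "objectionable"


if __name__ == "__main__":
    for offset in (-200, -150, -70, 0, 45, 70, 120):
        print(f"{offset:+5d} ms -> {classify_offset(offset)}")
```

Note that 70 ms lands in different bands depending on its sign: late audio at −70 ms is still in sync, while early audio at +70 ms is already detectable.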
One discipline point worth stating plainly, because it is the most common mistake online. Many articles quote a single symmetric number — "lip-sync must be within ±40 ms" or "within one frame" — as if the window were centred on zero. It is not. BT.1359-1's window is asymmetric, and collapsing it to one symmetric figure throws away the most useful thing the standard tells you: that you have far more headroom for late audio than for early audio. When you design a buffer or a delay correction, that asymmetry is exactly the slack you want to exploit.
Where the numbers come from: perception, not engineering
The BT.1359-1 thresholds did not start in a standards committee. They started in perception laboratories, decades earlier, and the most-cited single study is Dixon and Spitz (1980), "The Detection of Auditory Visual Desynchrony."
Their method is elegant and worth picturing. They showed participants a continuous video — either a person talking or a hammer hitting a peg — that began perfectly in sync, then slowly drifted out of sync at a constant rate of about 51 milliseconds of offset per second of video. Participants pressed a button the moment they first noticed the audio and picture had come apart. Drifting slowly, rather than jumping straight to a fixed offset, mimics how real broadcast and streaming systems actually fail: sync rarely breaks all at once; it creeps.
The findings map directly onto the BT.1359 asymmetry. For the talking face, viewers tolerated the sound lagging by about 258 milliseconds before noticing, but tolerated the sound leading by only about 131 milliseconds. For the hammer, a sharp, percussive event with a crisp visual impact, the thresholds tightened to about 188 milliseconds lag and only 75 milliseconds lead. Two lessons fall out. First, in both cases late sound is tolerated far more than early sound, confirming the natural-acoustics story. Second, the type of content matters: a sharp transient like a hammer strike (or a drum hit, or a clapperboard) is policed by the brain far more strictly than the soft, blurry articulation of continuous speech.
BT.1359 takes the cautious path. Rather than adopting the generous thresholds you get from continuous speech, it sets tighter, content-independent numbers that hold up even for the demanding transient cases. That is why the standard's +45 / −125 is stricter than Dixon and Spitz's speech-only +131 / −258: a broadcast standard must protect the worst case, not the average one.
There is a deeper reason the brain is so invested in lip timing, and it has a name: the McGurk effect, from McGurk and MacDonald (1976). If you play the sound "ba" while showing a face mouthing "ga," most people hear neither — they hear "da," a fusion the brain invents to reconcile the conflicting senses. The illusion is involuntary; it persists even when you know the trick. The takeaway for us is that the brain does not treat lip movement as decoration on top of speech. It uses the lips as primary evidence about what is being said, fusing sight and sound below the level of conscious thought. When the two drift apart, you are not just noticing a cosmetic glitch — you are breaking a perceptual mechanism the brain runs constantly to understand speech. That is why lip-sync errors feel so viscerally wrong, and why they are the defect viewers report first.
Figure 2. Perceptual detection thresholds from Dixon & Spitz (1980) for speech versus a sharp transient, with the BT.1359 broadcast threshold beneath. Sharp transients are policed more strictly, so the standard sets numbers that protect them.
The other standards you will meet
BT.1359 is the perceptual anchor, but it is a viewer-side recommendation — it describes what the audience can tolerate at the end of the chain. In production you will meet two more documents that are stricter, because they govern points upstream where errors still have room to accumulate.
EBU R37, "The relative timing of the sound and vision components of a television signal," is the European broadcasters' production rule. It does two things BT.1359 does not. First, it sets a tight per-stage budget: at any single piece of equipment in the chain, audio must stay within +5 milliseconds (ahead) to −15 milliseconds (behind) of video. Second, it sets an overall limit at the output that feeds the transmitter: +40 milliseconds (ahead) to −60 milliseconds (behind). The logic is a budget you can manage: if every stage stays within its small allowance, the errors sum to something the viewer never notices. The per-stage numbers are not perceptual thresholds (nobody can feel 5 milliseconds); they are error-accumulation limits. They are not a guarantee on their own, though: eight stages that each run the full 5 milliseconds early already reach the +40 overall limit, and four stages that each run the full 15 milliseconds late already reach −60. A long chain lands inside BT.1359's window only when most stages use far less than their allowance or their errors partly cancel, so the overall limit has to be checked separately.
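Whether a chain lands inside the overall limit depends on how much of its allowance each stage actually spends. The bounding case, where every stage spends its full allowance in the same direction, is worth computing explicitly. The sketch below uses only the EBU R37 figures quoted above; the function names are hypothetical.

```python
# EBU R37 figures as quoted in the text (positive = sound ahead of picture).
PER_STAGE_LEAD_MS, PER_STAGE_LAG_MS = 5.0, -15.0
OVERALL_LEAD_MS, OVERALL_LAG_MS = 40.0, -60.0


def worst_case(stages: int) -> tuple[float, float]:
    """Offset range if every stage uses its full allowance in one direction."""
    return stages * PER_STAGE_LEAD_MS, stages * PER_STAGE_LAG_MS


def max_stages_within_overall() -> int:
    """Largest stage count whose worst case still fits the overall limit."""
    by_lead = int(OVERALL_LEAD_MS // PER_STAGE_LEAD_MS)  # 40 / 5 = 8
    by_lag = int(OVERALL_LAG_MS // PER_STAGE_LAG_MS)     # -60 / -15 = 4
    return min(by_lead, by_lag)


if __name__ == "__main__":
    for n in (2, 4, 8, 10):
        lead, lag = worst_case(n)
        print(f"{n:>2} stages: worst case {lead:+.0f} ms / {lag:+.0f} ms")
    print("stages that fit in the worst case:", max_stages_within_overall())
```

This is deliberately pessimistic. Real stages rarely all err at full allowance in the same direction, but the calculation shows why the per-stage budget and the overall limit are two separate checks.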
ATSC IS-191 is the North American counterpart, an Implementation Subcommittee finding for digital television. It specifies that at the input to the DTV encoder, the sound should never lead the video by more than 15 milliseconds and never lag by more than 45 milliseconds. Like EBU R37, it is deliberately tighter than the viewer threshold, and it polices the encoder input specifically — the point after which any error is baked into the broadcast stream and can no longer be cheaply fixed.
Notice the pattern across all three. The viewer-side recommendation (BT.1359) is the loosest, because it describes the final perceptual budget. The production rules (EBU R37, ATSC IS-191) are tighter, because they govern intermediate points that still have downstream stages to feed. This is the same logic as a financial budget: the closer you are to the start of the chain, the less of your total allowance you are permitted to spend, so there is room left for everyone after you.
| Standard | Scope | Sound ahead limit | Sound behind limit | Why this strictness |
|---|---|---|---|---|
| ITU-R BT.1359-1 | Viewer perception (detectability) | +45 ms | −125 ms | The final perceptual budget. |
| ITU-R BT.1359-1 | Viewer perception (acceptability) | +90 ms | −185 ms | The outer "still tolerable" edge. |
| EBU R37 | Per production stage | +5 ms | −15 ms | Error-accumulation control. |
| EBU R37 | Overall, to transmitter | +40 ms | −60 ms | Sum of stages, still inside BT.1359. |
| ATSC IS-191 | DTV encoder input | +15 ms | −45 ms | Lock sync before it is baked in. |
Table 2. The three standards side by side. All use the BT.1359 sign convention: positive = sound ahead of picture. Production rules are tighter than viewer perception because they govern upstream points.
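For QA tooling it can be handy to hold Table 2 as data and ask which limits a measured offset satisfies. The sketch below is one way to do that; the `Limit` type and `passing_limits` function are illustrative names. Keep in mind that each limit applies at a different point in the chain, so comparing one number against all of them is a convenience, not a compliance test.

```python
from typing import NamedTuple


class Limit(NamedTuple):
    name: str
    lead_ms: float  # most the sound may be ahead of the picture (positive)
    lag_ms: float   # most the sound may be behind the picture (negative)


# Values copied from Table 2.
LIMITS = [
    Limit("BT.1359-1 detectability", 45, -125),
    Limit("BT.1359-1 acceptability", 90, -185),
    Limit("EBU R37 per stage", 5, -15),
    Limit("EBU R37 overall", 40, -60),
    Limit("ATSC IS-191 encoder input", 15, -45),
]


def passing_limits(offset_ms: float) -> list[str]:
    """Names of every limit in Table 2 that the offset stays within."""
    return [lim.name for lim in LIMITS if lim.lag_ms <= offset_ms <= lim.lead_ms]


if __name__ == "__main__":
    for offset in (-70, -25, 10):
        print(f"{offset:+d} ms passes: {passing_limits(offset)}")
```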
Turning windows into frames: the unit your video team actually uses
Audio engineers think in milliseconds; video engineers think in frames. To talk to both, you need the conversion, and it is one division.
A frame lasts one over the frame rate. Plug in the common rates:
frame duration = 1 ÷ frame rate
24 fps → 1 ÷ 24 = 0.04167 s = 41.67 ms per frame
25 fps → 1 ÷ 25 = 0.04000 s = 40.00 ms per frame
30 fps → 1 ÷ 30 = 0.03333 s = 33.33 ms per frame
Now the BT.1359 detectability window becomes intuitive in frames. The "sound behind" edge of −125 milliseconds is about three frames at 24 fps (125 ÷ 41.67 ≈ 3.0) and about 3.75 frames at 30 fps (125 ÷ 33.33 ≈ 3.75). The "sound ahead" edge of +45 milliseconds is just over one frame at 24 fps (45 ÷ 41.67 ≈ 1.08) and about 1.35 frames at 30 fps (45 ÷ 33.33 ≈ 1.35). So the old broadcast rule of thumb, "keep it inside one frame ahead, three frames behind," is just BT.1359 translated into frame counts at film and broadcast frame rates. It is correct, and now you know exactly where it comes from.
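The same conversion as a pair of helper functions, in case you want it for other frame rates. This is a minimal sketch; the function names are ours.

```python
def frame_duration_ms(fps: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def ms_to_frames(offset_ms: float, fps: float) -> float:
    """Express an A/V offset in frames, keeping its sign."""
    return offset_ms / frame_duration_ms(fps)


if __name__ == "__main__":
    # BT.1359-1 detectability edges: +45 ms (sound ahead), -125 ms (sound behind).
    for fps in (24, 25, 30):
        print(
            f"{fps} fps: frame = {frame_duration_ms(fps):.2f} ms, "
            f"+45 ms = {ms_to_frames(45, fps):+.2f} frames, "
            f"-125 ms = {ms_to_frames(-125, fps):+.2f} frames"
        )
```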
This conversion is also where a common mistake hides. A team will say "we're within one frame, so we're fine," meaning one frame in either direction. But one frame ahead at 24 fps (about 42 ms) sits only about 3 ms inside the +45 detectability edge, while one frame behind (also about 42 ms) leaves more than 80 ms of unused room before −125. The symmetric "one frame" rule leaves almost no margin on the early side and is far stricter than it needs to be on the late side. The window is asymmetric; your tolerance check should be too.
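To see where the two rules disagree, compare a symmetric one-frame check with a check against the BT.1359-1 detectability edges. This is an illustrative sketch: both function names are hypothetical, and offsets are given in frames with the article's sign convention (positive = sound ahead).

```python
DETECT_LEAD_MS, DETECT_LAG_MS = 45.0, -125.0  # BT.1359-1 detectability edges


def symmetric_one_frame_ok(offset_frames: float) -> bool:
    """The popular shortcut: within one frame in either direction."""
    return abs(offset_frames) <= 1


def asymmetric_ok(offset_frames: float, fps: float) -> bool:
    """Convert frames to milliseconds, then test each edge separately."""
    offset_ms = offset_frames * 1000.0 / fps
    return DETECT_LAG_MS <= offset_ms <= DETECT_LEAD_MS


if __name__ == "__main__":
    fps = 24
    for frames in (-4, -3, -2, -1, 0, 1, 2):
        print(
            f"{frames:+d} frames @ {fps} fps: "
            f"symmetric={symmetric_one_frame_ok(frames)}, "
            f"asymmetric={asymmetric_ok(frames, fps)}"
        )
```

At 24 fps the two checks agree from −1 to +1 frame, but the symmetric rule rejects −2 and −3 frames, which are still inside the detectability window.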
A worked example: budgeting sync across a real pipeline
Numbers in a table feel abstract until you spend them. Walk through a simplified live-streaming chain and watch the sync budget drain stage by stage. Positive still means audio ahead; negative means audio behind.
Suppose a live event flows through five stages, and at each one the audio and video paths have slightly different delays:
Stage 1 Camera + microphone capture audio −8 ms (mic path slightly slower)
Stage 2 Audio + video encoding audio −12 ms (audio encoder buffers more)
Stage 3 Packaging into segments audio +5 ms (video segmenter adds delay)
Stage 4 CDN + network transport audio 0 ms (both ride the same packets)
Stage 5 Player decode + render audio −10 ms (audio output buffer adds delay)
Total accumulated offset = (−8) + (−12) + (+5) + (0) + (−10) = −25 ms
The audio ends up 25 milliseconds behind the video. Check it against the table: −25 milliseconds is well inside BT.1359's −125 detectability threshold, inside EBU R37's −60 overall limit, and inside ATSC IS-191's −45 encoder limit. This pipeline ships. No correction needed.
Now change one number. Suppose the audio encoder in Stage 2 stalls under load and adds −90 milliseconds instead of −12. The total becomes −8 − 90 + 5 + 0 − 10 = −103 milliseconds. That is still inside BT.1359 detectability (−125) — most viewers will not notice — but it has blown past EBU R37's overall limit (−60) and past ATSC IS-191 (−45). A broadcaster would reject it; a streaming service might ship it and gamble that viewers stay below the detection threshold. This is the real decision teams face, and the table is what makes it a decision rather than a guess.
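Both scenarios can be replayed in a few lines. The sketch below sums the per-stage offsets from the example and checks the total against the three limits. The stage keys and function names are illustrative, and applying the ATSC IS-191 encoder-input figure to an end-to-end total follows the simplification used in the walkthrough.

```python
# Per-stage audio offsets in ms from the worked example
# (positive = audio ahead of video, negative = audio behind).
BASELINE = {"capture": -8, "encode": -12, "package": 5, "transport": 0, "player": -10}
DEGRADED = {**BASELINE, "encode": -90}  # audio encoder stalls under load


def total_offset_ms(stages: dict[str, int]) -> int:
    """Accumulated offset across the whole chain."""
    return sum(stages.values())


def check(total_ms: float) -> dict[str, bool]:
    """True where the total stays inside the named limit."""
    return {
        "BT.1359-1 detectability (+45 / -125)": -125 <= total_ms <= 45,
        "EBU R37 overall (+40 / -60)": -60 <= total_ms <= 40,
        "ATSC IS-191 (+15 / -45)": -45 <= total_ms <= 15,
    }


if __name__ == "__main__":
    for label, stages in (("baseline", BASELINE), ("degraded", DEGRADED)):
        total = total_offset_ms(stages)
        print(f"{label}: total {total:+d} ms")
        for limit, ok in check(total).items():
            print(f"  {'pass' if ok else 'FAIL'}  {limit}")
```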
Common mistake: chasing zero. Engineers new to sync often try to drive the offset to exactly 0 ms and burn enormous effort doing it. They are optimizing for a point that does not exist perceptually and that the natural world never delivers. The correct goal is to land inside the window — ideally a little on the late side, where the window is widest, never on the early side. A delivered offset of −30 ms (audio slightly behind) is a better engineering target than a fragile, expensive attempt at 0 ms, because it sits in the deepest part of the tolerance basin and degrades gracefully if a downstream stage adds a little more lag.
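One way to compare candidate targets is to compute how much room each leaves before either edge of the detectability window. This is a sketch with a hypothetical function name; it measures distance to the edges only and does not model how likely drift is in each direction.

```python
def headroom_ms(
    offset_ms: float,
    lead_limit_ms: float = 45.0,    # BT.1359-1 detectability, sound ahead
    lag_limit_ms: float = -125.0,   # BT.1359-1 detectability, sound behind
) -> tuple[float, float]:
    """Return (room before the early edge, room before the late edge) in ms."""
    return lead_limit_ms - offset_ms, offset_ms - lag_limit_ms


if __name__ == "__main__":
    for target in (0, -30, -40):
        early_room, late_room = headroom_ms(target)
        print(
            f"target {target:+d} ms: {early_room:.0f} ms before the early edge, "
            f"{late_room:.0f} ms before the late edge"
        )
```

A target of 0 ms leaves 45 ms of protection against early audio; a target of −30 ms raises that to 75 ms while still leaving 95 ms on the late side. The exact midpoint of the window is −40 ms.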
Why streaming and real-time break sync differently
The standards above were written for broadcast, where one clock runs the whole chain. Modern delivery breaks that assumption in two different ways, and each creates its own sync failure mode.
In adaptive streaming — HLS, DASH, CMAF — audio and video are usually separate tracks, cut into segments, and reassembled by the player using timestamps. Sync depends entirely on those timestamps being correct and on the player aligning them properly. The classic failure is an ad insertion: the main programme and the ad were encoded by different systems with different timing assumptions, and the splice point introduces an offset the player cannot fix. This is why a stream that is perfectly in sync during the show drifts during the commercial. The detailed timestamp mechanics live in our companion article on PTS, DTS, and the elementary stream timestamp, and the streaming-specific failures in lip-sync in HLS and DASH: where it usually goes wrong.
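A rough diagnostic for a splice problem is to compare where the audio and video tracks start on either side of the splice, after converting each track's timestamps from its own timescale to seconds. The sketch below assumes you have already extracted the first timestamp and timescale of each track from your packager or player logs. The function names and sample numbers are hypothetical. A small constant skew can be normal, because audio frames and video frames have different durations; what signals trouble is a jump in the skew at the splice.

```python
from fractions import Fraction


def to_seconds(ticks: int, timescale: int) -> Fraction:
    """Convert a track timestamp to seconds using that track's timescale."""
    return Fraction(ticks, timescale)


def start_skew_ms(
    video_start: int, video_timescale: int,
    audio_start: int, audio_timescale: int,
) -> float:
    """Video start minus audio start, in ms.

    Positive = audio starts earlier than video (audio ahead);
    negative = audio starts later than video (audio behind).
    """
    skew = to_seconds(video_start, video_timescale) - to_seconds(audio_start, audio_timescale)
    return float(skew * 1000)


if __name__ == "__main__":
    # Hypothetical values: 90 kHz video timescale, 48 kHz audio timescale.
    before = start_skew_ms(900_000, 90_000, 480_000, 48_000)
    after = start_skew_ms(2_700_000, 90_000, 1_442_400, 48_000)
    print(f"skew before splice: {before:+.1f} ms")
    print(f"skew after splice:  {after:+.1f} ms")
    print(f"shift introduced at splice: {after - before:+.1f} ms")
```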
In real-time communication — WebRTC video calls — audio and video travel as independent RTP streams, each with its own clock, and the receiver must align them using the timing information in RTCP sender reports. A freshly joined participant may be briefly out of sync for the first second of a call, before the receiver has enough timing data to lock the two streams together. We cover this in RTP timestamps, RTCP sender reports, and NTP synchronization and the SFU-specific behaviour in lip-sync in WebRTC.
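The alignment step rests on one mapping: each RTCP sender report pairs an RTP timestamp with a wall-clock (NTP) time, so any later RTP timestamp on that stream can be converted to wall-clock time using the stream's clock rate. The sketch below shows that mapping only. The function name and sample values are hypothetical, NTP time is simplified to floating-point seconds, and the clock rates (48 kHz, typical for Opus audio, and 90 kHz for video) are assumptions about the session.

```python
def rtp_to_wallclock(
    rtp_ts: int, sr_rtp_ts: int, sr_ntp_seconds: float, clock_rate: int
) -> float:
    """Map an RTP timestamp to sender wall-clock seconds via a sender report."""
    # RTP timestamps are 32-bit and wrap, so take the signed difference mod 2**32.
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:
        delta -= 0x100000000
    return sr_ntp_seconds + delta / clock_rate


if __name__ == "__main__":
    # Hypothetical sender reports for one participant's two streams.
    audio_capture = rtp_to_wallclock(480_960, 480_000, 1000.000, 48_000)
    video_capture = rtp_to_wallclock(903_000, 900_000, 1000.010, 90_000)
    print(f"audio packet captured at {audio_capture:.4f} s")
    print(f"video frame captured at  {video_capture:.4f} s")
    print(f"video captured {1000 * (video_capture - audio_capture):+.1f} ms after audio")
```

With both streams on a common timeline, the receiver can delay whichever stream is ahead so that samples captured at the same instant are rendered together. Until the first sender report arrives for each stream, this mapping is not available, which is consistent with the brief start-of-call desynchronization described above.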
The perceptual window is the same in all three worlds — a viewer's brain does not care whether the offset came from a broadcast chain, an ad splice, or an RTP misalignment. BT.1359's +45 / −125 is the universal yardstick. What changes is how sync breaks and where you go to fix it. The window tells you whether you have a problem; the delivery-specific articles tell you why and how.
Where Fora Soft fits in
We have shipped lip-sync-critical audio in video conferencing, telemedicine, e-learning, OTT, and live streaming since 2005, and in every one of those verticals sync is the defect users notice first. In real-time products the hard part is aligning independent audio and video RTP streams under variable network jitter; in streaming products it is keeping separate audio and video renditions aligned across segment boundaries and ad inserts. The standards in this article are the yardstick we measure against during QA, and the asymmetric window is the slack we design buffers around — biasing toward slightly-late audio, where perception is most forgiving, rather than chasing a zero that does not exist. When a client reports an audio problem, the first thing we establish is whether the measured offset is actually outside BT.1359's window or merely something one sensitive reviewer can feel, because that single distinction decides whether there is a bug to fix at all.
What to read next
- PTS, DTS, and the Elementary Stream Timestamp
- RTP Timestamps, RTCP Sender Reports, and NTP Synchronization
- Lip-Sync Test Methodology: Clappers, Time-of-Flight, Perceptual Checks
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your lip-sync tolerance plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Lip-Sync Tolerance Cheat Sheet — One page: ITU-R BT.1359, EBU R37, ATSC IS-191 thresholds with the sign convention, plus the frame conversions at 24/25/30 fps.
References
- ITU-R BT.1359-1, "Relative timing of sound and vision for broadcasting," International Telecommunication Union, Recommendation approved 1998-11-30, status In Force (Main). The controlling perceptual standard. Detectability threshold +45 ms / −125 ms; acceptability threshold +90 ms / −185 ms (positive = sound ahead of vision). Read from the ITU recommendation page (free download). https://www.itu.int/rec/R-REC-BT.1359-1-199811-I/en — Tier 1 (official ITU-R Recommendation).
- EBU R 037, "The relative timing of the sound and vision components of a television signal," European Broadcasting Union, current revision (2007). Per-stage limit +5 ms / −15 ms; overall-to-transmitter limit +40 ms / −60 ms. https://tech.ebu.ch/publications/r037 — Tier 1 (official EBU Recommendation).
- ATSC IS-191, "Relative Timing of Sound and Vision for Broadcast Operations," ATSC Implementation Subcommittee Finding. At DTV encoder input: audio not to lead video by more than 15 ms, not to lag by more than 45 ms. — Tier 1 (official ATSC finding).
- N. F. Dixon and L. Spitz, "The Detection of Auditory Visual Desynchrony," Perception, vol. 9, no. 6, pp. 719–721, 1980. Source of the speech (lag 258 ms / lead 131 ms) and transient (lag 188 ms / lead 75 ms) detection thresholds and the 51 ms/s drift method. https://journals.sagepub.com/doi/10.1068/p090719 — Tier 5 (peer-reviewed primary study; primary source for the perception facts).
- H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746–748, 1976. The McGurk effect — involuntary audiovisual integration in speech perception; basis for why lip timing is perceptually load-bearing. https://www.nature.com/articles/264746a0 — Tier 5 (peer-reviewed primary study).
- ITU-R BT.1359-1 Table of Contents / normative text, ITU document server (Annex 1 user requirements; Appendix 1 explanation of recommended values; Appendix 2 subjective assessment conditions). https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.1359-1-199811-I!!TOC-TXT-E.txt — Tier 1 (official ITU-R document structure).
- TV Technology, "Managing lip sync" and "A/V Synchronization: How Bad Is Bad?" Practitioner articles confirming the BT.1359 / ATSC IS-191 numbers as used in broadcast operations. https://www.tvtechnology.com/opinions/managing-lip-sync-265013 — Tier 4 (practitioner; used for operational context, subordinate to the standards where they differ).
- Audio-to-video synchronization, Wikipedia. Orientation only — used to locate primary sources, never as the source of truth for any standard number. https://en.wikipedia.org/wiki/Audio-to-video_synchronization — Tier 6 (orientation only).
- A. Vatakis and C. Spence, "Audiovisual synchrony perception for speech and music assessed using a temporal order judgment task," Neuroscience Letters, 2006. Confirms content-dependence of asynchrony detection (speech vs music vs object actions), supporting the transient-vs-speech distinction. https://www.sciencedirect.com/science/article/abs/pii/S030439400501092X. Tier 5 (peer-reviewed).
Note on a corrected discrepancy: many popular articles quote lip-sync tolerance as a single symmetric figure (e.g., "±40 ms" or "within one frame"). This article follows the BT.1359-1 normative text, which defines an asymmetric window (+45 / −125 ms detectability), and flags the symmetric framing as an oversimplification that discards the standard's most useful property.


