Why This Matters
Most decisions in a streaming product flow downstream from one number: the latency budget. If you give engineers 20 seconds, they ship classic HLS over any CDN at the lowest cost per viewer. If you give them 3 seconds, they ship low-latency HLS or DASH with a tuned player and a CDN that supports chunked transfer. If you give them 500 milliseconds, they ship WebRTC and accept the bill that comes with it. We wrote this for the smart non-technical reader first — the product manager scoping a live shopping app, the founder briefing engineers, the operator reading a vendor pitch — and tried to make every fact accurate enough that a streaming engineer can still hand it to a colleague. By the last paragraph you will be able to spot the contributor that any specific number is hiding behind, and to call out the protocol that any given target latency actually requires.
The single sentence definition
Streaming latency is the elapsed time between the moment a frame leaves the lens of the camera and the moment that same frame is rendered on the viewer's screen. Industry shorthand calls this glass-to-glass, because the journey starts at one piece of glass (the camera lens) and ends at another (the display). The Streaming Video Technology Alliance and Mux both use the same definition; so do Dolby OptiView, Cloudflare, and the engineers at Apple who wrote the HLS Authoring Specification.
A second term you will hear is end-to-end latency. It means the same thing in a streaming context. Some authors use capture-to-display or camera-to-screen; same meaning again. We use glass-to-glass throughout this article because it is the most common and the least ambiguous.
A word of caution. Vendors sometimes quote protocol latency (only the seconds the protocol's player adds), contribution latency (only the seconds from camera to origin), or first-mile latency (only the seconds from camera to ingest). None of those is the glass-to-glass number. When a pitch says "our protocol is 1.5 seconds" without a qualifier, the safe interpretation is that they measured the contributor that flatters their product, not the number a viewer would experience.
The seven contributors to the budget
Glass-to-glass latency is an addition problem. You start with the camera and finish at the display, and you sum the time each stage holds the frame before passing it on. There are seven stages worth naming. The numbers below are realistic 2026 ranges measured in production by Bitmovin, Mux, Wowza, Cloudflare, and AWS Elemental; we will explain each contributor's lower and upper bound in the sections that follow.
| Stage | Typical 2026 range | What it is |
|---|---|---|
| Capture & ingest buffer | 10 – 200 ms | Camera sensor exposure, image-signal-processor pipeline, audio capture, USB or HDMI grabber |
| Encoder | 50 – 400 ms | Compresses raw video into H.264, HEVC, AV1, or VP9; lookahead and B-frame settings dominate |
| Packager (segmenter) | 50 ms – 4 s | Cuts the encoded bitstream into segments or CMAF chunks; the single biggest variable |
| Contribution network | 20 – 200 ms | Round-trip from encoder to origin; depends on geography and protocol (RTMP, SRT, RIST, WHIP) |
| CDN (origin to edge) | 20 – 200 ms | Propagation through origin shield and intermediate caches to the edge node serving the viewer |
| Player buffer | 0.1 s – 30 s | The safety reserve the player keeps ahead of the playhead; almost always the largest term |
| Decoder + display | 30 – 100 ms | Hardware decode, frame queue, panel refresh (16.7 ms at 60 Hz, 8.3 ms at 120 Hz) |
Capture and ingest buffer (10 – 200 ms)
A camera does not produce a digital frame the instant light hits the lens. The sensor needs to expose, the image-signal processor needs to debayer and white-balance, and the device needs to copy the result into a memory buffer the encoder can read. A modern broadcast camera adds 20 to 80 milliseconds. A consumer webcam or a phone camera can add 50 to 200 milliseconds. An action camera with on-sensor compression sits below 20 milliseconds. The audio chain runs in parallel; if it is even one frame out of step with the video, you get lip-sync problems, which we treat as a separate topic.
Encoder (50 – 400 ms)
The encoder turns the raw video into a compressed bitstream. Two settings dominate its latency contribution.
The first is lookahead: how many frames the encoder examines before deciding how to compress the current one. Higher lookahead produces better quality at the same bitrate but costs latency. A typical broadcast encoder uses 10 to 25 frames of lookahead, which at 30 frames per second adds roughly 330 to 830 milliseconds of pure encoder delay. A low-latency encoder profile cuts lookahead to 0 to 5 frames and accepts a small quality penalty in exchange for saving hundreds of milliseconds of latency.
The second is B-frame depth. B-frames are predicted from both past and future frames, so the encoder has to wait for the future frame to arrive before it can emit the B-frame. A GOP (group of pictures) structure like IBBPBBPBBP adds two frames of delay just by existing. Low-latency encoder profiles use IPPP with no B-frames at all, removing this term entirely.
The math out loud: at 30 frames per second, one frame is 33.3 ms. An IBBPBBP GOP holding the encoder back by 2 frames adds 33.3 × 2 = 66.7 ms. Adding 10 frames of lookahead on top adds 33.3 × 10 = 333 ms. Total encoder delay: ≈ 400 ms.
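The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not an encoder model: the function name and parameters are our own, and it counts only the frames the encoder holds back (lookahead plus B-frame reordering), not the compute time of the encode itself.

```python
def encoder_delay_ms(fps: float, lookahead_frames: int, bframe_delay_frames: int) -> float:
    """Delay caused by frames the encoder must wait for, in milliseconds.

    Hypothetical helper: models hold-back only, not encode compute time.
    """
    frame_ms = 1000.0 / fps  # duration of one frame
    return frame_ms * (lookahead_frames + bframe_delay_frames)


# Broadcast-style profile from the text: 10 frames of lookahead, IBBP GOP (2 frames).
broadcast = encoder_delay_ms(fps=30, lookahead_frames=10, bframe_delay_frames=2)

# Low-latency profile: no lookahead, IPPP GOP (no B-frames).
low_latency = encoder_delay_ms(fps=30, lookahead_frames=0, bframe_delay_frames=0)

print(f"broadcast profile:   {broadcast:.0f} ms")
print(f"low-latency profile: {low_latency:.0f} ms")
```

Doubling the frame rate halves every term, which is one reason 60 fps feeds tolerate deeper lookahead for the same latency cost.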
Packager (50 ms – 4 s)
The packager takes the encoder's compressed bitstream and cuts it into the units the delivery protocol expects: HLS or DASH segments of 2 to 10 seconds; CMAF chunks of 200 to 500 milliseconds; or, for WebRTC, individual RTP packets that are not "packaged" at all.
This is the largest controllable contributor and the one engineers actually fight over. A classic HLS configuration uses 6-second segments, and the packager cannot emit a segment until the encoder has produced six full seconds of video. That single decision adds up to 6 seconds of latency before a single byte leaves the packager. Low-latency HLS and low-latency DASH solve this by emitting partial segments (HLS) or chunked CMAF (DASH) of around 200 to 500 milliseconds, so the packager forwards data as soon as it has a chunk's worth, not a segment's worth. RFC 8216 (the original HLS specification, published 2017) defined only full-segment HLS; the extension to partial segments is defined in Apple's HLS Authoring Specification and in draft-pantos-hls-rfc8216bis-22 (May 2026) via EXT-X-PART-INF and PART-HOLD-BACK. ISO/IEC 23009-1:2022 (DASH) and DASH-IF's "Low-Latency Modes for DASH" guideline define the analogous chunked-CMAF mechanism for MPEG-DASH.
WebRTC does not have a packager in this sense. Each compressed frame is split into Real-time Transport Protocol (RTP) packets and pushed to the network the moment the encoder finishes. There is no segment boundary, so there is no segment-derived delay; that is the single largest reason WebRTC reaches sub-second latency where HLS cannot.
Contribution network (20 – 200 ms)
The contribution network is the leg from encoder to origin server. It is what RTMP, SRT, RIST, WHIP, and WebTransport are doing when they push a stream from a publisher into the streaming platform. We cover this leg in detail in the dedicated articles on push vs pull, contribution vs distribution and on each ingest protocol. The relevant fact for our latency budget: a healthy contribution leg adds the network round-trip plus the protocol's own acknowledgement window — typically 50 to 150 ms across the public internet, lower on a dedicated link. SRT and RIST add a tuneable error-recovery window on top, usually 50 to 250 ms.
CDN (20 – 200 ms)
The content delivery network, the chain of servers that bring the stream close to each viewer, is, surprisingly, rarely the bottleneck. A well-configured CDN with origin shielding and tiered caching adds 20 to 60 milliseconds for a warm edge and 100 to 200 milliseconds for a cold-edge miss. The bigger problem with CDNs and latency is not propagation time but support for chunked transfer encoding. A CDN that cannot forward a partial HTTP response to the player as soon as the origin emits a chunk cannot deliver low-latency HLS or low-latency DASH at all, regardless of how fast its links are. We will return to this in the article on CDN for the streaming engineer.
Player buffer (0.1 s – 30 s)
This is the term that swallows everything else. The player buffer is the safety reserve the player keeps ahead of the current playhead, so it can ride out the next network burp, the next CDN cache miss, the next viewer stepping into an elevator. Long buffer means safe playback; short buffer means current playback. You cannot have both.
Classic HLS recommends a player buffer of three segments. With 6-second segments, that is 18 seconds. With 4-second segments, 12 seconds. Apple's HLS Authoring Specification controls this via the HOLD-BACK attribute on EXT-X-SERVER-CONTROL, with a default of three times the target duration. Low-latency HLS uses PART-HOLD-BACK instead, which Apple defines as "at least twice the Part Target Duration and SHOULD be at least three times the Part Target Duration" — so with 333-millisecond parts, the holdback floor is 666 ms to 1 s. DASH's MPD@suggestedPresentationDelay plays the same role; DASH-IF's Restricted Timing Model (snapshot 24 October 2024) defines it as the presentation delay that "decreases the time shift buffer by moving its end point into the past, creating an effective time shift buffer with a reduced duration".
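The hold-back rules quoted above reduce to two multiplications. The sketch below uses hypothetical function names of our own; the multipliers (three segments for classic HLS, two and three parts for PART-HOLD-BACK) are the ones stated in this section.

```python
def classic_hold_back_s(target_duration_s: float, segments: int = 3) -> float:
    """Classic HLS hold-back: a whole number of target durations (default three)."""
    return segments * target_duration_s


def part_hold_back_range_s(part_target_s: float) -> tuple[float, float]:
    """LL-HLS PART-HOLD-BACK as (required minimum, recommended minimum) in seconds."""
    return 2 * part_target_s, 3 * part_target_s


for segment_s in (6, 4):
    print(f"{segment_s}-s segments: hold back {classic_hold_back_s(segment_s):.0f} s")

required, recommended = part_hold_back_range_s(0.333)
print(f"333-ms parts: hold back {required:.3f} s (minimum) to {recommended:.3f} s (recommended)")
```

The comparison makes the scale difference visible: the same rule of thumb applied to parts instead of segments moves the player buffer from tens of seconds to about one second.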
WebRTC's player buffer is not a buffer at all in the HLS sense; it is a pair of jitter buffers, NetEQ for audio and a separate video jitter buffer, typically 50 to 200 milliseconds deep and tunable in browsers via RTCRtpReceiver.jitterBufferTarget from the W3C WebRTC-extensions draft. That is two orders of magnitude smaller than HLS's player buffer, and it is the second-largest reason WebRTC reaches sub-second latency.
Decoder and display (30 – 100 ms)
The decoder pulls the compressed bytes out of the player buffer, decompresses them back into raw frames, and queues them for the display. Hardware decoders on a phone or a smart TV add 1 to 3 frames of pipeline; software decoders on a browser add a few milliseconds more. The display itself draws the frame at its refresh rate — 16.7 ms per refresh at 60 Hz, 8.3 ms at 120 Hz — and on consumer TVs adds a panel-side processing delay that game-mode disables but movie-mode does not. Total for this stage on a typical phone or laptop in 2026: 30 to 60 ms. Total on a smart TV in default mode: 60 to 100 ms.
A worked example: 25 seconds becomes 0.4
Pick a configuration and add up the seven contributors. Classic HLS first, then WebRTC, on a 1080p sports feed travelling from a stadium in London to a viewer in Berlin.
Classic HLS, 6-second segments:

| Stage | Latency | Note |
|---|---|---|
| Camera & ISP | 80 ms | |
| H.264 encoder, 10-frame lookahead | 333 ms | |
| Packager, 6-s segment | 6,000 ms | Waits for one segment |
| Contribution (RTMP/SRT) | 100 ms | |
| CDN warm edge | 40 ms | |
| Player buffer, 3 × 6 s | 18,000 ms | Three-segment safety |
| Decoder + 60 Hz display | 50 ms | |
| Total glass-to-glass | ≈ 24.6 s | |
WebRTC, default jitter buffer:

| Stage | Latency |
|---|---|
| Camera & ISP | 80 ms |
| H.264 encoder, no lookahead | 100 ms |
| No packager (RTP packets) | 0 ms |
| Contribution (RTP/SRTP) | 80 ms |
| CDN: SFU forward | 30 ms |
| Jitter buffer | 100 ms |
| Decoder + 60 Hz display | 30 ms |
| Total glass-to-glass | ≈ 0.42 s |
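
The two budgets above can be reproduced with a short Python sketch. The dictionary keys and helper names are our own labels for the seven stages; the millisecond values are copied from the worked example.

```python
# Per-stage latency in milliseconds, copied from the worked example above.
CLASSIC_HLS_MS = {
    "capture": 80,
    "encoder": 333,
    "packager": 6000,
    "contribution": 100,
    "cdn": 40,
    "player_buffer": 18000,
    "decode_display": 50,
}

WEBRTC_MS = {
    "capture": 80,
    "encoder": 100,
    "packager": 0,
    "contribution": 80,
    "cdn": 30,
    "player_buffer": 100,  # the jitter buffer plays this role in WebRTC
    "decode_display": 30,
}


def glass_to_glass_s(budget_ms: dict[str, int]) -> float:
    """Glass-to-glass latency is the plain sum of the stage delays."""
    return sum(budget_ms.values()) / 1000.0


def dominant_stages(budget_ms: dict[str, int], top: int = 2) -> list[str]:
    """Names of the stages that contribute the most, largest first."""
    return sorted(budget_ms, key=budget_ms.get, reverse=True)[:top]


hls_s = glass_to_glass_s(CLASSIC_HLS_MS)
webrtc_s = glass_to_glass_s(WEBRTC_MS)

print(f"Classic HLS: {hls_s:.1f} s, dominated by {dominant_stages(CLASSIC_HLS_MS)}")
print(f"WebRTC:      {webrtc_s:.2f} s")
print(f"Ratio:       {hls_s / webrtc_s:.0f}x")
```

Keeping the budget as a per-stage table rather than a single number makes it easy to swap one stage (for example, a 2-second segment) and see how the total moves.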
The 60-fold gap between these two numbers has almost nothing to do with the network and almost everything to do with two decisions: how the publisher packages the stream, and how much buffer the player holds. The encoder, contribution, CDN, and decoder contributions are within a few hundred milliseconds of each other in both stacks. The two terms that dominate are packager and player buffer — and both are protocol choices, not technology limits.
What "low latency" actually means
The industry uses the word low with three different floors. Knowing which one a vendor means is the difference between a working pitch and a wasted quarter.
Reduced latency — 5 to 10 seconds. The first wave of HLS and DASH tuning: shorter segments (2 to 4 seconds), smaller buffer (2 to 3 segments), no encoder lookahead. Achievable without changing the protocol family. Used by most OTT live sports and live news in 2026.
Low latency — 2 to 5 seconds. The chunked-CMAF, partial-segment regime: LL-HLS (Apple HLS Authoring Specification), LL-DASH (DASH-IF Low-Latency Modes), High-Efficiency Streaming Protocol (HESP). Requires a CDN that supports HTTP chunked transfer all the way to the player, an origin that emits partial segments, and a player that knows what to do with them. Used by Twitch's "Low Latency" mode, by Mux's low-latency stack, and by most live betting and live shopping in 2026.
Ultra-low / real-time latency — under 1 second. The WebRTC regime, plus Media over QUIC (still drafting in draft-ietf-moq-transport-NN as of 2026), plus HESP for some operators. Real-time latency is qualitatively different from low latency: a viewer can talk back, click, vote, or trade on what they see. The cost per viewer is also qualitatively different — typically 2 to 10 times the dollar-per-minute of LL-HLS at the same audience size.
Common pitfall. Vendors love to quote the lowest latency tier their stack can hit, not the latency tier their typical customer ships in production. Twitch's "Low Latency" mode is 2 to 5 seconds on the receiver side, but the contributing streamer's RTMP leg adds 2 to 5 seconds on top, so the actual glass-to-glass for a chat watcher is closer to 7. When you read a number, ask which legs it includes.
How to measure the number
You cannot tune what you cannot measure. There are four methods worth knowing, ordered from easy and inaccurate to hard and rigorous.
The stopwatch-in-frame method. Point the camera at a phone running a 1-millisecond stopwatch, then take a single photograph that captures both the phone (in the foreground) and a monitor showing the streamed image of the same phone (in the background). The difference between the two displayed times is the glass-to-glass latency. Fast, cheap, accurate to ±1 frame for a single point measurement, but unrepresentative of typical viewer experience because it samples once.
Timecode-in-pixel method. Have the publisher burn a SMPTE timecode (or a simple monotonic counter rendered as digits) into the top-left corner of every frame. On the receiver side, OCR the burnt-in code from the displayed frame and compare it to the current wall-clock time. Repeatable, scriptable, accurate to one frame.
LED-and-photodetector method. Pulse an LED in front of the camera lens. Wire a photodetector to the receiver's display. Trigger an oscilloscope on both. Read the gap. Microsecond-accurate; used by RidgeRun and FPV manufacturers to benchmark their grabbers. Overkill for a streaming product; useful when contractually obligated to prove a number.
Producer-reference-time (prft) injection. For HLS and DASH stacks, the encoder can inject an ISO/IEC 14496-12 prft (Producer Reference Time) box into every CMAF chunk, and the player can compare it to the wall clock at render. The DASH-IF Low-Latency Modes guideline recommends this approach for production telemetry. Continuous, automatic, and the basis of the per-session latency numbers Mux Data and Conviva report.
For most teams, timecode-in-pixel is the right answer. It is accurate enough to debug a configuration, simple enough to script, and produces a single number per measurement that can be charted over time.
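A minimal sketch of the arithmetic behind the timecode-in-pixel method follows. It makes several assumptions that are ours, not part of any standard workflow: the burnt-in code is a non-drop-frame time-of-day timecode in HH:MM:SS:FF form, the publisher and receiver clocks are both synchronised (for example via NTP), and the OCR step has already produced the timecode string. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def timecode_to_seconds(timecode: str, fps: int) -> float:
    """Convert a non-drop-frame 'HH:MM:SS:FF' timecode to seconds since midnight."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    if not 0 <= frames < fps:
        raise ValueError(f"frame number {frames} is out of range for {fps} fps")
    return hours * 3600 + minutes * 60 + seconds + frames / fps


def measured_latency_s(burnt_in_timecode: str, receiver_time_of_day_s: float, fps: int) -> float:
    """Glass-to-glass latency: receiver wall clock minus the time burnt into the frame.

    The modulo keeps the result correct when a measurement straddles midnight.
    """
    captured_at_s = timecode_to_seconds(burnt_in_timecode, fps)
    return (receiver_time_of_day_s - captured_at_s) % SECONDS_PER_DAY


# Frame stamped 14:03:07 plus 15 frames at 30 fps, displayed 2.6 s later.
latency = measured_latency_s("14:03:07:15", receiver_time_of_day_s=50590.1, fps=30)
print(f"measured glass-to-glass: {latency:.2f} s")
```

The resolution of this method is one frame, because the burnt-in code only changes once per frame; any clock offset between publisher and receiver adds directly to the error.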
Latency vs the other three things it gets confused with
Three words sound like latency but are not it. Mixing them up is the most common mistake we see in product-team Slack threads.
Start-up time is the gap between the viewer clicking play and the first frame appearing. It is not the same as latency. A stream can have 25-second glass-to-glass latency but a 2-second start-up time (DASH and HLS), or it can have 0.4-second latency and a 4-second start-up (WebRTC negotiating ICE candidates). The two numbers move independently; both matter to viewers but for different reasons. We discuss start-up time alongside other quality-of-experience metrics in the article on QoE metrics for streaming.
Rebuffer ratio is the fraction of total watch time the viewer spent looking at a spinner. It is the consequence of running the player buffer too short for the network conditions. Cutting latency without raising the bitrate ladder's robustness almost always raises rebuffer ratio; the two numbers are on a curve, not on a slider. Vendors who quote a low-latency number without a rebuffer-ratio number alongside it are quoting half the story.
Jitter is the variation in the time between successive packets arriving at the player. High jitter forces the player to keep a larger buffer, which costs latency. So jitter is an input to the latency budget, not a synonym for it. We cover jitter in detail in the article on bandwidth, throughput, jitter, packet loss: the network reality.
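To make the definition concrete, here is a sketch of one widely used jitter estimator, the interarrival jitter from RFC 3550 (the RTP specification). This is our choice of estimator for illustration and is not taken from this article; the sample timestamps are invented.

```python
def interarrival_jitter_ms(send_times_ms: list[float], arrival_times_ms: list[float]) -> float:
    """Running jitter estimate in the style of RFC 3550.

    For each consecutive packet pair, compare the spacing at arrival with the
    spacing at send time, then smooth the absolute difference with gain 1/16.
    """
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        arrival_gap = arrival_times_ms[i] - arrival_times_ms[i - 1]
        send_gap = send_times_ms[i] - send_times_ms[i - 1]
        deviation = abs(arrival_gap - send_gap)
        jitter += (deviation - jitter) / 16.0
    return jitter


sent = [0.0, 20.0, 40.0, 60.0]        # packets sent every 20 ms
steady = [50.0, 70.0, 90.0, 110.0]    # constant 50 ms delay: no jitter
bursty = [50.0, 75.0, 88.0, 112.0]    # same average delay, uneven arrival

print(f"steady path: {interarrival_jitter_ms(sent, steady):.3f} ms")
print(f"bursty path: {interarrival_jitter_ms(sent, bursty):.3f} ms")
```

Note that the steady path has 50 ms of latency and zero jitter, which is exactly the distinction drawn above: a constant delay costs latency once, while a varying delay forces the player to hold extra buffer.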
The 2026 protocol latency table
If you only take one thing from this article, take this table. It maps the latency floor each protocol family hits in real production, plus the conditions required to hit it.
| Protocol family | Glass-to-glass floor | Glass-to-glass typical | Required conditions |
|---|---|---|---|
| HLS (RFC 8216, 6-s segments) | 18 s | 20–30 s | Defaults only; any CDN |
| Classic DASH (ISO/IEC 23009-1) | 12 s | 15–25 s | 2–4 s segments; any CDN |
| LL-HLS (Apple HLS Authoring Spec) | 1.5 s | 2–5 s | Chunked transfer in CDN; partial segments; tuned player |
| LL-DASH (DASH-IF Low-Latency Modes) | 1.5 s | 2–5 s | Chunked CMAF; chunked transfer in CDN; tuned player |
| HESP (draft-theo-hesp-NN) | 0.4 s | 0.6–1.2 s | HESP-aware origin and player |
| WebRTC (W3C CR + RFC 8825–8866) | 0.2 s | 0.3–0.8 s | SFU or P2P topology; tuned jitter buffer |
| Media over QUIC (draft-ietf-moq-transport-NN) | 0.2 s | 0.4–1 s | MoQ-capable relay and player; work in progress as of May 2026 |
| SRT (draft-sharabayko-srt-NN) | 0.25 s | 0.5–2 s | Contribution only; not a delivery protocol to consumer devices |
| RIST (SMPTE TR-06-1/2/3) | 0.25 s | 0.5–2 s | Same as SRT |
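
One way to use the table is as a lookup from a target latency to the protocol families that can plausibly reach it. The sketch below is a hypothetical helper of our own: it encodes the floor and typical columns for the delivery families only (SRT and RIST are contribution protocols) and reports, for each family whose floor meets the target, whether the target also falls within its typical production range.

```python
from typing import NamedTuple


class Family(NamedTuple):
    name: str
    floor_s: float         # best case seen in production
    typical_low_s: float   # lower end of the typical range
    typical_high_s: float  # upper end of the typical range


# Values copied from the table above; delivery protocols only.
DELIVERY_FAMILIES = [
    Family("HLS (6-s segments)", 18.0, 20.0, 30.0),
    Family("Classic DASH", 12.0, 15.0, 25.0),
    Family("LL-HLS", 1.5, 2.0, 5.0),
    Family("LL-DASH", 1.5, 2.0, 5.0),
    Family("HESP", 0.4, 0.6, 1.2),
    Family("WebRTC", 0.2, 0.3, 0.8),
    Family("Media over QUIC", 0.2, 0.4, 1.0),
]


def candidates(target_s: float) -> list[tuple[str, bool]]:
    """Families whose floor meets the target, flagged True when the target
    is also reachable within the family's typical range."""
    return [
        (family.name, family.typical_low_s <= target_s)
        for family in DELIVERY_FAMILIES
        if family.floor_s <= target_s
    ]


for target in (20.0, 3.0, 0.5):
    print(f"target {target} s:")
    for name, typical in candidates(target):
        print(f"  {name}" + ("" if typical else " (only at its floor)"))
```

A family flagged as reachable only at its floor is a warning sign: the target is possible on paper but below what that stack typically delivers in production.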
Where Fora Soft fits in
We have built live and near-live video for OTT, live commerce, telemedicine, e-learning, video surveillance, and AR/VR clients since 2005. Latency is the design constraint our streaming engineers fight over first: we have shipped LL-HLS broadcast stacks for OTT operators with chunked-CMAF origins and tuned hls.js / Shaka players, WebRTC stacks for telemedicine and live shopping with mediasoup and LiveKit SFUs, and SRT-and-RIST contribution legs for broadcast partners. The latency budget in each case is a written document — capture, encoder, packager, contribution, CDN, player, decoder — agreed before any code is written, with a target glass-to-glass number and a measured-on-staging confirmation before launch.
What to read next
- Live vs VOD vs near-live: three problems that look the same but are different
- Bandwidth, throughput, jitter, packet loss: the network reality
- HLS in depth: m3u8, segments, multi-variant playlists
Talk to us
- Talk to a streaming engineer — book a 30-minute scoping call with the Fora Soft streaming team.
- See our case studies — forasoft.com/projects.
- Download the latency budget cheat sheet — ./downloads/latency-budget-cheat-sheet.pdf — a one-page reference of the seven contributors, with the 2026 protocol floors and the worked example from this article.
References
- IETF RFC 8216, HTTP Live Streaming (R. Pantos, W. May, August 2017). The original HLS specification. Defines the playlist format, EXT-X-TARGETDURATION, segment semantics, and the original three-segment hold-back rule. Cited for the 18-second classic-HLS player-buffer floor.
- IETF draft-pantos-hls-rfc8216bis-22, HTTP Live Streaming 2nd Edition (R. Pantos et al., 1 May 2026). The in-progress successor to RFC 8216. Defines EXT-X-SERVER-CONTROL, HOLD-BACK, PART-HOLD-BACK, CAN-SKIP-UNTIL, and the low-latency extensions. Internet-Draft; subject to change before RFC publication.
- Apple Inc., HTTP Live Streaming (HLS) Authoring Specification for Apple devices, revision 2025-09. Apple's normative additions on top of the RFC for the Apple ecosystem, including the LL-HLS partial-segment regime and the PART-HOLD-BACK floor of two-to-three times the Part Target Duration.
- ISO/IEC 23009-1:2022, Information technology: Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats. Defines MPD@suggestedPresentationDelay and the segment-availability timing model. Cited for the classic-DASH player-buffer floor.
- DASH Industry Forum, DASH-IF implementation guidelines: restricted timing model, snapshot 24 October 2024. The authoritative explanation of presentation delay, time shift buffer, and the relationship between MPD@suggestedPresentationDelay and end-to-end latency. We rely on this for the DASH presentation-delay definition; the document explicitly notes that MPD@minBufferTime is not a latency knob.
- DASH Industry Forum, Low-Latency Modes for DASH implementation guideline. Defines the ServiceDescription element's Latency target/min/max, the ProducerReferenceTime (prft) box for telemetry, and the chunked-CMAF emission model. Referenced for the LL-DASH 1.5-second floor.
- W3C, WebRTC: Real-Time Communication Between Browsers (Candidate Recommendation Snapshot, latest revision). Specifies the receiver-side jitter buffer model. W3C, WebRTC Extensions (Working Draft) defines RTCRtpReceiver.jitterBufferTarget, the standardised knob for trading latency against safety.
- IETF RFC 8825–8866 family (WebRTC), notably RFC 8825 (Overview) and RFC 8829 (JSEP), together with RFC 5764 (DTLS-SRTP). The transport-layer foundation for WebRTC; cited for the contribution and jitter-buffer ranges in the WebRTC budget.
- The Chromium Project, NetEQ audio jitter buffer design document (current revision, 2024). Documents the adaptive-jitter-buffer algorithm that underlies the audio half of WebRTC's player-side latency contribution.
- Mux, Low-Latency Video Streaming: A Complete Guide (continuously updated). One of the strongest open-internet references on the seven contributors; we used the production-range numbers as a cross-check against our own measurements. Where Mux's numbers differ from the spec's, the spec wins (per our source hierarchy).
- Streaming Video Technology Alliance, university resource page on RFC 8216 and the low-latency landscape. Used for the latency-tier definitions (reduced, low, ultra-low).
- Apple Inc., What's new in HTTP Live Streaming, WWDC 2025 session document (Apple Developer). Cited for the September 2025 revision of the HLS Authoring Specification and for the history of LL-HLS revisions, including the removal of HTTP/2 push from LL-HLS in 2020.
Annotation on disagreement: some popular vendor blogs describe LL-HLS as requiring HTTP/2 server push. Apple dropped the HTTP/2 push requirement from LL-HLS in its 2020 revision of the protocol; current LL-HLS uses blocking playlist reload plus preload hints. The article follows the spec (Apple HLS Authoring Specification revision 2025-09) and notes the discrepancy.


