Why this matters
You can ship a perfect encode and still deliver a broken experience, and the number on your dashboard will not warn you. A full-reference metric like VMAF grades the pixels of a rendition; it never sees the four-second spinner, the resolution that popped soft mid-scene, or the block of garbage that flashed when a packet dropped — because none of those live in the file it scored. This article is for a streaming or encoding lead, a player or WebRTC engineer, or a QA engineer who has watched a stream stall, oscillate, or tear and needs to know exactly what happened, why it happened downstream of the encoder, and which measurements actually catch it. Get this right and you stop letting a file-based picture score certify an experience it never observed — and you start measuring the session the viewer actually got.
A different class of artifact
Every other article in this gallery — blocking, banding, ringing, blur, color bleeding — describes damage the encoder did when it threw away data to hit a bitrate. The artifacts here are different in origin: the encode can be flawless and the viewer still gets a bad experience, because the damage happens after the file leaves the encoder, in the delivery path and the player. That single fact changes how you measure them.
There are three to name, and they split cleanly by what kind of transport carried the bits.
Switching is the visible change when a player jumps between quality levels. Freezing — more precisely, stalling or rebuffering — is when playback stops while the player waits for data, the spinner the viewer hates most. Tiling is blocky corruption: regions of the frame replaced by garbage or by stale content because a piece of the bitstream never arrived intact. The first two are what network trouble looks like when the transport is reliable; the third is what it looks like when the transport is not.
Two delivery regimes decide which artifact you get
To understand these artifacts you have to understand one fork in the road: whether the bits travel over a reliable transport that re-sends what gets lost, or an unreliable one that does not. The choice decides whether a network glitch becomes a stall or a smear.
Reliable transport, meaning the Transmission Control Protocol (TCP) or the newer QUIC, guarantees delivery. Received data is acknowledged, and anything lost is retransmitted before the data is handed to the decoder (see, e.g., Ant Media, 2026). HTTP Adaptive Streaming (HLS and MPEG-DASH, the technology behind almost all video-on-demand and most live OTT) runs over exactly this. So a lost packet on an HLS stream never shows up as visible corruption: the decoder only ever receives a complete, correct bitstream. The damage shows up a different way. Retransmission takes time, and the transport's congestion control slows the send rate when it detects loss, so the bytes arrive late. If they arrive later than the player's buffer can cover, one of two things happens: the player drops to a lower-bitrate rendition to keep up (a switch), or the buffer empties and playback halts (a freeze). Reliable transport converts packet loss into time problems, not pixel problems.
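The stall-or-switch mechanism described above can be sketched as a toy buffer model. This is a minimal simulation under stated assumptions, not any real player's ABR logic: the function name `simulate_buffer`, the one-second time step, the 2-second startup threshold, and the throughput trace are all hypothetical.

```python
def simulate_buffer(throughput_kbps, bitrate_kbps, startup_threshold_s=2.0):
    """Toy playback-buffer model with a one-second time step (illustrative only).

    Each second the network delivers throughput / bitrate seconds of media.
    Playback begins once the buffer reaches the startup threshold, then drains
    one second of media per second and stalls whenever less than one second
    of media is buffered.
    Returns one state per second: "startup", "play", or "stall".
    """
    buffer_s = 0.0
    started = False
    states = []
    for kbps in throughput_kbps:
        buffer_s += kbps / bitrate_kbps  # seconds of media downloaded this tick
        if not started and buffer_s < startup_threshold_s:
            states.append("startup")
            continue
        started = True
        if buffer_s >= 1.0:
            buffer_s -= 1.0
            states.append("play")
        else:
            states.append("stall")
    return states


# Hypothetical trace: a healthy network, a six-second dip, then recovery.
trace = [4000, 4000, 8000, 8000, 8000] + [1000] * 6 + [8000, 8000]

high = simulate_buffer(trace, bitrate_kbps=4000)  # stay on the top rung
low = simulate_buffer(trace, bitrate_kbps=1000)   # play a lower rung instead
print(high.count("stall"), low.count("stall"))    # prints: 1 0
```

On the same trace, the 4000 kbps rendition drains its buffer during the dip and stalls for one second, while the 1000 kbps rendition never stalls. That is the trade an ABR algorithm makes when it chooses a switch over a freeze.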
Unreliable transport, meaning the User Datagram Protocol (UDP), usually carrying the Real-time Transport Protocol (RTP), does not retransmit on its own, because waiting for a re-send would blow the latency budget. WebRTC conferencing, low-latency contribution feeds, and traditional broadcast/IPTV all use it. Real-time stacks can layer selective recovery such as NACK or FEC on top of RTP, but only as far as the latency budget allows. When recovery fails, the lost packet is gone, and if the decoder cannot conceal it, you see the loss directly as tiling and corruption. The trade is deliberate: real-time media accepts the occasional visible glitch in exchange for the sub-second latency that makes a conversation feel natural.
Figure 1. The fork that decides the artifact. Over reliable transport (TCP/QUIC, under HLS/DASH), lost packets are retransmitted, so the decoder always gets a clean bitstream — and network trouble becomes a rebuffering freeze or an ABR switch. Over unreliable transport (UDP/RTP, under WebRTC and broadcast), there is no retransmission, so an unconcealed loss becomes visible tiling that propagates until the next keyframe.
This split is the most useful thing to hold in your head, because it tells you which artifact to expect from which product. A VOD app stalls and switches; it does not tile. A video call tiles and freezes; it rarely "rebuffers" in the VOD sense. Measure accordingly.
Freezing: the artifact with no wrong pixels
Start with the one that breaks metrics most completely. A stall (also called rebuffering or a freeze) is the event where playback stops because the player's buffer ran dry, and the last frame sits frozen on screen (often under a spinner) until enough data arrives to resume. Startup delay is the one stall every session can incur: the initial buffering before the first frame, also called time-to-first-frame.
Here is why it defeats a picture metric. During a freeze, every frame on screen is a correct, fully-decoded frame — it is simply the same correct frame, held too long. A full-reference metric like PSNR, SSIM, or VMAF compares decoded frames to the reference and finds nothing wrong, because nothing is wrong with the pixels. The fault is in time, not in the image, and a per-frame picture score has no axis for time. As the standardized streaming model's own documentation puts it, its session integration models stalling and quality fluctuation "especially in comparison to other video-quality only models (e.g., VMAF)" (AVEQ, ITU-T P.1203 overview, 2025). The picture metric is not wrong; it is answering a different question.
Freezing is also the artifact with the hardest business numbers attached, which is why it gets measured even when nothing else does. The canonical study is Krishnan and Sitaraman's analysis of 23 million views from 6.7 million viewers on Akamai's network (Internet Measurement Conference, 2012). Two findings have anchored rebuffering budgets ever since:
- A viewer who suffers rebuffering equal to 1% of the video's duration watches about 5% less of the video than a comparable viewer with no rebuffering.
- Viewers begin abandoning a video once startup exceeds about 2 seconds, and each additional second of startup delay raises the abandonment rate by roughly 5.8%.
Those are not picture-quality numbers. They are delivery numbers, and they are the reason a streaming team that ignores stalls is optimizing the wrong thing.
Switching: when the picture is fine but it keeps changing
Adaptive Bitrate streaming (ABR) is the mechanism that keeps a stream alive on a variable network: the player continuously picks the highest-quality rendition the current bandwidth can sustain, switching up when the network improves and down when it degrades. Switching is what keeps you out of a freeze. But the switch itself is perceptible, and past a point it becomes its own artifact.
Two flavors of switching annoy a viewer. A downshift is the obvious one: the picture suddenly goes soft as the player drops to a lower resolution or bitrate, most visible on detailed or high-motion content. Oscillation is the subtler one: the player flips up and down between levels repeatedly, so the picture never settles, and the constant change is more distracting than a stable lower quality would be. The research consensus on adaptive streaming names four QoE factors: initial delay, stalling, average quality, and quality switching. Switching sits alongside the other three as a first-class impairment, not a side effect (Barman and Martini, survey of ABR QoE, 2019; the QABR line of work, 2019).
This is the other reason a single picture score misleads. Run VMAF on each rendition of your ladder and every rung can score beautifully — because each rung is a clean encode. The artifact is not in any rendition; it is in the sequence of renditions the player strung together, and in how often and how far it jumped. A per-rendition score has no way to represent that. The standardized model handles it by integrating quality second-by-second across the whole session, so a stream that averages "good" but lurches between levels is scored below a stream that holds steady — which is exactly how a human rates it.
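A small sketch makes the point about sequences concrete. It is not the P.1203 integration, only a hypothetical helper (`switch_stats`) that summarises a per-segment list of ladder rung indices; the two sessions are invented so that they share the same average rung.

```python
from statistics import mean


def switch_stats(rungs):
    """Summarise ABR switching from a per-segment list of ladder rung indices.

    Returns (switch_count, total_rung_distance, mean_rung). A higher rung
    index means a higher-quality rendition. Illustrative only.
    """
    steps = [abs(b - a) for a, b in zip(rungs, rungs[1:])]
    switches = [s for s in steps if s > 0]
    return len(switches), sum(switches), mean(rungs)


steady = [2, 2, 2, 2, 2, 2, 2, 2]        # holds the middle rung
oscillating = [3, 1, 3, 1, 3, 1, 3, 1]   # flips between rungs 3 and 1

print(switch_stats(steady))
print(switch_stats(oscillating))
```

Both sessions average rung 2, so any per-rendition or mean-quality summary treats them alike. Only the switch count (0 versus 7) and the total distance jumped (0 versus 14 rungs) separate them.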
Figure 2. The session view a picture metric cannot produce. Along the playback timeline: a startup delay before the first frame, a mid-session rebuffering freeze, and several ABR switches between renditions. The QoE metrics that grade the session — startup time, rebuffering ratio, stall count, and a switching index — are derived from the timeline, not from any single frame. A full-reference picture score has no time axis and sees none of it.
Tiling: what an unconcealed packet loss looks like
Now the pixel-side artifact. On unreliable transport, a lost packet means a chunk of the encoded picture never arrives. Because video codecs pack the frame into blocks (macroblocks in H.264, coding tree units in HEVC, superblocks in AV1), the missing data shows up as tiling: rectangular regions of the frame filled with garbage, frozen stale content, or a flat smear where the real content should be. It looks like a torn mosaic, and it is the unmistakable signature of delivery loss rather than compression.
Two things make tiling worse than a single damaged block, and both come from how codecs compress across time. First, most frames are not stored whole; they are stored as differences from other frames. A group of pictures (GOP) starts with a self-contained keyframe — an I-frame, or more precisely an IDR frame — and is followed by predicted frames (P- and B-frames) that only encode what changed. Lose part of a reference frame and the error does not stay put: it propagates to every later frame that predicts from it, smearing and dragging forward until the next keyframe arrives and resets the picture (Huszák and Imre, analysis of GOP structure and packet-loss propagation, 2010). Lose part of the keyframe itself and the entire GOP is damaged.
The arithmetic is worth seeing once. With a common 2-second keyframe interval at 30 frames per second, one corrupted reference frame can visibly damage up to 2 × 30 = 60 frames — two full seconds of playback — before the next keyframe cleans it up. That is why a single lost packet can produce a smear that lasts far longer than a single frame, and why keyframe frequency is one of the levers for limiting loss damage.
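The same arithmetic as a short function. This is a worst-case sketch: the hypothetical `frames_damaged` helper assumes every later frame in the GOP predicts, directly or indirectly, from the corrupted one, which overstates the damage when the loss lands in a frame that nothing references.

```python
def frames_damaged(loss_frame_index, keyframe_interval_s, fps):
    """Upper bound on frames affected by one corrupted frame in a GOP.

    loss_frame_index is the position of the corrupted frame inside the GOP
    (0 = the keyframe itself). Assumes the error propagates to every later
    frame until the next keyframe resets the picture.
    """
    gop_frames = round(keyframe_interval_s * fps)
    if not 0 <= loss_frame_index < gop_frames:
        raise ValueError("loss_frame_index must fall inside the GOP")
    return gop_frames - loss_frame_index


worst = frames_damaged(0, keyframe_interval_s=2, fps=30)   # loss in the keyframe
late = frames_damaged(45, keyframe_interval_s=2, fps=30)   # loss 1.5 s into the GOP
print(worst, worst / 30)  # prints: 60 2.0
print(late, late / 30)    # prints: 15 0.5
```

Shortening the keyframe interval lowers the bound directly, at the cost of spending more bits on keyframes.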
Second, decoders try to hide the loss, and the concealment has its own look. Error concealment replaces a missing block either by copying the matching region from the previous frame (temporal concealment) or by interpolating from neighboring blocks (spatial concealment). When the temporal guess is wrong — the content moved — you get the characteristic smearing or dragging, where a patch of the previous frame is stamped into the wrong place. Modern real-time stacks often take the safest concealment of all: WebRTC's browser implementations freeze on the last good frame until a keyframe arrives rather than show torn content (getstream.io and bloggeek.me, WebRTC media resilience, 2025). So even on the unreliable-transport path, the modern default frequently converts a would-be tile into a brief freeze — the two artifacts are closer cousins than they look.
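Temporal concealment and its failure mode can be shown on a toy frame. The sketch below is not how any particular decoder implements concealment; `conceal_temporal`, the 2×2 block size, and the 4×4 frames are hypothetical, chosen so the result is easy to read.

```python
def conceal_temporal(prev_frame, cur_frame, lost_blocks, block=2):
    """Fill each lost block with the co-located block of the previous frame.

    Frames are lists of rows of pixel values; lost_blocks holds
    (block_row, block_col) pairs. Returns a new frame, inputs are untouched.
    """
    out = [row[:] for row in cur_frame]
    for by, bx in lost_blocks:
        for y in range(by * block, (by + 1) * block):
            for x in range(bx * block, (bx + 1) * block):
                out[y][x] = prev_frame[y][x]
    return out


# A bright 2x2 object (value 9) moves one block to the right between frames.
prev = [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
truth = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0], [0, 0, 0, 0]]

# The top-left block of the current frame is lost in transit.
concealed = conceal_temporal(prev, truth, lost_blocks=[(0, 0)])
for row in concealed:
    print(row)
```

Because the object moved, the copied block stamps the old object back into the top-left corner next to the new one: the top two rows come out as `[9, 9, 9, 9]`. On static content the same copy would be invisible, which is why temporal concealment works well until something moves.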
Figure 3. Tiling and its propagation. A packet loss damages blocks in a reference frame (the tiled region). Because later P- and B-frames predict from it, the error smears forward through the group of pictures — up to ~60 frames at a 2-second, 30 fps keyframe interval — until the next keyframe (IDR) resets the picture. Error concealment can hide or freeze the damage; when its temporal guess is wrong, you see dragging instead.
Why your picture metric is blind to all three
The thread through this article is one measurement truth: a full-reference picture metric measures the encode, not the session. Remember that a full-reference metric needs the pristine original and compares it, frame by frame, to a decoded copy — the setup behind PSNR, SSIM, and VMAF (see full-reference, reduced-reference, and no-reference). That setup cannot see any of the three artifacts here, for three different reasons.
A freeze has no wrong pixels, so a frame-by-frame comparison finds no error — the metric has no time axis to notice that a correct frame was held too long. A switch is invisible because each rendition, scored on its own, is a clean encode; the artifact lives in the sequence and the jumps, which a single per-rendition score does not represent. And tiling lives in the decoded playback on the viewer's device, not in the encoded file you usually run the metric against — so a metric computed on the pristine master or the clean encode never encounters it at all. To catch tiling with an objective metric you would have to run a no-reference metric on the actual impaired decode, because on a live or corrupted stream there is no clean reference to compare to (no-reference quality for live and UGC). And as always, pooling hides the rest: average a two-second smear across a ten-minute session and the mean barely twitches.
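The pooling effect is easy to reproduce with invented numbers. The sketch below assumes hypothetical per-second quality scores on a 0 to 100 scale for a 600-second session, with a two-second smear scored at 20; the `pool` helper and the `bad_below` threshold of 60 are illustrative choices, not part of any standard.

```python
from statistics import mean


def pool(scores, bad_below=60.0):
    """Summarise per-second scores three ways: mean, worst, and bad seconds."""
    return {
        "mean": mean(scores),
        "worst": min(scores),
        "bad_seconds": sum(1 for s in scores if s < bad_below),
    }


scores = [96.0] * 600            # ten minutes of clean playback
scores[300:302] = [20.0, 20.0]   # a two-second smear mid-session

summary = pool(scores)
print(summary)
```

The mean lands near 95.75, a drop of about a quarter of a point, while the worst second and the count of bad seconds expose the event. Reporting a minimum or a low percentile next to the mean is one simple way to keep short impairments visible.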
Common mistake: certifying the experience with a file-based picture score. The expensive error is to run VMAF across your encoding ladder, see every rung score 95+, and sign off on "quality." That score validated the encode. It says nothing about whether viewers got stalls, watched the picture oscillate, or saw a packet-loss tear — because none of that is in the files you measured. A stream can be 96 VMAF on every rendition and still bleed watch time to a 1% rebuffering ratio. If delivered experience is in scope, a picture metric is necessary but not sufficient: pair it with session-level QoE measurement and player-side delivery metrics, or you are grading the half of the pipeline that was probably already fine.
How to actually measure switching, freezing, and tiling
If the file-based picture metrics are blind here, what sees these artifacts? Three layers, in rough order of how close they sit to the viewer.
Player-side delivery metrics — the ground floor. The cheapest and most direct measurement is to instrument the player and count the events: startup time, number and duration of stalls, the rebuffering ratio, and the number and size of ABR switches. The industry has a standard vocabulary for exactly this. CTA-2066, the Consumer Technology Association's Streaming Quality of Experience Events, Properties and Metrics, defines a common set of player events and QoE metrics — availability, startup time, continuity (the stall family), and quality — so that different players and analytics vendors report the same thing the same way (CTA WAVE R4 WG20, public version). These are the numbers your dashboard should carry next to VMAF, not instead of it.
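Counting these events takes very little code once the player emits them. The sketch below assumes a hypothetical event log of `(timestamp_s, name, payload)` tuples; the event names (`load_start`, `first_frame`, `stall_start`, `stall_end`, `rendition_change`) are invented for illustration and are not the CTA-2066 identifiers.

```python
def session_metrics(events, play_time_s):
    """Derive basic delivery metrics from a player event log (illustrative).

    events: iterable of (timestamp_s, name, payload) tuples.
    play_time_s: seconds of media actually played, excluding stalls.
    """
    load_t = stall_t = startup_s = None
    stalls = []
    switches = 0
    for t, name, _payload in sorted(events, key=lambda e: e[0]):
        if name == "load_start":
            load_t = t
        elif name == "first_frame" and load_t is not None:
            startup_s = t - load_t
        elif name == "stall_start":
            stall_t = t
        elif name == "stall_end" and stall_t is not None:
            stalls.append(t - stall_t)
            stall_t = None
        elif name == "rendition_change":
            switches += 1
    stall_time_s = sum(stalls)
    return {
        "startup_s": startup_s,
        "stall_count": len(stalls),
        "stall_time_s": stall_time_s,
        "rebuffering_ratio": stall_time_s / (stall_time_s + play_time_s),
        "switch_count": switches,
    }


# Hypothetical session log.
events = [
    (0.0, "load_start", None),
    (3.0, "first_frame", None),
    (63.0, "rendition_change", 720),
    (120.0, "stall_start", None),
    (124.0, "stall_end", None),
    (200.0, "rendition_change", 1080),
    (400.0, "stall_start", None),
    (402.0, "stall_end", None),
]
metrics = session_metrics(events, play_time_s=600.0)
print(metrics)
```

The rebuffering ratio here divides stall time by stall-plus-play time. Definitions vary between analytics tools, so check which denominator a given dashboard uses before comparing numbers.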
Session QoE models — the synthesis. To turn those events into a single experience score, the controlling standard is ITU-T P.1203, the first standardized QoE model for HTTP adaptive streaming. It predicts a Mean Opinion Score (MOS, the 1-to-5 human rating scale) for a whole session of up to five minutes, and — this is the point — it explicitly folds in compression quality, spatial and temporal switching, initial loading delay, and stalling, integrating them second by second through its Pq module (ITU-T P.1203, 2017; validated on 1,000+ sequences and over 25,000 human ratings, Robitza et al., 2018). Notably, P.1203 is not a full-reference metric: it works from stream metadata and the bitstream — codec, bitrate, resolution, frame types and sizes, plus the client's stalling events — so it needs no pristine reference and runs without decoding pixels. Its successor video model, ITU-T P.1204.3 (2020), is a bitstream model validated to 4K for H.264, HEVC, and VP9, and its per-second outputs can feed the same session integration. This family is how you put a defensible number on a streaming session, stalls and switches included.
No-reference picture metrics: for the tiling case. When the artifact is decode-side corruption with no clean reference (live, UGC, a captured playout), you need a blind metric that scores the impaired picture directly, looking for the blocky-edge discontinuities and frozen-frame runs that signal loss. This is the no-reference toolbox (covered in no-reference quality for live and UGC); the article's companion script is a deliberately simple version of the idea.
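The two cues named above can be sketched in a few lines. This is an illustrative toy, not the article's companion script and not a production metric: the helpers `frozen_runs` and `blockiness`, the 8-pixel block size, and the tolerance values are all assumptions, and frames are plain lists of grayscale rows.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def frozen_runs(frames, min_repeats=2, tol=0.0):
    """Find runs where frames repeat the previous frame.

    Returns (first_repeated_index, repeat_count) for each run of at least
    min_repeats consecutive repeats.
    """
    runs, start, count = [], None, 0
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) <= tol:
            start = i if start is None else start
            count += 1
        else:
            if start is not None and count >= min_repeats:
                runs.append((start, count))
            start, count = None, 0
    if start is not None and count >= min_repeats:
        runs.append((start, count))
    return runs


def blockiness(frame, block=8):
    """Ratio of the mean horizontal step at block boundaries to the mean
    step elsewhere. Values well above 1 suggest visible block edges."""
    edge, inner = [], []
    for row in frame:
        for x in range(1, len(row)):
            step = abs(row[x] - row[x - 1])
            (edge if x % block == 0 else inner).append(step)
    edge_mean = sum(edge) / len(edge) if edge else 0.0
    inner_mean = sum(inner) / len(inner) if inner else 0.0
    return edge_mean / (inner_mean + 1e-9)


# Frame k is a flat 2x2 image of value k; frame 2 is held for three extra frames.
frames = [[[k, k], [k, k]] for k in (0, 1, 2, 2, 2, 2, 3, 4)]
print(frozen_runs(frames))  # prints: [(3, 3)]

ramp = [list(range(16))]            # smooth gradient, no block edges
tiled = [[10] * 8 + [200] * 8]      # two flat 8-pixel blocks
print(blockiness(ramp) < blockiness(tiled))  # prints: True
```

Both cues produce false positives on real content: a genuinely static scene looks frozen, and a picture with real edges on an 8-pixel grid looks blocky. A practical detector would combine them with player loss and error events rather than trust either alone.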
| Artifact | What causes it | Where it lives | Picture metric (PSNR/SSIM/VMAF) | How to actually measure it |
|---|---|---|---|---|
| Switching | ABR moving between renditions on a changing network | The sequence of renditions | Blind — each rendition scores fine alone | ITU-T P.1203 session MOS; player switch count + magnitude (CTA-2066) |
| Freezing / stall | Buffer underrun on reliable transport; startup delay | Time, not pixels | Blind — every frame is a correct frame | Rebuffering ratio, stall count/duration, startup time (CTA-2066); P.1203 |
| Tiling / corruption | Unconcealed packet loss on unreliable transport | The decoded playback, not the file | Blind — not present in the encoded file scored | No-reference metric on the decode; player loss/error events |
Table 1. Three streaming artifacts the encoder did not cause, and why a file-based picture metric misses each. Switching hides in the sequence, freezing hides in time, and tiling hides in the decoded output the metric never sees. Measure them with session models (P.1203/P.1204), player-side metrics (CTA-2066), and no-reference checks on the actual decode.
A worked example: the dashboard that says "excellent"
Make the blind spot concrete with one session. A viewer watches a 10-minute (600-second) video-on-demand title. The encode is your best ladder: every rendition scores VMAF 96. The session itself goes like this: a 3-second startup delay before the first frame, then two rebuffering stalls of 4 and 2 seconds, and five downward ABR switches as the viewer's train passes through tunnels.
Compute the delivery numbers the picture score never touched. Total stall time is 4 + 2 = 6 seconds. The rebuffering ratio (stall time divided by stall-plus-play time) is 6 ÷ (6 + 600) = 6 ÷ 606 ≈ 0.99%, essentially the 1% threshold from the engagement study. Apply that study's finding and this viewer is on track to watch about 5% less of the title than a stall-free viewer: call it 30 fewer seconds of a 10-minute video, gone to a spinner. Separately, the 3-second startup is one second past the 2-second abandonment threshold, so by the same study roughly 5.8% more viewers like this one will have bailed before the title even started. The five downward switches, meanwhile, mean the picture visibly softened five times, each one a small annoyance the per-rendition VMAF 96 cannot represent.
One session, two verdicts. "Excellent" if you read the encoding ladder; "leaking watch time and patience" the moment you measure the session. The picture metric did not lie about the encode — it was simply never asked about the delivery. (The 5% and 5.8% figures are the published Krishnan-Sitaraman elasticities, applied illustratively; your content and audience will differ — measure your own.)
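The worked example's arithmetic, written out so it can be rerun with your own session numbers. The constants are the published elasticities quoted above, applied illustratively; the variable names are hypothetical, and treating the roughly 1% rebuffering ratio as a flat 5% play-time loss follows the example rather than any fitted curve.

```python
# Published elasticities, applied illustratively (Krishnan and Sitaraman, 2012).
PLAY_LOSS_AT_1PCT_REBUFFER = 0.05    # ~5% less play time at 1% rebuffering
ABANDON_PER_EXTRA_SECOND = 0.058     # ~5.8% more abandonment per extra second
STARTUP_THRESHOLD_S = 2.0            # abandonment begins past ~2 seconds

# The session from the worked example.
video_s = 600
stalls_s = [4, 2]
startup_s = 3

stall_time_s = sum(stalls_s)
rebuffering_ratio = stall_time_s / (stall_time_s + video_s)
lost_play_s = PLAY_LOSS_AT_1PCT_REBUFFER * video_s
extra_abandonment = max(0.0, startup_s - STARTUP_THRESHOLD_S) * ABANDON_PER_EXTRA_SECOND

print(f"rebuffering ratio: {rebuffering_ratio:.2%}")
print(f"estimated lost play time: {lost_play_s:.0f} s")
print(f"estimated extra abandonment: {extra_abandonment:.1%}")
```

None of these three numbers can be derived from the ladder's VMAF scores, which is the whole point of the example.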
Figure 4. The two-verdict session. The encoding ladder scores VMAF 96 on every rendition — a flawless picture verdict. The same session, delivered, carries a 3-second startup, two stalls (a ~1% rebuffering ratio), and five switches: a leaking-watch-time verdict. A full-reference picture metric measures the left side; session models (P.1203) and player metrics (CTA-2066) measure the right. You need both.
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming and OTT, WebRTC conferencing, e-learning, telemedicine, and surveillance — and the two delivery regimes in this article map directly onto those products. A WebRTC conferencing or telemedicine call lives on UDP/RTP, so its failure mode is packet-loss tiling and concealment freezes, and the right instrumentation is loss, jitter, and freeze events at the client plus a no-reference check on the decoded picture. An OTT or e-learning VOD app lives on HLS/DASH over TCP, so its failure mode is startup delay, rebuffering, and ABR switching, and the right instrumentation is the CTA-2066 event set fed into a P.1203-style session score. We treat delivered experience as a separate measurement problem from encode quality: a picture metric on the ladder proves the file is good, and session-and-player QoE proves the viewer got it good — and only the pair tells you the truth. Where the artifact traces back to a pipeline choice, the cause-side fix lives one step upstream, in the delivery and ABR design covered by our streaming work.
What to read next
- Streaming QoE: The Metrics That Predict Whether a Viewer Stays
- Rebuffering Ratio and the Cost of a Stall
- Tracing an Artifact to Its Cause in the Pipeline
Call to action
- Talk to a video engineer: book a 30-minute scoping call to discuss how you measure and fix streaming video artifacts.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- Recommendation ITU-T P.1203, "Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport," International Telecommunication Union, 2017 (with P.1203.1, P.1203.2, P.1203.3). Tier 1 (official standard). The first standardized QoE model for HTTP adaptive streaming; predicts session MOS (1–5) and explicitly integrates compression quality, spatial/temporal switching, initial loading delay, and stalling via the Pq module; metadata/bitstream-based, not full-reference. Basis for the session-model measurement claims. https://www.itu.int/rec/T-REC-P.1203
- Recommendation ITU-T P.1204.3, "Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full bitstream information," International Telecommunication Union, January 2020. Tier 1 (official standard). Bitstream mode-3 video model for HAS over TCP/QUIC, validated to 4K for H.264, HEVC, and VP9; per-second outputs feed a P.1203-style session integration; developed with VQEG. Basis for the P.1204 paragraph and the reliable-transport framing. https://www.itu.int/rec/T-REC-P.1204.3-202001-I/en
- CTA-2066, "Streaming Quality of Experience Events, Properties and Metrics," Consumer Technology Association (CTA WAVE R4 WG20), public version. Tier 1 (official standard). Standardizes player events and QoE metrics — availability, startup time, continuity (stalls/rebuffering), and quality — for consistent reporting across players and analytics vendors; does not define transport. Basis for the player-side metric vocabulary. https://github.com/cta-wave/R4WG20-QoE-Metrics
- S. S. Krishnan and R. K. Sitaraman, "Video Stream Quality Impacts Viewer Behavior: Inferring Causality Using Quasi-Experimental Designs," ACM Internet Measurement Conference (IMC), 2012. Tier 5 (peer-reviewed). Analysis of 23M views from 6.7M viewers on Akamai; establishes that a rebuffering delay equal to 1% of video duration reduces play time by ~5%, and that each 1 s of added startup delay past ~2 s raises abandonment by ~5.8%. Basis for the engagement numbers and the worked example. https://people.cs.umass.edu/~ramesh/Site/HOME_files/imc208-krishnan.pdf
- W. Robitza, S. Göring, A. Raake, et al., "HTTP Adaptive Streaming QoE Estimation with ITU-T Rec. P.1203 — Open Databases and Software," ACM Multimedia Systems Conference (MMSys), 2018. Tier 5 (peer-reviewed / reference implementation). Documents the open-source P.1203 implementation and its validation (correlation ~0.85–0.9 with subjective MOS); details the modular Pv/Pa/Pq structure and the four operating modes. Basis for the P.1203 module and accuracy claims. https://dl.acm.org/doi/10.1145/3204949.3208124
- Á. Huszák and S. Imre, "Analysing GOP Structure and Packet Loss Effects on Error Propagation in MPEG-4 Video Streams," International Symposium on Communications, Control and Signal Processing (ISCCSP), 2010. Tier 5 (peer-reviewed). Shows how a packet loss in a reference frame propagates through the GOP via prediction until the next I-frame, and how GOP length trades compression efficiency against error-propagation extent. Basis for the propagation arithmetic and the I-frame-loss claim. http://www.hit.bme.hu/~huszak/publ/Analysing%20GOP%20Structure%20and%20Packet%20Loss%20Effects%20on%20Error%20Propagation%20in%20MPEG-4%20Video%20Streams.pdf
- N. Barman and M. G. Martini, "QoE Modeling for HTTP Adaptive Video Streaming — A Survey and Open Challenges," IEEE Access, vol. 7, 2019. Tier 5 (peer-reviewed). Surveys ABR QoE and identifies the recurring impairment factors — video quality, startup delay, stalling, and quality switching — establishing switching as a first-class QoE impairment, not a side effect. Basis for the four-factor switching framing. https://ieeexplore.ieee.org/document/8930519
- "Media Resilience in WebRTC" and "Fixing Packet Loss in WebRTC," getstream.io and bloggeek.me, 2024–2025. Tier 6 (educational / practitioner). Document WebRTC's default video concealment — freezing on the last good frame until a keyframe arrives — and the NACK/FEC/PLI resilience tools on the RTP path. Orientation for the concealment-as-freeze behavior on unreliable transport. https://getstream.io/resources/projects/webrtc/advanced/media-resilience/
- "TCP vs UDP: What's the Difference for Video Streaming?" and "Video Streaming Protocols," Ant Media, 2026. Tier 6 (educational). Summarizes reliable (TCP/QUIC, used by HLS/DASH) vs unreliable (UDP/RTP, used by WebRTC) transport and how each handles loss — retransmission-and-rebuffering vs best-effort-and-corruption. Orientation for the two-regimes framing. https://antmedia.io/tcp-vs-udp-video-streaming/


