Where Objective Video Quality Metrics Lie (and Why)

Why this matters

You have met the metrics — PSNR, SSIM, and VMAF — and you can compute them. This article is the one you need next: knowing where each one lies, so you do not ship an encoder change that "improved VMAF" but looks worse, or fail a release that a viewer would have passed. It is for the encoding lead, streaming engineer, or QA engineer who already reports quality numbers and now has to trust them on real, varied content. The single most expensive mistake in measurement is optimizing for a metric on content the metric was never built to judge, and that is exactly what this catalogue helps you avoid.

A metric is a model, and every model has a domain

Start with the one idea that explains every failure in this article. A full-reference quality metric — one that compares a compressed video against the pristine original, the setup covered in the three measurement setups — is not a law of physics. It is a model: a formula, or in VMAF's case a small machine-learning model, that was tuned so its output lines up with scores real people gave in a subjective test. PSNR was derived from signal engineering, SSIM from a theory of how the eye reads structure, and VMAF was literally trained on human ratings of a set of clips.

Every model has a domain — the range of inputs it was built and checked against. Inside that domain it predicts human opinion well. Outside it, the model still produces a number, confidently, but the number no longer means what you think. VMAF's standard model, for instance, assumes a viewer sitting in front of a 1920×1080 display at roughly three times the screen height, which works out to about 60 pixels per degree of vision (Netflix VMAF documentation, 2026). Feed it 4K on a phone, or a screen recording full of text, and you have left the domain.
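The "about 60 pixels per degree" figure can be checked with a little trigonometry. The sketch below is only an illustration: the function name and the phone viewing distance are assumptions made for the example, not values from the VMAF documentation.

```python
import math

def pixels_per_degree(vertical_pixels: int, distance_in_heights: float) -> float:
    """Pixels covered by one degree of visual angle at the screen centre."""
    # One degree of visual angle spans 2 * d * tan(0.5 deg) on the screen,
    # where d is the viewing distance. With d measured in screen heights,
    # multiplying by the vertical pixel count converts that span to pixels.
    span_in_heights = 2 * distance_in_heights * math.tan(math.radians(0.5))
    return span_in_heights * vertical_pixels

ppd_tv = pixels_per_degree(1080, 3.0)     # the setup the standard model assumes
ppd_phone = pixels_per_degree(1080, 6.0)  # hypothetical phone held at 6 screen heights

print(f"1080p at 3 heights: {ppd_tv:.1f} px/deg")
print(f"1080p at 6 heights: {ppd_phone:.1f} px/deg")
```

The first case lands a little under 57 pixels per degree, which the documentation rounds to about 60. The hypothetical phone case is roughly double that, so the same encode is viewed at a very different angular resolution from the one the model was calibrated for.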

So "the metric lies" is shorthand for a precise thing: the content broke an assumption the metric was built on, so its score no longer tracks what a viewer would say. The rest of this article is a tour of the assumptions that break most often, and what to do at each one. Remember throughout that the eye — a properly run subjective test — is the ground truth, and the metric is only ever a proxy for it.

The mental model: a metric reports agreement, not appearance

Before the catalogue, one demonstration of why even a correctly computed metric is not the truth. The oldest metric, PSNR (Peak Signal-to-Noise Ratio, the number that compares two frames pixel by pixel and reports their difference in decibels), is built on mean squared error: average the squared difference of every pixel, and turn it into dB. Its fatal property is that it treats every pixel error as equally important, no matter where it lands or what it looks like.

Watch what that allows. Take a flat gray patch and distort it two ways, each tuned to the identical mean squared error of 100 (on the 0–255 scale):

Distortion A — add 10 to every pixel (a faint global brightness shift)
   squared error per pixel = 10² = 100
   MSE = 100

Distortion B — corrupt 1% of pixels by ±100 (scattered black-and-white speckle)
   squared error contribution = 0.01 × 100² = 100
   MSE = 100

PSNR (both) = 10 × log10(255² / 100)
            = 10 × log10(650.25)
            = 10 × 2.813
            = 28.13 dB

Both distortions score an identical 28.13 dB. Yet a viewer barely notices the global brightness nudge in A, while the impulsive speckle in B is glaring. Same number, opposite experience — and not because PSNR was computed wrong, but because PSNR answers "how much do the pixels disagree?" not "how bad does it look?" This is the classic result behind two decades of better metrics (Wang and Bovik, "Mean Squared Error: Love It or Leave It?", IEEE Signal Processing Magazine, 2009). SSIM and VMAF were invented precisely to close this gap, and they close much of it — but they are still models with domains, and the rest of the article shows where they, too, drift from the eye.
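The arithmetic above is easy to reproduce. The following is a minimal sketch in plain Python, assuming a patch of 10,000 mid-gray pixels (value 128) stored as a flat list; the patch size, the gray level, and the random seed are choices made for the example.

```python
import math
import random

PEAK = 255

def mse(ref, dist):
    """Mean squared error between two equally sized pixel lists."""
    return sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)

def psnr(ref, dist):
    """PSNR in dB for 8-bit content."""
    return 10 * math.log10(PEAK ** 2 / mse(ref, dist))

n = 10_000
ref = [128] * n                          # flat mid-gray patch

# Distortion A: add 10 to every pixel.
shifted = [p + 10 for p in ref]

# Distortion B: change exactly 1% of the pixels by +100 or -100.
rng = random.Random(0)
hit = set(rng.sample(range(n), n // 100))
speckled = [
    p + rng.choice((-100, 100)) if i in hit else p
    for i, p in enumerate(ref)
]

print(f"A: {psnr(ref, shifted):.2f} dB")
print(f"B: {psnr(ref, speckled):.2f} dB")
```

Both lines print 28.13 dB, even though `shifted` and `speckled` are entirely different pictures.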

Figure 1. Identical PSNR, opposite experience. A global +10 shift (left) is nearly invisible; 1% impulse noise (right) is glaring. Both measure 28.13 dB, because PSNR weights every pixel error equally regardless of how it looks.

The blind-spot catalogue

Each section below names a content type, the assumption it breaks, which metrics drift, and what to do instead. These are not exotic edge cases; they are the everyday content that fills surveillance feeds, sports streams, OTT catalogues, conference calls, and game streams.

Film grain and heavy texture

Film grain — the fine, random noise photographers and colorists deliberately keep for a filmic look — is the cleanest example of a metric measuring the wrong thing. Grain is random by nature, so no two renderings of it line up pixel for pixel. A full-reference metric, which rewards pixel and feature agreement with the original, sees that mismatch as error even when the grain looks perfect to a viewer.

Modern codecs make this worse in a useful way. AV1 and newer codecs can strip the grain before encoding, then synthesize fresh grain of the same character at playback — a trick that saves up to 50% of the bitrate on heavily grained content (Chen et al., "An Overview of Coding Tools in AV1", APSIPA, 2020). The synthesized grain looks right but sits in different pixel positions, so objective metrics penalize it heavily; the AV1 grain work was evaluated with an informal subjective test precisely because objective metrics "don't work well in this case" (Norkin and Birkbeck, "Film Grain Synthesis for AV1", IEEE DCC, 2018). Netflix's own guidance is blunt: synthesized grain interferes with VMAF, and you should disable grain synthesis when computing VMAF (Netflix/vmaf issue #1192, 2026). The same logic applies to any high-texture content — foliage, water, confetti, crowds — where the detail is statistically similar but not pixel-identical.
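The mismatch can be illustrated with a toy model. The sketch below assumes grain is independent Gaussian noise with a standard deviation of 10 code values on a flat gray frame; real film grain and real grain synthesis are more structured, so treat the numbers as an illustration of the mechanism only.

```python
import math
import random

N = 20_000
SIGMA = 10.0   # hypothetical grain strength, in 8-bit code values

def grain_layer(seed):
    """One random realization of the toy grain."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, SIGMA) for _ in range(N)]

def psnr(ref, dist):
    err = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    return 10 * math.log10(255 ** 2 / err)

clean = [128.0] * N
source = [c + g for c, g in zip(clean, grain_layer(seed=1))]    # grainy original
resynth = [c + g for c, g in zip(clean, grain_layer(seed=2))]   # same statistics, new positions

psnr_resynth = psnr(source, resynth)     # grain re-synthesized with identical character
psnr_degrained = psnr(source, clean)     # grain removed entirely

print(f"re-synthesized grain vs source: {psnr_resynth:.1f} dB")
print(f"grain stripped vs source:       {psnr_degrained:.1f} dB")
```

In this toy setup the version with statistically identical grain scores about 3 dB lower than the version with no grain at all, because two independent noise layers disagree twice as much, in squared-error terms, as one noise layer disagrees with a flat frame.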

What to do: when measuring grain-synthesis encodes, disable synthesis at decode and measure the grain-free signal, or compare on the de-grained source so you are measuring the codec, not the grain mismatch. Treat a VMAF drop on grainy content as suspect until you have confirmed it with your eyes.

High motion, and the high-frame-rate trap

Fast motion breaks metrics in two opposite directions, and VMAF's own history shows both. First, the human eye masks fine detail during fast motion (you simply cannot resolve texture on a hard pan or a sports camera whip), so errors that a per-frame metric counts in full are partly invisible to a viewer. VMAF v0 was under-trained on high-motion sequences and, with an unbounded motion feature, tended to overpredict quality on very high-motion scenes (Netflix VMAF v1 blog, 2026). Second, and going the other way, v0 measured motion between consecutive frames, so at 60 frames per second the per-frame change looked small and v0 underpredicted quality relative to 24 or 30 fps content. VMAF v1 addresses both with a hard threshold on the motion feature and an option to measure motion over a longer temporal window, but the lesson stands: motion is where a metric's temporal assumptions show.

What to do: keep frame rate constant across any comparison, and never compare a 60 fps score against a 30 fps score as if they were on one scale. If you measure high-motion or high-frame-rate content, use the current VMAF model rather than an old one, and spot-check the fastest scenes by eye.

Dark scenes and shadow detail

Dark scenes are perceptually treacherous and metrics handle them poorly. The eye is highly sensitive to small steps in near-black, where banding and blocking show up first, yet PSNR and SSIM computed on standard luma give a dark frame few bright pixels to "disagree" about, so a large visible error in the shadows can move the score very little. The problem compounds in high dynamic range, where the expanded luminance range puts more meaningful detail in the shadows and highlights; there is no widely deployed public VMAF model for HDR yet, and metrics built for standard range are known to be less sensitive to distortions in dark regions (Netflix VMAF v1 blog, 2026; HDR/SDR subjective-quality research, 2025).

What to do: for dark or HDR content, do not rely on a single luma metric. Inspect shadow regions directly, weight your subjective spot-checks toward the dark scenes, and watch for banding specifically (next section). Treat any HDR quality number from a standard-range model as indicative, not final.

Banding: the artifact the headline metrics cannot see

Banding — the staircase of visible steps that appears where a smooth gradient like a sky or a fade should be continuous — is the textbook blind spot. Netflix states it plainly: "traditional video quality metrics such as PSNR, SSIM or VMAF are not designed to identify banding" (Netflix CAMBI documentation, 2026). The reason is structural: banding replaces a smooth ramp with a few flat plateaus, which barely changes the pixel statistics these metrics track, so the score stays high while the sky visibly steps.
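A one-dimensional sketch shows how little banding moves a pixel-error metric. It assumes a high-precision gradient rising gently from code value 100 to 116 across a 1920-pixel row, and an encode that collapses it onto steps of 4 code values; both the ramp and the step size are hypothetical.

```python
import math

WIDTH = 1920
STEP = 4   # hypothetical coarse quantization step, in 8-bit code values

# Smooth source ramp, kept in floating point so it has no steps of its own.
ramp = [100 + 16 * i / (WIDTH - 1) for i in range(WIDTH)]

# "Encode": snap every sample to the nearest multiple of STEP.
banded = [STEP * round(v / STEP) for v in ramp]

mse = sum((a - b) ** 2 for a, b in zip(ramp, banded)) / WIDTH
psnr = 10 * math.log10(255 ** 2 / mse)
plateaus = len(set(banded))

print(f"{plateaus} plateaus across {WIDTH} pixels, PSNR {psnr:.1f} dB")
```

The row ends up as five flat plateaus, each hundreds of pixels wide, yet the PSNR stays near 47 dB, a value most pipelines would read as excellent.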

The fix is a specialized detector. Netflix built CAMBI — the Contrast Aware Multiscale Banding Index — exactly because the headline metrics miss banding. CAMBI is a no-reference detector that scores banding frame by frame: 0 means none, around 5 is where banding starts to become annoying, and 24 is unwatchable, with visibility rising the brighter the display and dimmer the room (Netflix CAMBI documentation, 2026). VMAF v1 now folds CAMBI in as one of its features, which is itself the clearest admission that v0 was blind here (Netflix VMAF v1 blog, 2026).

What to do: on any content with skies, gradients, fades, or flat color, and on aggressive low-bitrate encodes, run a banding detector such as CAMBI alongside VMAF, and gate on it separately. A high VMAF and a low CAMBI together mean a clean picture; a high VMAF with a CAMBI of 5+ means a banded one the headline number hid.
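Gating on the two numbers separately can be as small as the sketch below. The CAMBI threshold of 5 is the annoyance level quoted above; the VMAF floor of 93 and the function name are hypothetical placeholders to be replaced with your own release bar.

```python
CAMBI_ANNOYING = 5.0   # annoyance threshold from the CAMBI documentation
VMAF_FLOOR = 93.0      # hypothetical release bar, pick your own

def gate(vmaf: float, cambi: float) -> str:
    """Judge VMAF and banding independently so one cannot hide the other."""
    if vmaf < VMAF_FLOOR:
        return "fail: low VMAF"
    if cambi >= CAMBI_ANNOYING:
        return "fail: banding behind a high VMAF"
    return "pass"

print(gate(vmaf=95.0, cambi=1.2))
print(gate(vmaf=95.0, cambi=6.0))
```

The second call fails even though its VMAF matches the first, which is the point of gating on banding as its own check.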

Figure 4. The artifact the headline metric cannot see. A smooth source gradient becomes flat plateaus in the encode; VMAF still reads 95 while CAMBI flags the banding at 6, and the eye sides with CAMBI.

Chroma and color artifacts

Most quality numbers you have ever read describe brightness only. Standard PSNR is usually reported on the luma channel, SSIM is computed on luminance, and VMAF v0 extracted luma-based features alone, so it was "unaware of chroma artifacts" entirely (Netflix VMAF v1 blog, 2026). Yet encoding and chroma subsampling introduce real color errors — bleeding reds, shifted skin tones, smeared saturated edges — that a luma-only metric cannot see. VMAF v1 added a chroma feature for this reason; older scores simply do not account for color.

What to do: when color fidelity matters — brand colors, skin tones, graphics — measure chroma explicitly (a chroma PSNR on the U and V channels is a cheap start) or use a metric version that includes chroma, and confirm saturated regions by eye.
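A per-plane PSNR is a few lines once the Y, U, and V planes are available as arrays. The sketch below assumes each plane is a flat list of 8-bit samples and uses a made-up frame in which only the chroma is wrong; the function names and the data are illustrative.

```python
import math

def plane_psnr(ref, dist, peak=255):
    """PSNR of a single plane; infinite when the planes are identical."""
    err = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)

def yuv_psnr(ref_planes, dist_planes):
    """Report luma and both chroma planes separately."""
    return {
        name: plane_psnr(ref_planes[name], dist_planes[name])
        for name in ("Y", "U", "V")
    }

# Toy 4:2:0 frame: chroma planes hold a quarter as many samples as luma.
ref = {"Y": [120] * 64, "U": [110] * 16, "V": [150] * 16}
# Encode with perfect luma but a color shift of 6 code values in U and V.
dist = {"Y": [120] * 64, "U": [116] * 16, "V": [144] * 16}

scores = yuv_psnr(ref, dist)
print(scores)
```

A luma-only report on this frame would show a perfect score, while the U and V planes sit near 32.6 dB.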

Screen content, text, and gaming

Metrics learned the statistics of natural, camera-captured video, because that is what they were trained and validated on. Screen content — slide shares, documents, code, UI, maps — and synthetic game output have different statistics: large flat regions, hard high-contrast text edges, repeated patterns, and cursor motion. On this content the perceptual weighting a metric learned no longer matches the eye; a blurred letter that destroys readability may move a natural-content metric far less than its impact warrants. VMAF's makers name "live streaming and cloud gaming" as emerging use cases they are still adapting the metric toward, a tacit acknowledgment that the standard model was not built for them (Netflix VMAF v1 blog, 2026).

What to do: for screen-share, text-heavy, or gaming content, do not trust a single natural-content metric. Spot-check legibility directly, and where the job is screen content, prefer a test method that puts the real content in front of real viewers — the subjective testing block covers how.

Animation and synthetic flat content

Animation, motion graphics, and cartoons sit between natural video and screen content: large areas of flat color, crisp vector edges, and limited texture. The same domain mismatch applies — the metric's learned sense of "how bad is this distortion" was calibrated on photographic content, so it can misrank two animated encodes that a viewer would order clearly. Flat regions are also where banding appears, compounding the problem.

What to do: when comparing encodes on animation, anchor the comparison with at least a small subjective check, and watch flat regions for banding rather than trusting the aggregate score.

Enhancement gaming: when a higher score is a worse picture

One blind spot is self-inflicted. Because VMAF's features respond to added contrast and sharpness, sharpening or contrast tricks can raise a VMAF score without improving true fidelity to the source, and sometimes while harming it. This is the "gaming the metric" problem, and the fix is the No Enhancement Gain model, VMAF-NEG, now on by default in VMAF v1 (Netflix VMAF v1 blog, 2026). It has its own deep dive in VMAF-NEG explained; the catalogue entry is the reminder that a metric you optimize against will eventually be gamed, by your encoder if not by you.

What to do: if an encoder setting raises VMAF, check it against VMAF-NEG; a gain that disappears under NEG was enhancement, not fidelity.
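The check reduces to comparing two deltas. The sketch below assumes you have already measured the standard and NEG scores before and after the change; the function name and the half-point tolerance are hypothetical.

```python
def classify_gain(vmaf_gain: float, neg_gain: float, tolerance: float = 0.5) -> str:
    """Label a score change by whether it survives under the NEG model."""
    if vmaf_gain <= tolerance:
        return "no meaningful gain"
    if neg_gain <= tolerance:
        return "enhancement: gain vanishes under NEG"
    return "fidelity: gain survives under NEG"

# Made-up measurements: standard VMAF rose by 3.1 points, NEG by 0.2.
print(classify_gain(vmaf_gain=3.1, neg_gain=0.2))
```

The tolerance should reflect the run-to-run noise you actually observe on your content rather than the placeholder used here.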

The catalogue at a glance

One table to keep beside your measurements. Read the last column as the action, not the metric's verdict.

Content / case	Why the score lies	Metrics most affected	What to do instead
Film grain, heavy texture	Random detail never lines up pixel-for-pixel; synthesis mismatches positions	PSNR, SSIM, VMAF	Disable grain synthesis when measuring; measure de-grained; confirm by eye
High motion	Eye masks detail in motion; v0 motion feature unbounded → overprediction	VMAF (esp. v0), PSNR	Use current model; keep frame rate fixed; spot-check fast scenes
High frame rate (60 fps)	Consecutive-frame motion looks small → underprediction vs 30 fps	VMAF v0	Never compare across frame rates; use v1; report frame rate
Dark scenes, HDR	Few bright pixels to "disagree"; no deployed public HDR VMAF model	PSNR, SSIM, VMAF	Inspect shadows; weight subjective checks to dark scenes; treat HDR numbers as indicative
Banding (skies, fades)	Smooth ramp → flat plateaus barely changes pixel statistics	PSNR, SSIM, VMAF v0	Run CAMBI; gate on banding separately
Chroma / color	Luma-only computation ignores color channels	Luma PSNR, SSIM, VMAF v0	Measure chroma (U/V) or a chroma-aware model; check saturated regions
Screen content, text, gaming	Trained on natural video; different statistics and edges	PSNR, SSIM, VMAF	Spot-check legibility; subjective test on real content
Animation, flat synthetic	Domain mismatch with photographic training; flat regions band	PSNR, SSIM, VMAF	Small subjective check; watch flat regions for banding
Enhancement (sharpening)	Metric rewards difference, not fidelity → gameable	VMAF (standard)	Check against VMAF-NEG

Figure 2. The blind-spot matrix. Cells show where each metric (PSNR, SSIM, VMAF v0, VMAF v1) is reliable, weak, or blind by content type: the quick reference for which metric to trust on what.

The metrics, side by side: what each measures and where it lies

Zoom out from content to the metrics themselves. The two columns that matter most are the last two.

Metric	Scale / unit	What it measures	Where it lies
PSNR	dB (higher better)	Pixel-wise error vs original, on luma	Weights all errors equally; ignores where and what; blind to banding, chroma, perception
SSIM	0–1	Local luminance, contrast, structure agreement	Luma-only; misses banding and chroma; localized defects average away
MS-SSIM	0–1	SSIM across multiple scales	Same blind spots as SSIM; better across resolutions, still luma and structure only
VMAF v0	0–100	Fused perceptual features, trained on human scores	Blind to banding and chroma; overpredicts high motion; underpredicts 60 fps; gameable by sharpening
VMAF v1	0–100	Adds banding (CAMBI), chroma, NEG-by-default, viewing-distance model	Still working on film grain, high frame rate; remains a natural-content model with a domain

Two reading rules follow from this table. First, a number without its metric, model version, and pooling is not a measurement: "VMAF 95" could be v0 or v1, default or phone, mean- or percentile-pooled, and those are different claims, as reading a quality-metric report details. Second, never compare scores across metrics as if an SSIM of 0.90 and a VMAF of 90 were the same; they are different scales measuring different things.
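One way to enforce the first rule is to store every score together with its context and refuse to compare records whose context differs. The sketch below is one possible shape for such a record; the field names and example values are assumptions made for illustration.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Measurement:
    metric: str       # e.g. "VMAF"
    model: str        # model name and version
    pooling: str      # e.g. "mean" or "5th percentile"
    resolution: str   # resolution the comparison was computed at
    fps: float        # frame rate of the measured video
    score: float

def mismatches(a: Measurement, b: Measurement) -> list:
    """Names of context fields that differ; empty means the scores are comparable."""
    return [
        f.name
        for f in fields(Measurement)
        if f.name != "score" and getattr(a, f.name) != getattr(b, f.name)
    ]

old = Measurement("VMAF", "v1 default", "mean", "1920x1080", 30.0, 94.1)
new = Measurement("VMAF", "v1 default", "mean", "1920x1080", 60.0, 95.3)

print(mismatches(old, new))
```

Here the list contains `fps`, so the 1.2-point difference between the two records should not be read as a quality change.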

Why a metric's headline correlation does not transfer to your content

Every metric is sold with a correlation number — how closely it tracked human scores in its validation, measured with Pearson and Spearman correlation and reported in how metrics are validated. That number was earned on a specific test database, and the international methodology for evaluating a metric assumes you report it against representative content for your use (ITU-T P.1401, the standard for objective-metric evaluation). A metric that scored a 0.95 correlation on professionally produced film clips has told you nothing about how it behaves on your surveillance feed, your slide decks, or your animated explainer.

This is the meta-lesson under the whole catalogue. The blind spots above are simply the places where the gap between "correlation on the validation set" and "correlation on your content" is widest. The defense is the same everywhere: measure on your own content, not on someone else's benchmark.
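Checking correlation on your own content needs only a handful of clips with both a metric score and a subjective score. The sketch below computes Pearson and Spearman correlation in plain Python; the six data points are invented for illustration and the helper names are arbitrary.

```python
import math

def pearson(xs, ys):
    """Linear correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented example: metric scores and mean opinion scores for six of your clips.
metric_scores = [62, 70, 78, 85, 91, 96]
opinion_scores = [2.1, 2.4, 3.6, 3.5, 4.4, 4.6]

print(f"Pearson  {pearson(metric_scores, opinion_scores):.3f}")
print(f"Spearman {spearman(metric_scores, opinion_scores):.3f}")
```

If these figures come out well below the correlation quoted for the metric's validation set, that gap is the domain mismatch showing up in your own data.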

Common mistake: shipping the encoder setting that "improved VMAF." A change that raises mean VMAF on a generic test set can lose quality on your content — banding the metric cannot see, chroma it ignores, or an enhancement that games it. Before you ship, re-measure on a sample of your real content, read VMAF alongside a banding detector and a worst-case pool, check it against VMAF-NEG, and look at the hardest frames. A single rising number is a hypothesis, not a result.

What to do instead: a measurement that does not get fooled

The catalogue's advice rolls up into one playbook. Read a panel of metrics, never a single number: a perceptual metric (VMAF, current model named) plus a structural one (SSIM/MS-SSIM) plus a specialized detector where you have a known blind spot (CAMBI for banding, a chroma metric for color). Pool for the worst case, not just the average, so a short bad stretch cannot hide — the pooling article shows how. Measure on your own content, sampling the genres you actually ship, because correlation does not transfer. Keep every comparison apples-to-apples — same resolution, frames, reference, model, and pooling. And close the loop with the eye: spot-check the worst frames, and for a high-stakes decision run a small subjective test, the only ground truth a metric is ever approximating.
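Worst-case pooling is the easiest part of the playbook to show in code. The sketch below pools a list of per-frame scores three ways; the choice of "worst 5% of frames" and the made-up scores are assumptions for the example, not a recommendation from any metric's documentation.

```python
import statistics

def pool(scores):
    """Pool per-frame scores by average and by worst case."""
    ordered = sorted(scores)
    k = max(1, len(ordered) // 20)   # worst 5% of frames, at least one frame
    return {
        "mean": statistics.fmean(scores),
        "worst_5pct_mean": statistics.fmean(ordered[:k]),
        "min": ordered[0],
    }

# Made-up clip: 95 clean frames and a 5-frame stretch that falls apart.
frames = [96.0] * 95 + [40.0] * 5

pooled = pool(frames)
print(pooled)
```

The mean still reads about 93 while the worst-case pools read 40, so a gate on the mean alone would pass a clip with a visibly broken stretch.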

Figure 3. Before you trust a score: identify the content type, route it through its blind spots, add a detector where the metric is blind, pool for the worst case, and confirm with the eye.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the content our clients ship is exactly the content that breaks a naive metric. Surveillance is dark scenes and grain; conferencing and e-learning are screen shares and text; OTT and sports are high motion and skies that band. So we never gate a pipeline on a lone VMAF number: we measure on the client's own footage, read VMAF beside a banding detector and a worst-case pool, name the model and version on every figure, and confirm the hard cases by eye. When we report that an encoding change is safe, it is because the metric was used inside its domain and checked where it goes blind — the discipline we document in our benchmark methodology.


References

C. G. Bampis, Z. Li, K. Swanson, N. Fons Miret, P. Madhusudanarao, "VMAF v1: Good Is Not Good Enough," Netflix Technology Blog, 2026-06-19. Tier 1 (metric-author primary). Enumerates VMAF v0's known limitations and the v1 fixes: DLM under-penalizing blockiness (AIM added), the viewing-distance/CSF model replacing the phone mapping, banding via CAMBI integration, chroma features added to a previously luma-only metric, NEG enabled by default, and the motion feature's overprediction on high motion plus underprediction at 60 fps — with film grain and high frame rate named as still-open problems. The controlling current source for this article. https://medium.com/netflix-techblog/vmaf-v1-good-is-not-good-enough-60d7e4244ea8
Netflix, "CAMBI" (resource/doc/cambi.md), Netflix/vmaf GitHub repository, accessed 2026-06-23. Tier 1 (metric-author primary). States that PSNR, SSIM and VMAF are not designed to identify banding; defines CAMBI as a no-reference, frame-by-frame banding detector with a 0 (none) to 24 (unwatchable) scale, ~5 the annoyance threshold, and notes banding visibility rises with display brightness and dim ambient light. https://github.com/Netflix/vmaf/blob/master/resource/doc/cambi.md
Netflix, "VMAF Frequently Asked Questions" (resource/doc/faq.md), Netflix/vmaf GitHub repository, accessed 2026-06-23. Tier 1 (metric-author primary). VMAF is a proxy fused from elementary features and trained on subjective scores; documents that the aggregate is the arithmetic mean by default and that humans weight the worst frames more — the basis for treating a VMAF number as a model output, not truth. https://github.com/Netflix/vmaf/blob/master/resource/doc/faq.md
Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. Tier 1 (metric-author defining work). Defines SSIM on luminance/contrast/structure and motivates it by PSNR/MSE's failure to track perception — the foundation for "a metric reports agreement, not appearance" and for SSIM's luma/structure blind spots. https://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
Recommendation ITU-T P.1401, "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union, 2020. Tier 1 (official standard). The standard procedure for evaluating an objective metric against subjective scores (PCC, SROCC, RMSE), and the basis for the rule that a metric's reported correlation holds only for representative content — i.e., does not transfer to your content automatically. https://www.itu.int/rec/T-REC-P.1401
Netflix/vmaf, "VMAF and AV1's film grain synthesis" (issue #1192, response by N. Fons Miret), GitHub, accessed 2026-06-23. Tier 2 (metric-author official guidance). Netflix confirms synthesized film grain interferes with VMAF and recommends disabling grain synthesis when computing VMAF; corroborates that current objective metrics do not capture synthesized-grain quality. https://github.com/Netflix/vmaf/issues/1192
Z. Wang, A. C. Bovik, "Mean Squared Error: Love It or Leave It? A New Look at Signal Fidelity Measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009. Tier 5 (peer-reviewed). Shows that very different distortions can share an identical MSE/PSNR while looking dramatically different — the source for the identical-28.13-dB worked example and the "agreement, not appearance" framing. https://ieeexplore.ieee.org/document/4775883
A. Norkin, N. Birkbeck, "Film Grain Synthesis for AV1 Video Codec," IEEE Data Compression Conference (DCC), 2018. Tier 5 (peer-reviewed). Describes AV1 film-grain synthesis and states that the evaluation used an informal subjective test because objective metrics do not work well when grain is synthesized — primary support for the film-grain blind spot. https://norkin.org/pdf/DCC_2018_AV1_film_grain.pdf
J. Chen et al., "An Overview of Coding Tools in AV1: the First Video Codec from the Alliance for Open Media," APSIPA Transactions on Signal and Information Processing, 2020. Tier 5 (peer-reviewed). Reports up to 50% bitrate savings from film-grain synthesis on heavily grained content and that the tool is excluded from objective-metric comparisons because individual grain positions mismatch — the source for the grain bitrate figure. https://www.cambridge.org/core/journals/apsipa-transactions-on-signal-and-information-processing/article/an-overview-of-coding-tools-in-av1-the-first-video-codec-from-the-alliance-for-open-media/5972E00494363BE37E3439FAE382DB10
R. Rao et al., "HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation," arXiv:2505.21831, 2025. Tier 5 (peer-reviewed/institutional). A subjective dataset built because standard-range objective metrics, weak in dark regions and without a deployed public HDR VMAF model, do not transfer to HDR — support for the dark-scene/HDR blind spot. https://arxiv.org/abs/2505.21831
