What a Metric Can and Cannot Tell You

Why this matters

If you are about to make a real decision from a quality score — ship this encode, pick that codec, lower this bitrate — you are trusting a number to stand in for thousands of human eyes, and you need to know precisely how far that trust can stretch. This article is written for the streaming or encoding lead, the QA engineer, and the technical product owner who has started measuring quality and now has to defend the results without being fooled by them. The job it does is to draw the boundary: it names the handful of jobs a metric does well, the blind spots that have sunk real projects, and the discipline that keeps a number honest. It is the reality check that the rest of this section — every individual metric article — builds on, and it is the article you reread before you quote a score to someone who will spend money on it.

A metric is a forecast, not the weather

Start with the one idea everything else hangs on. An objective quality metric (any algorithm that reads two videos and prints a number, like PSNR, SSIM, or VMAF) is a model used to predict what humans would say if you asked them; some, like VMAF, are trained on human ratings for exactly that purpose. The ground truth of video quality is a subjective test: real viewers, controlled conditions, rating real clips. The metric is the fast, cheap stand-in for that slow, expensive panel.

That makes a metric a forecast. A weather forecast is genuinely useful — you take an umbrella when it says 80% rain — but you never confuse the forecast with the sky. A quality metric is the same kind of object: a well-built prediction of human opinion, with a track record you can check, and an error it carries on every single reading. When the metric says "VMAF 93" it is really saying "a trained model predicts viewers would rate this around 93, give or take." The "give or take" is not a footnote. It is the whole subject of this article.

Because a metric is a prediction, it can be graded, and the grading procedure is itself a standard. ITU-T Recommendation P.1401 (2020) defines how to score an objective metric against subjective data using four numbers: the Pearson correlation coefficient (PCC) for how linearly the metric tracks opinion, the Spearman rank-order correlation (SROCC) for whether it ranks clips in the right order, the root-mean-square error (RMSE) for how far off its predictions land, and the outlier ratio for how often it is wrong by more than human uncertainty can explain. Keep those four measures in mind; they are how the field admits, in public, that no metric is perfect.

[Figure: scatter of metric score vs subjective MOS with a fit line and two circled outliers, graded by four ITU-T P.1401 numbers.]

Figure 1. An objective metric predicts the mean human opinion score. The fit is good but never perfect: the circled points are content where the metric and the eye disagree, and the four P.1401 numbers grade exactly that.
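To make the four grading numbers concrete, here is a minimal Python sketch of how they could be computed for a handful of clips. It is an illustration under stated assumptions, not the Recommendation's exact procedure: the predictions are assumed to be already mapped onto the MOS scale, the RMSE uses the plain 1/N form without any degrees-of-freedom correction, and the per-clip 95% confidence half-widths (`mos_ci95`) and all sample values are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(pred, mos):
    """Root-mean-square prediction error (plain 1/N form, an assumption)."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, mos)) / len(mos))


def outlier_ratio(pred, mos, ci_half_widths):
    """Share of clips whose error exceeds that clip's subjective 95% CI."""
    outliers = sum(
        1 for p, m, ci in zip(pred, mos, ci_half_widths) if abs(p - m) > ci
    )
    return outliers / len(mos)


# Hypothetical data: five clips, subjective MOS and metric predictions.
mos = [1.8, 2.5, 3.1, 3.9, 4.6]
pred = [2.0, 2.4, 3.6, 3.8, 4.5]
mos_ci95 = [0.3, 0.3, 0.3, 0.3, 0.3]

print(f"PCC   = {pearson(pred, mos):.3f}")
print(f"SROCC = {spearman(pred, mos):.3f}")
print(f"RMSE  = {rmse(pred, mos):.3f}")
print(f"OR    = {outlier_ratio(pred, mos, mos_ci95):.2f}")
```

In this toy data the metric ranks every clip correctly, yet one prediction still misses by more than the panel's own uncertainty. That is the kind of disagreement the outlier ratio exists to expose.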

What a metric can tell you

Used inside its limits, a metric earns its keep at four jobs, and it does them better than any human panel could afford to.

It can rank two encodes of the same content. If clip A scores VMAF 95 and clip B scores VMAF 88, measured the same way, A is almost certainly the better-looking encode — ranking is the thing metrics are explicitly graded on (that is what SROCC measures). It can catch a regression: when last week's pipeline produced VMAF 94 on a reference clip and today's produces 81, something broke, and the number found it while you slept. It can track a trend at scale, watching ten thousand assets or a live channel over time and flagging the ones that drift. And it can give a defensible figure for a report or a contract, because the method is written down and the result reproduces.

Notice the common thread: every one of those jobs is relative or comparative. The metric shines when it compares like with like — same content, same resolution, same model, same pooling. That is the green zone. Most metric failures happen when someone carries the number outside it.

How big is the "give or take"? A worked example

Here is the arithmetic that makes the error bar concrete, because once you have seen it you will never read a single VMAF score the same way. Since version 1.3.7 (June 2018), VMAF can report a 95% confidence interval — a range that quantifies how sure the model is — by training many models on resampled data (a technique called bootstrapping) and watching how much their predictions disagree. The tool gives you a score and the standard deviation of those models, and the interval follows a one-line formula:

95% CI  =  score  ±  1.96 × standard deviation

Example (from the VMAF documentation's own output):
  score              = 75.4
  standard deviation = 1.31
  margin             = 1.96 × 1.31 = 2.57
  95% CI             = 75.4 − 2.57  to  75.4 + 2.57
                     = 72.8  to  78.0

Read what that means. A VMAF of 75.4 on this clip is really "somewhere between about 73 and 78, with 95% confidence." So if encode A scores 75 and encode B scores 77, the metric cannot tell them apart: each score falls well inside the other's interval. Treating a two-point VMAF difference as a real win is reading noise as signal. The VMAF documentation even notes that low scores carry wider intervals than high scores, because the training data was denser at the high end, so the metric is least certain exactly where quality is worst and decisions matter most.
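The same arithmetic can be written as a short Python sketch. The function names are hypothetical, and the overlap test is a deliberately conservative rule of thumb rather than a formal significance test. The example also assumes, for illustration only, that both encodes share the same standard deviation.

```python
Z_95 = 1.96  # two-sided 95% multiplier for a normal distribution


def ci95(score, stddev):
    """Return (low, high) for score +/- 1.96 * stddev."""
    margin = Z_95 * stddev
    return score - margin, score + margin


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# The worked example: score 75.4, standard deviation 1.31.
low, high = ci95(75.4, 1.31)
print(f"95% CI: {low:.1f} to {high:.1f}")

# Two encodes two points apart (same stddev assumed for illustration).
encode_a = ci95(75.0, 1.31)
encode_b = ci95(77.0, 1.31)
if intervals_overlap(encode_a, encode_b):
    print("Gap is within the error bar: do not declare a winner.")
else:
    print("Gap exceeds the error bar.")
```

A gate like this only says when a difference is too small to trust. A gap that clears the interval is still worth a spot check by eye when the stakes are high.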

Common mistake: ranking encodes by differences smaller than the error bar. "We picked encoder B; it scored 0.6 VMAF higher" is a decision made on noise. Before you declare a winner, check whether the gap exceeds the confidence interval (or run a proper subjective test when the gap is small and the stakes are high). A metric with no error bar attached is a metric you are over-trusting.

What a metric cannot tell you

Now the other side of the line. A quality score has five blind spots that recur in real projects, and naming them is the entire point of this article.

[Figure: five cards showing worst frames, untrained content, wrong viewing condition, out-of-scope artifacts, and no absolute verdict.]

Figure 2. Five things a single quality number will not tell you. Each is a place where a confident-looking score quietly stops meaning what you think it means.

1. It cannot tell you about the worst few seconds

A clip-level score is a summary of hundreds of per-frame scores, and summarizing throws information away. By default most tools, VMAF included, report the arithmetic mean of the per-frame scores. But the VMAF documentation itself warns that humans weigh the worst moments more heavily than a simple average does — a single ugly stall or a smeared scene is what the viewer remembers, even if every other frame was flawless. A mean of 92 can hide a five-second stretch at 60.

This is why pooling — the rule for turning per-frame scores into one number — matters as much as the metric. The harmonic mean weights low frames more; the 5th-percentile score reports the quality of the worst 5% of frames directly. The single mean number will never volunteer that the bad scene exists. You have to ask for the worst-frame statistic, and most teams never do.

[Figure: per-frame VMAF dips in one scene; mean, harmonic mean and 5th-percentile pooling lines sit at very different heights.]

Figure 3. The same clip, three honest numbers. The arithmetic mean sits comfortably high; the 5th percentile reports the scene the viewer will actually complain about.
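A small Python sketch shows how far the three pooling rules can drift apart on the same per-frame log. The frame scores are invented: 270 clean frames and a 30-frame bad scene. The `+1` offset in the harmonic mean is an assumption that guards against division by zero on a zero-scored frame; check your tool's documentation for the convention it actually uses.

```python
import math


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


def harmonic_mean(scores, offset=1.0):
    """Harmonic mean with an offset so a zero score cannot divide by zero."""
    return len(scores) / sum(1.0 / (s + offset) for s in scores) - offset


def percentile(scores, pct):
    """Percentile with linear interpolation between neighbouring frames."""
    ordered = sorted(scores)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical per-frame scores: mostly clean, one damaged scene.
frame_scores = [95.0] * 270 + [60.0] * 30

print(f"mean           = {arithmetic_mean(frame_scores):.1f}")
print(f"harmonic mean  = {harmonic_mean(frame_scores):.1f}")
print(f"5th percentile = {percentile(frame_scores, 5):.1f}")
print(f"minimum        = {min(frame_scores):.1f}")
```

On this data the arithmetic mean stays above 90 while the 5th percentile and the minimum both land on the damaged scene, which is why the worst-frame statistic is worth reporting next to the average.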

2. It cannot judge content it was never trained on

A learned metric like VMAF knows what its training set taught it, and it was trained mostly on conventional encoded film and TV showing compression and scaling artifacts. Push it onto content outside that experience and the prediction degrades silently, with no warning flag. The clearest case is banding, those visible staircase steps in what should be a smooth sky or gradient. Banding sits on sparse contours while most of the frame looks fine, so a full-frame metric barely notices it; measured against human scores, VMAF v0 reaches a rank correlation of only about 0.34 on banding, where 1.0 would be perfect. Film grain, screen-recorded text, animation, and heavy chroma (color) artifacts are similar weak spots. VMAF v0 reads only the luma (brightness) channel, so it is structurally blind to color-only damage.

Common mistake: trusting a full-reference score on content it never saw. If you are measuring animation, screen capture, grainy film, or anything with strong color shifts, a default VMAF number is an educated guess at best. Validate the metric on a sample of your content against human eyes before you trust it across a catalog — see how metrics are validated against human scores.

3. It cannot tell you what it was never validated to tell you

Every perceptual metric bakes in assumptions about the viewing condition — screen size and distance — and the score only means what the model was trained to mean. The VMAF documentation gives the sharpest example: compute the default model on a native 480p clip and the score comes back surprisingly high, because the model effectively assumes you are sitting far enough away that a 480p frame looks like a small region of a 1080p screen. The documentation calls comparing a native-480p score against a native-1080p score an "apples-to-oranges" comparison in plain words. There are separate models for 4K screens and for phone viewing precisely because one model cannot speak for every condition. Use the wrong model and the number is confidently wrong.

4. It cannot see artifacts outside its scope

VMAF was designed for adaptive streaming, so it models two kinds of damage: compression artifacts and scaling artifacts. The documentation states plainly that other impairments (packet loss, transmission errors, and the like) "may be predicted inaccurately." Temporal problems are a known weak area too: flicker, ghosting, and judder live in how frames change over time, and VMAF's temporal feature is a basic motion measure that does not capture them well. A stream with a packet-loss glitch can still post a healthy VMAF score, because the metric is not looking for that kind of wound.

5. It cannot tell you "is this good enough" on its own

Ranking is reliable; an absolute verdict is not. The VMAF FAQ notes that comparing a video with itself does not even return a perfect 100 — you get something like 98.7, because a machine-learning predictor has residual error everywhere. So "VMAF 93 is good" is only meaningful once you have anchored, on your content and your devices, what score corresponds to a quality your viewers accept. The number is a thermometer reading; whether the room is comfortable is a judgment you still have to make.
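One way to anchor a score, sketched in Python under loud assumptions: you have run a small subjective test on your own content and devices, and you hold pairs of (metric score, MOS). The function name, the sample values, and the acceptance level of MOS 4.0 are all hypothetical. The rule is intentionally strict: it returns the lowest score at or above which every sampled clip was rated acceptable.

```python
def acceptance_threshold(samples, min_mos):
    """Lowest metric score s such that every sampled clip scoring >= s
    was rated at least min_mos by viewers. Returns None when even the
    top-scoring clip was rated below min_mos."""
    threshold = None
    # Walk from the highest metric score downward and stop at the first
    # clip viewers rejected; everything above that point was accepted.
    for score, mos in sorted(samples, reverse=True):
        if mos < min_mos:
            break
        threshold = score
    return threshold


# Hypothetical subjective sample: (metric score, mean opinion score).
samples = [
    (96, 4.6),
    (93, 4.3),
    (90, 4.1),
    (88, 3.6),
    (85, 3.9),
    (80, 3.2),
]

print(acceptance_threshold(samples, min_mos=4.0))
```

A threshold found this way holds only for the content, devices, model, and pooling used in the test; change any of them and the anchor has to be re-checked.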

The deepest trap: optimizing for the metric

There is one mistake worse than misreading a score: making the score your goal. The economist Charles Goodhart's law, in Marilyn Strathern's well-known phrasing, warns that "when a measure becomes a target, it ceases to be a good measure." Video quality is a textbook case. Because every metric is an imperfect proxy, there is always a gap between the number and the perception — and anything you do to chase the number through that gap improves the score without improving the picture.

This is not theoretical. Image-enhancement tricks applied before encoding — sharpening, contrast boosts, histogram equalization — reliably raise a VMAF score, and at least one widely used encoder mode (libaom's tune=vmaf) was shown to win much of its measured gain by sharpening the frame before compression, not by compressing better. Netflix built a second model, VMAF-NEG (No Enhancement Gain), specifically to subtract that kind of inflation and report a more conservative score; the VMAF-NEG article covers it in depth. And the honesty goes one level deeper still: research that set out to "hack" both VMAF and VMAF-NEG found preprocessing that fools even the no-enhancement-gain model. The lesson is not that VMAF is broken — it is that any target can be gamed, so the metric must stay a measurement, never the objective.

[Figure: pre-encode sharpening lifts VMAF with no real gain; VMAF-NEG subtracts it, with Goodhart's law captioned below.]

Figure 4. Sharpen before you encode and VMAF rises while the picture does not. When a measure becomes a target, it stops measuring, which is why a conservative model like VMAF-NEG exists.

When the metric and a careful human viewing disagree, the human wins. The metric did not reveal a hidden truth; it failed on that content, and the subjective result is the ground truth it was always trying to approximate.

Four metrics, side by side

Because this is the measurement-honest section, the table names not just what each metric measures but where it lies. No metric is "best"; each is a tool with a documented blind spot.

Metric	What it measures	What it can tell you	Where it lies (the blind spot)
PSNR (dB)	Pixel-by-pixel error vs the original, in decibels	A fast fidelity check; large drops are real damage	Weak correlation with the eye; treats all pixel errors equally, ignores structure — high PSNR can still look bad
SSIM (0–1)	Structural similarity — luminance, contrast, structure	Tracks perceived quality far better than PSNR	Single-scale; misses some motion and color issues; not on a perceptual 0–100 scale
VMAF (0–100)	Fused features trained to predict human opinion	The industry default for ranking streaming encodes	Banding, chroma, grain, temporal flicker, packet loss; needs the right model; gameable by enhancement
No-reference (varies)	Quality from the impaired video alone, no original	The only option for live and UGC with no reference	A trend signal, not absolute ground truth; weaker correlation than full-reference

Table 1. The right-hand column is the one most articles omit. Choosing well means matching the metric to the job and knowing its blind spot before you quote it.

How to keep a metric honest

The discipline that follows from all of this is short, and it is the difference between measurement and measurement theater. Always state the metric with its model and pooling method — "VMAF 93, default model, harmonic-mean pooled" — never a bare number. Report a worst-frame statistic (the 5th percentile or the minimum) next to the average, so the bad scene cannot hide. Keep every comparison apples-to-apples: same resolution, same frames, same reference, same model, same pooling. Attach the confidence interval when the gap between two encodes is small. And spot-check with human eyes on a sample, especially on content the metric was never trained for. A number that survives all five of those checks is one you can put in front of a budget owner; a number that skips them is an opinion wearing a lab coat.
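The first and third rules can be enforced in code rather than by memory. The sketch below is a hypothetical record type, not part of any VMAF tooling: it refuses to produce a bare number, and it only declares two scores comparable when metric, model, pooling, and resolution all match. The field names and example values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreReport:
    metric: str        # e.g. "VMAF"
    model: str         # e.g. "vmaf_v0.6.1"
    pooling: str       # e.g. "harmonic-mean"
    resolution: str    # e.g. "1920x1080"
    score: float       # pooled clip-level score
    worst_5pct: float  # 5th-percentile per-frame score

    def label(self):
        """The only way to print the score: always with its context."""
        return (
            f"{self.metric} {self.score:.1f} "
            f"({self.model}, {self.pooling} pooled, {self.resolution}; "
            f"5th percentile {self.worst_5pct:.1f})"
        )

    def comparable_with(self, other):
        """Apples-to-apples check: same metric, model, pooling, resolution."""
        return (
            (self.metric, self.model, self.pooling, self.resolution)
            == (other.metric, other.model, other.pooling, other.resolution)
        )


a = ScoreReport("VMAF", "vmaf_v0.6.1", "harmonic-mean", "1920x1080", 93.0, 81.5)
b = ScoreReport("VMAF", "vmaf_v0.6.1", "mean", "1920x1080", 94.2, 80.9)

print(a.label())
print("comparable:", a.comparable_with(b))
```

Here the two reports differ only in pooling, and that alone is enough to make the comparison unsafe.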

Living proof: metrics evolve because of their blind spots

The best evidence that these limits are real is that the field keeps fixing them in public. In June 2026 Netflix released VMAF v1, a new generation of models built to close the exact gaps named above: it integrates the Contrast-Aware Multiscale Banding Index (CAMBI) to finally catch banding, adds chroma-channel features so the metric can see color artifacts it was previously blind to, and drops the heavy VIF feature to run faster. That is the system working as it should — the blind spots are documented, then engineered away, and a new generation inherits a new set of limits to be honest about. A metric is never finished, which is the most important thing it cannot tell you about itself.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, OTT and internet TV, video conferencing, e-learning, telemedicine, and surveillance — and the habit that keeps our quality claims trustworthy is the one this article argues for: we treat every metric as a proxy and name its limits out loud. When a master file exists we measure with full-reference VMAF at a stated model and pooling, and we report a worst-frame percentile alongside the mean so a single bad scene cannot pass a quality gate. When there is no reference — a live conferencing or surveillance feed — we lean on no-reference signals and read them as trends, not verdicts. Our benchmark methodology documents the model, pooling, and conditions behind every number, so a client can see not only what we measured but what the measurement could not see.

To put the worst-frame idea in your own hands, we built a small metric-report sanity check — point it at a per-frame VMAF or PSNR log and it prints the mean, the harmonic mean, the median, the worst-5%/worst-10% percentiles, and the minimum side by side, then flags how far the average sits above the worst frames so the number you quote is the honest one.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your quality-measurement plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.

References

Recommendation ITU-T P.1401 (01/2020), Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunication Union. Tier 1 (official standard). Defines PCC, SROCC, RMSE, and outlier ratio as the procedure for grading any objective metric against subjective MOS. https://www.itu.int/rec/T-REC-P.1401
Recommendation ITU-T P.910 (2023), Subjective video quality assessment methods for multimedia applications. International Telecommunication Union. Tier 1 (official standard). The subjective methods that are the ground truth every objective metric is validated against. https://www.itu.int/rec/T-REC-P.910
Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity (SSIM)," IEEE Transactions on Image Processing, vol. 13, no. 4, 2004. Tier 1 (metric author). The argument that pixel error (PSNR) correlates weakly with perception and that structure tracks the eye better. https://ece.uwaterloo.ca/~z70wang/publications/ssim.html
Netflix / VMAF project, Frequently Asked Questions (resource/doc/faq.md), accessed 2026-06-22. Tier 3 (metric author / first-party). Source for the viewing-distance assumption, the native-480p-vs-1080p "apples-to-oranges" warning, the 4K-model caveat, default arithmetic-mean pooling vs worst-frame weighting, the compression-and-scaling-only artifact scope, and the ~98.7 self-comparison. https://github.com/Netflix/vmaf/blob/master/resource/doc/faq.md
Netflix / VMAF project, VMAF Confidence Interval (resource/doc/conf_interval.md), accessed 2026-06-22. Tier 3 (metric author / first-party). The 95% CI via bootstrapping since v1.3.7 (June 2018), the score ± 1.96 × stddev formula, and the note that low scores carry wider intervals. https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md
Netflix Technology Blog, "Toward a Better Quality Metric for the Video Community," 2020. Tier 4 (credible deployer). VMAF-NEG and how sharpening, contrast, and histogram equalization inflate a VMAF score without improving real quality; libaom tune=vmaf gains largely from pre-sharpening. https://netflixtechblog.com/toward-a-better-quality-metric-for-the-video-community-7ed94e752a30
Netflix Technology Blog, "VMAF v1: Good Is Not Good Enough," June 2026. Tier 4 (credible deployer). The v1 model generation integrates CAMBI for banding, adds chroma-channel features, and removes VIF — closing the banding/chroma blind spots of v0. https://medium.com/netflix-techblog/vmaf-v1-good-is-not-good-enough-60d7e4244ea8
M. Siniukov, A. Antsiferova, D. Kulikov, D. Vatolin, "Hacking VMAF and VMAF NEG: Vulnerability to Different Preprocessing Methods," ACM proceedings, 2021 (arXiv:2107.04510). Tier 5 (peer-reviewed). Demonstrates preprocessing that inflates both VMAF and the no-enhancement-gain model, evidence that any target can be gamed. https://arxiv.org/abs/2107.04510
Subjective and objective study of banding artifacts on compressed video, 2025 (arXiv:2508.08700). Tier 5 (peer-reviewed). Reports that general full-reference models, VMAF included, correlate poorly with human banding scores (VMAF SROCC ≈ 0.34). https://arxiv.org/abs/2508.08700
Q. Huynh-Thu and M. Ghanbari, "Scope of validity of PSNR in image/video quality assessment," Electronics Letters, vol. 44, no. 13, 2008. Tier 5 (peer-reviewed). PSNR is only a valid quality indicator when content and codec are fixed — it cannot be compared across different content. https://digital-library.theiet.org/content/journals/10.1049/el_20080522
M. Strathern, "'Improving ratings': audit in the British University system," European Review, 1997 (the widely cited modern phrasing of Goodhart's law). Tier 6 (orientation). "When a measure becomes a target, it ceases to be a good measure." https://doi.org/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4

Related glossary terms

VMAF

Banding

PSNR

Pooling

VMAF-NEG

Chroma

SSIM

Confidence interval