Reading a Video Quality Report Without Fooling Yourself

Why this matters

You will read far more quality reports than you produce: a vendor's codec benchmark, a teammate's encoder A/B, a dashboard number, a paper's results table. Each one is asking you to make a decision — ship this encoder, accept this release, trust this claim — and each can be technically correct and still misleading. This article is for the encoding lead weighing a codec switch, the QA engineer signing off a build, the streaming engineer reading a delivered-quality dashboard, and the product owner who has to trust the figure underneath. The skill it builds is cheap insurance: a few questions, asked every time, that stop a number from making a decision you would not have made with the full picture.

A score is a sentence with the subject removed

Start with the habit that fixes everything else. A quality score is not a fact about a video; it is the answer a specific instrument gave under specific conditions. "VMAF 95" is like the sentence "scored 95" with no subject, no test, and no scale: unreadable until you supply the missing parts. VMAF (Video Multimethod Assessment Fusion) fuses several perceptual features into a 0-to-100 prediction of how a viewer would rate a video, and the number it returns depends on which trained model you ran, how you collapsed thousands of per-frame numbers into one, and whether the file you measured even had a clean original to compare against.

So reading a report is not "is the number high?" It is "what does this number actually claim, and does the claim survive scrutiny?" The rest of this article is six questions that turn a bare score back into a full sentence. Ask them in order; the first one that fails usually settles the matter.

Figure 1. The same report, read twice. A bare "VMAF 95" (left) hides everything that gives it meaning; a complete report (right) names the metric and implementation, the model, the pooling, the spread, the content, and the reference.

Question 1 — What metric is this, and whose implementation produced it?

First, name the metric and what it measures, because the metrics are not interchangeable and not on one scale. PSNR (Peak Signal-to-Noise Ratio), measured in decibels, reports how much the pixels differ from the original; it says nothing about whether the difference is visible. SSIM (the Structural Similarity Index), on a 0-to-1 scale, compares local luminance, contrast, and structure. VMAF, on a 0-to-100 scale, predicts a human rating. A report that says "quality 0.95" without naming the metric is unreadable: 0.95 is an excellent SSIM and an impossible VMAF. If you only need the one-screen version of each metric while tuning an encode, the Video Encoding section's combined metrics overview is the short form; this article is the deep dive on how to read the numbers it produces.

Then ask the question almost everyone skips: whose implementation? The same metric name can produce different numbers in different tools. Netflix's own SSIM, for example, includes an empirical downsampling step that the SSIM filter in FFmpeg does not, so the two report different SSIM values for the same pair (Netflix VMAF FAQ, 2026). PSNR is worse: the VMAF package caps PSNR at 60 dB for 8-bit and 72 dB for 10-bit content using the rule of thumb 6·N + 12, while another tool may report a higher or unbounded value for an identical frame (Netflix VMAF FAQ, 2026). The practical rule: an SSIM or PSNR number is only comparable to another produced by the same tool and version. Two SSIM columns from two tools are two different rulers.
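The PSNR ceiling above is simple arithmetic, and a short sketch makes the "different rulers" point concrete. This is not the VMAF package's code: the function names are hypothetical, and the peak value of 2^N − 1 is the usual PSNR convention assumed here. Only the 6·N + 12 rule and the 60/72 dB caps come from the text.

```python
import math

def psnr_cap_db(bit_depth: int) -> int:
    """Rule-of-thumb PSNR ceiling from the article: 6*N + 12 dB."""
    return 6 * bit_depth + 12

def capped_psnr(mse: float, bit_depth: int = 8) -> float:
    """PSNR in dB from a mean squared error, clipped to the ceiling above.

    Assumes the conventional peak value of 2**bit_depth - 1.
    """
    peak = (1 << bit_depth) - 1
    cap = psnr_cap_db(bit_depth)
    if mse <= 0:
        # Identical frames: an uncapped tool would report infinity here.
        return float(cap)
    return min(10.0 * math.log10(peak * peak / mse), float(cap))

print(psnr_cap_db(8), psnr_cap_db(10))   # 60 72
print(capped_psnr(1e-3, bit_depth=8))    # 60.0 (uncapped value would be about 78 dB)
print(round(capped_psnr(1.0, bit_depth=8), 2))
```

A tool without the cap would report roughly 78 dB for the second case, so two PSNR columns from different tools can disagree on a near-identical frame without either being wrong.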

Question 2 — Which model produced the score?

For VMAF, the model is part of the number, and omitting it is the most common way a report misleads. The default VMAF model predicts quality for a 1080p TV viewed at three times the screen height (about 60 pixels per degree); the phone model assumes a small handheld screen and reads several points higher for the same file; the 4K model predicts a 4K TV viewed at 1.5 times the screen height (Netflix VMAF models documentation, 2026; Netflix VMAF FAQ, 2026). These are different questions with different answers. A "VMAF 93" from the phone model and a "VMAF 93" from the default model are not the same quality — the phone-model file is visibly worse on a television.

So a delivered-quality report must name the model, or its number means "somewhere in a 4–6 point band, depending on which model I chose and did not tell you." When a report omits the model, treat the score as provisional and ask. The deeper treatment of model selection lives in VMAF in depth.

Question 3 — How were the per-frame scores pooled?

Every full-reference metric produces one score per frame; the single headline number is a summary of thousands of them, and the summary method changes the story. By default VMAF reports the arithmetic mean of the per-frame scores, for simplicity and consistency with mean PSNR (Netflix VMAF FAQ, 2026). But the mean is exactly the statistic that hides a short disaster: a clip that is excellent for 59 seconds and falls apart for one can post a mean in the 90s while a viewer remembers only the bad second.

Because human opinion weighs the worst moments more heavily, the VMAF tools expose other pooling methods — harmonic mean, median, min, and the 5th, 10th, and 20th percentiles (Netflix VMAF FAQ, 2026). A report that quotes only the mean is answering "how good was the average frame?" when the decision usually depends on "how bad was the worst stretch?" The fix when you read one is to ask for a low percentile or the min alongside the mean. A mean of 93 with a 5th-percentile of 71 is a different release decision than a mean of 93 with a 5th-percentile of 90. The pooling article covers why the worst seconds dominate perceived quality.
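To see how much the pooling choice moves the headline, here is a minimal sketch over made-up per-second scores: a 60-second clip that is excellent except for a four-second bad stretch. The helper names and the scores are illustrative assumptions, the percentile uses the simple nearest-rank definition, and the VMAF tools' own pooling may differ in detail.

```python
import statistics

def percentile_nearest_rank(scores, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(scores)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]

def pool_scores(scores):
    """Collapse per-frame (or per-second) scores with several pooling methods."""
    return {
        "mean": statistics.mean(scores),
        "harmonic_mean": statistics.harmonic_mean(scores),
        "perc5": percentile_nearest_rank(scores, 5),
        "min": min(scores),
    }

# Illustrative data: 56 good seconds and a 4-second collapse.
per_second = [96.0] * 56 + [40.0] * 4
pooled = pool_scores(per_second)
for name, value in pooled.items():
    print(f"{name:>14}: {value:6.2f}")
```

The mean stays above 92 while the 5th percentile and the min both read 40. A drop that covers fewer than 5% of the frames would not reach the 5th percentile at all, which is a reason to ask for the min as well.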

Question 4 — What is the spread? Read the confidence interval

A VMAF score is a prediction from a model trained on a sample of human ratings, so it carries uncertainty — and VMAF can report that uncertainty as a 95% confidence interval using a bootstrapped model (Netflix VMAF confidence-interval documentation, 2026). The confidence interval is the range the true score is likely to fall in. Ignore it and you will mistake noise for a result.

Here is the arithmetic, using the figures from Netflix's own example. A bootstrapped run reports a score and a standard deviation:

BOOTSTRAP_VMAF_score        = 75.44
BOOTSTRAP_VMAF_stddev_score =  1.31

Assuming a normal distribution, the 95% confidence interval is the score plus or minus 1.96 standard deviations:

95% CI = 75.44 ± 1.96 × 1.31
       = 75.44 ± 2.57
       = [72.87, 78.01]

So "VMAF 75.4" is really "somewhere between about 73 and 78, with 95% confidence." Now apply that to a comparison. Suppose a report declares encoder B the winner:

Encoder A : VMAF 93.0   (95% CI 91.5 – 94.5)
Encoder B : VMAF 94.0   (95% CI 92.4 – 95.6)

The intervals overlap from 92.4 to 94.5, and each encoder's score falls inside the other's interval. The one-point "win" sits well inside the noise, so the honest reading is "no measurable difference at this sample size," not "B is better." A report that quotes single scores to one decimal place and declares a winner, with no interval and no spread, is inviting you to over-read it. Note too that intervals are not uniform: in VMAF's training data the high-score region is denser, so high scores carry tighter intervals than low ones (Netflix VMAF confidence-interval documentation, 2026).
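The same check can be scripted when a report gives scores and standard deviations. The sketch below reuses the numbers from this section; the function names are hypothetical, and the overlap test is a quick reading aid under the normal-distribution assumption, not a formal significance test.

```python
def ci95(score, stddev, z=1.96):
    """95% interval as score +/- z * stddev, assuming a normal distribution."""
    half_width = z * stddev
    return (score - half_width, score + half_width)

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def inside(value, interval):
    """True when a point estimate lies within an interval."""
    return interval[0] <= value <= interval[1]

low, high = ci95(75.44, 1.31)
print(f"{low:.2f} {high:.2f}")   # 72.87 78.01

enc_a = {"score": 93.0, "ci": (91.5, 94.5)}
enc_b = {"score": 94.0, "ci": (92.4, 95.6)}
print(intervals_overlap(enc_a["ci"], enc_b["ci"]))   # True
print(inside(enc_a["score"], enc_b["ci"]), inside(enc_b["score"], enc_a["ci"]))   # True True
```

When both point estimates sit inside the other encoder's interval, as they do here, calling a winner needs more data or a proper paired test on the same content.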

Figure 2. A one-point win inside the error bars is not a win. Encoder A (93.0) and Encoder B (94.0) have overlapping 95% confidence intervals, so the difference is not measurable at this sample size.

Question 5 — Is the comparison apples-to-apples?

A metric comparison is only valid when everything except the thing under test is held constant: same source frames, same reference, same resolution at which the metric was computed, same model, same pooling. Break any one and the comparison is meaningless, however precise the numbers look.

Resolution is the trap that catches the most people. Because VMAF models a fixed viewing distance, scoring a low-resolution pair at its native resolution behaves as if the small frame were cropped from a larger one and viewed from much farther away, which hides artifacts and inflates the score. Netflix's own FAQ works the numbers: a 480-line pair scored natively models a viewing distance of 6.75 times the frame height instead of the intended 3, "going to hide a lot of artifacts, hence yielding a very high score," and so "one should NOT compare the absolute VMAF score of a 1080 video with the score of a 480 video obtained at its native resolution" (Netflix VMAF FAQ, 2026). The correct method is to upscale the distorted video to the reference resolution and compute VMAF there.
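The 6.75 figure follows from a one-line proportion, sketched here under the assumption stated in the text that the default model expects a 1080-line frame viewed at 3 times its height. The function name and parameter names are illustrative only.

```python
def modeled_viewing_distance(frame_height, model_height=1080, model_distance_h=3.0):
    """Viewing distance, in frame heights, that the model implicitly assumes
    when a frame of `frame_height` lines is scored at its native resolution."""
    return model_distance_h * model_height / frame_height

print(modeled_viewing_distance(1080))  # 3.0
print(modeled_viewing_distance(720))   # 4.5
print(modeled_viewing_distance(480))   # 6.75
```

The smaller the frame scored natively, the farther back the model effectively places the viewer, which is why low rungs scored at native resolution look better than they are.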

The effect is large enough to flip a decision. The same 480p encode can read far higher scored natively than scored fairly against its 1080p source — the file did not change, only the measurement did.

Same 480p encode, two ways to measure (illustrative):
   Scored natively at 480p              → VMAF 96   (models a 6.75H viewing distance)
   Upscaled to 1080p, scored vs source  → VMAF 82   (models the intended 3H distance)

When you read a multi-resolution comparison, the first thing to check is the resolution the metric ran at. If the rungs of a bitrate ladder were each scored at their own native resolution, the comparison is invalid and the low rungs are flattered.

Figure 3. The same file, two numbers. Scoring a 480p encode at its native resolution inflates VMAF because the model assumes a farther viewing distance; scoring it upscaled against the 1080p source is the apples-to-apples number.

Question 6 — Does the content, or a missing reference, break the metric?

Finally, ask whether the metric was even applicable to this content. Every objective metric is a proxy validated against human ratings on certain material, and each has content where it lies. VMAF was designed for compression and scaling artifacts in adaptive streaming; for other impairments such as packet loss or transmission errors, "the perceptual quality ... MAY be predicted inaccurately" (Netflix VMAF FAQ, 2026). PSNR, SSIM, and VMAF v0.6.1 barely register banding on smooth skies; grain and heavy texture confuse full-reference metrics; dark and HDR scenes hide errors the eye catches. A high score on banding-prone or grainy content is not evidence the content looks good — it is evidence the metric is blind here. The blind-spot catalogue maps which metric fails on which content.

The other half of this question is whether there was a real reference. PSNR, SSIM, and VMAF are full-reference metrics: they need the pristine original, frame-aligned with the distorted file. A surprising number of bad scores come not from a bad encode but from a broken reference — a one-frame offset between distorted and reference, or missing color metadata that makes the tool interpret the pixels in the wrong color space and tanks the score (Ozer, Streaming Learning Center, 2024). Before you trust a low number, confirm the two files were aligned and tagged correctly. And if the content is live or user-generated, there is no original at all, so any full-reference number is not weak — it is impossible, and the report should be using a no-reference metric instead. The three measurement setups article covers that distinction.

The reviewer's checklist, in one table

Keep this beside any report. The left column is what the report shows; the middle is what to verify; the right is how it fools you if you skip the check.

What the report shows	What to check	How it fools you
A bare score ("VMAF 95")	Metric named? Implementation/tool named?	0.95 SSIM ≠ 0.95 anything-else; two tools' SSIM differ
A VMAF number	Which model — default, phone, or 4K?	Phone model reads several points high vs a TV
One headline number	Pooling — mean, or a low percentile/min?	Mean hides a short bad stretch a viewer remembers
"A beats B by 1 point"	Confidence interval / spread shown?	A 1-point gap sits inside the error bars — it's noise
A multi-resolution table	Resolution the metric ran at	Native low-res scores are inflated by viewing distance
A high score on hard content	Banding, grain, dark, packet loss?	The metric is blind there; high ≠ good
Any full-reference score	Real, frame-aligned, correctly tagged reference?	A misalignment or color-tag bug tanks a good encode

Figure 4. Six questions for any report. Each turns a bare number back into a full claim; the first question that fails usually settles whether to trust the report.
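If reports arrive as structured data, the checklist can be applied mechanically. The following is a minimal sketch, not a standard schema: every field name in it is a hypothetical choice made for illustration, and a real pipeline would map its own report format onto these questions.

```python
# Hypothetical field names; adapt them to whatever your reports actually contain.
REQUIRED_FIELDS = {
    "metric": "Which metric is this (PSNR, SSIM, VMAF)?",
    "tool": "Which implementation and version produced it?",
    "model": "Which VMAF model (default, phone, 4K)?",
    "pooling": "How were the per-frame scores pooled?",
    "ci95": "What is the 95% confidence interval or spread?",
    "scored_resolution": "At what resolution was the metric computed?",
    "content_set": "What content was measured?",
    "reference": "Was there a frame-aligned, correctly tagged reference?",
}

def missing_declarations(report):
    """Return the checklist questions a report leaves unanswered."""
    is_vmaf = str(report.get("metric", "")).upper() == "VMAF"
    missing = []
    for field, question in REQUIRED_FIELDS.items():
        if field == "model" and not is_vmaf:
            continue  # the model question only applies to VMAF scores
        if not report.get(field):
            missing.append(question)
    return missing

bare = {"metric": "VMAF", "score": 95.0}
for question in missing_declarations(bare):
    print("-", question)
```

A bare "VMAF 95" leaves seven of the eight questions open, which is the point of the table: the score is the least informative field in the report.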

A note on BD-rate reports

One number deserves its own warning because it is so often misread: BD-rate (the Bjøntegaard Delta rate). It is not a quality score at all — it is the average bitrate difference between two encoders at equal quality, computed from their rate-quality curves (Bjøntegaard, VCEG-M33, 2001). The sign trips people: a negative BD-rate means the test encoder needs fewer bits for the same quality (a win), and a positive one means it needs more. A report that says "BD-rate −30%" is claiming a 30% bitrate saving at matched quality, not a 30% quality gain. And like every metric above, a BD-rate is only valid with the underlying metric, model, content set, and anchor named. The BD-rate article works through it with our own numbers.
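A small sketch shows where the sign comes from. Bjøntegaard's note fits a cubic polynomial to each curve; to stay dependency-free this example swaps in piecewise-linear interpolation of log-bitrate over quality, so treat it as an approximation of the idea rather than the reference calculation. The rate-quality points are invented, with the test encoder reaching each quality level at 70% of the anchor's bitrate.

```python
import math

def _integrate_log_rate(points, lo, hi):
    """Integrate log10(bitrate) over quality in [lo, hi].

    `points` is a list of (bitrate, quality) pairs with distinct qualities;
    log-bitrate is interpolated linearly between them (an assumption here).
    """
    pts = sorted((q, math.log10(r)) for r, q in points)

    def interp(q):
        for (q0, y0), (q1, y1) in zip(pts, pts[1:]):
            if q0 <= q <= q1:
                return y0 + (y1 - y0) * (q - q0) / (q1 - q0)
        raise ValueError("quality outside the measured curve")

    knots = [lo] + [q for q, _ in pts if lo < q < hi] + [hi]
    return sum((b - a) * (interp(a) + interp(b)) / 2 for a, b in zip(knots, knots[1:]))

def bd_rate_percent(anchor, test):
    """Average bitrate difference of `test` vs `anchor` at equal quality, in percent."""
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    diff = _integrate_log_rate(test, lo, hi) - _integrate_log_rate(anchor, lo, hi)
    return (10 ** (diff / (hi - lo)) - 1) * 100

# Illustrative (kbps, VMAF) points only.
anchor = [(1000, 80.0), (2000, 88.0), (4000, 93.0), (8000, 96.0)]
test = [(700, 80.0), (1400, 88.0), (2800, 93.0), (5600, 96.0)]
print(f"{bd_rate_percent(anchor, test):.1f}%")   # -30.0%
```

Swapping the two arguments gives a positive figure of about 42.9%, not 30%, because the percentage is always expressed relative to whichever encoder is named as the anchor.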

Common mistakes when reading a report

The perfect-score assumption. It is natural to expect a file compared against itself to score VMAF 100. It does not — a machine-learned predictor typically returns something like 98.7 for an identical pair (Netflix VMAF FAQ, 2026). So "not 100" does not mean "degraded," and a report that flags a 98.7 self-comparison as a defect is misreading its own metric.

Beyond that, four mistakes recur. The first is reading the mean alone and missing the worst seconds — always look for a percentile or min. The second is comparing across metric scales, as if "SSIM 0.95 beats VMAF 90" meant anything; they are different rulers. The third is comparing across implementations or resolutions, where the same content yields different numbers for reasons that have nothing to do with quality. The fourth is treating a single objective number as ground truth: every metric is a proxy for a human rating, and when a number and a careful viewer disagree, the subjective test is right and the metric missed. The deeper "how do we even know a metric tracks the eye" question — Pearson and Spearman correlation, RMSE, the ITU-T P.1401 procedure — is covered in validating metrics against human scores.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and most quality reports we read come from outside: a codec vendor's benchmark, a client's existing pipeline, an open-source comparison. We read every one with the six questions above before it informs a decision, because a flattering number measured at native low resolution or pooled by the mean has cost more than one team a wrong codec choice. When we publish our own numbers, we attach the metric, model, pooling, confidence interval, content set, and date so a reader can apply the same scrutiny to us; the standing rules are written down in our benchmark methodology. The discipline is symmetric: read others' reports skeptically, and make your own survive the same reading.

References

Netflix, "VMAF Confidence Interval" (resource/doc/conf_interval.md), Netflix/vmaf GitHub repository, accessed 2026-06-23. Tier 1 (metric-author primary). Documents the 95% confidence interval via bootstrapping (since VDK v1.3.7, June 2018), the BOOTSTRAP_VMAF_score / stddev / ci95_low / ci95_high fields, the score ± 1.96·stddev normal-assumption interval, plain vs residue bootstrapping, and that high scores carry tighter intervals. The controlling source for the worked confidence-interval example. https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md
Netflix, "VMAF Frequently Asked Questions" (resource/doc/faq.md), Netflix/vmaf GitHub repository, accessed 2026-06-23. Tier 1 (metric-author primary). Source for the native-low-resolution inflation and 6.75H viewing-distance math, the apples-to-oranges warning, the default arithmetic-mean pooling and the --pool options (mean, harmonic_mean, median, min, perc5/10/20), the compression-and-scaling-only design (packet loss may be inaccurate), the SSIM-implementation difference, the PSNR 60/72 dB caps, and the ~98.7 self-comparison. https://github.com/Netflix/vmaf/blob/master/resource/doc/faq.md
Netflix, "VMAF Models" (resource/doc/models.md), Netflix/vmaf GitHub repository, accessed 2026-06-23. Tier 1 (metric-author primary). Documents the default 1080p-HDTV-at-3H model, the --phone-model (reads higher), and the 4K-at-1.5H model — the basis for "name the model." https://github.com/Netflix/vmaf/blob/master/resource/doc/models.md
Recommendation ITU-T P.1401, "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union, 2020 (T-REC-P.1401-202001). Tier 1 (official standard). Defines the PCC, SROCC, RMSE, epsilon-insensitive RMSE*, and outlier-ratio procedure for judging whether a metric tracks human scores — the statistics a credible correlation claim in a report must use. https://www.itu.int/rec/T-REC-P.1401
Recommendation ITU-T P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, 2023. Tier 1 (official standard). Defines the subjective methods and viewing conditions that are the ground truth every objective number approximates — the basis for confirming a disputed score by eye. https://www.itu.int/rec/T-REC-P.910
G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," ITU-T SG16 VCEG, document VCEG-M33, Austin, 2001. Tier 1 (defining note). Defines BD-rate as the average bitrate difference between two rate-quality curves at equal quality — the basis for reading a BD-rate as a bitrate saving, not a quality score, and for its sign convention. https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc
FFmpeg, "libvmaf" filter (ffmpeg-filters documentation), accessed 2026-06-23. Tier 3 (first-party tooling). Current libvmaf syntax for generating a report — model='version=...', feature='name=psnr|name=float_ssim', pool (min/harmonic_mean/mean), log_fmt=json, n_threads — the source for the report format the linter parses. https://ffmpeg.org/ffmpeg-filters.html#libvmaf
J. Ozer, "Computing VMAF: How Missing Color Metadata Tanked the Score," Streaming Learning Center, 2024. Tier 6 (high-quality educational). A concrete worked case of a correctly encoded file scoring far too low because of missing color tags — the basis for the "confirm the reference is aligned and tagged" check. https://streaminglearningcenter.com/articles/computing-vmaf-its-all-about-the-edge-cases.html
A. Antsiferova, S. Lavrushkin, M. Smirnov, A. Gushchin, D. Vatolin, D. Kulikov, "Objective video quality metrics application to video codecs comparisons: choosing the best for subjective quality estimation," arXiv:2107.10220, 2021. Tier 5 (peer-reviewed/institutional). A systematic study showing how metric version, pooling, and color-component summation change a codec comparison — support for naming every setting before trusting a comparison. https://arxiv.org/abs/2107.10220
Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. Tier 1 (metric-author defining work). Defines SSIM and motivates it by PSNR/MSE's weak correlation with perception — the basis for "the metrics are different rulers, not points on one scale." https://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf

Related glossary terms

VMAF

SSIM

PSNR

Pooling

Confidence interval

BD-rate

Viewing distance

FFmpeg