Why this matters

You can run a flawless quality gate, keep a disciplined regression suite, and monitor production at scale, and still have the work ignored — because the person who signs off on the release reads a number they do not understand, or worse, one that quietly lies to them. This article is for the QA engineer, streaming lead, or technical product owner who has the measurements and now has to put them in front of a client, an executive, or a release manager who will make a go/no-go call. Its job is to make your report believable on the first read and defensible on the second. Getting this wrong does not just lose an argument; it teaches a stakeholder that quality numbers are spin, and that lesson is expensive to undo.

Three readers, one report

A QC report has at least three readers, and they do not want the same thing. The engineer who built the encode wants the evidence — the per-frame curve, the failing segment, the exact model and settings — so they can fix the problem. The product owner or release manager wants the verdict and the risk: ship or hold, and what could go wrong if we ship. The executive or client wants one honest sentence and a number they can repeat in a meeting without being embarrassed later.

A report that serves only one of them fails the other two. A wall of per-frame plots buries the executive; a single green checkmark insults the engineer and hides the risk from the product owner. The fix is to write in layers, the way a good newspaper story does: a headline anyone can read, a short summary under it, then the full evidence below for whoever wants to dig. The headline is the verdict. The summary is the few numbers that matter, with their uncertainty. The body is the evidence and the caveats. Everyone reads as far down as their job requires, and nobody is misled by stopping early — which is the whole design constraint.

[Image: A QC report drawn as four stacked layers: verdict, summary numbers, evidence, and caveats plus provenance]

Figure 1. The anatomy of a QC report that holds up. Each layer serves a different reader, and no layer contradicts the one above it. The verdict is honest because the summary numbers below it are honest, and those are honest because the caveats and provenance are attached.

Pick a summary number that does not lie

The fastest way to lose an engineer's trust is to lead with the arithmetic mean. VMAF (Video Multi-method Assessment Fusion), the metric that compares a compressed video to its pristine original and predicts a human opinion score, is computed per frame. Turning thousands of per-frame scores into one report number is a choice called pooling, and the choice changes the story. The plain mean is the most flattering and the most dangerous, because a few seconds of badly broken frames disappear into thousands of good ones (Twitter/X Engineering, Introducing VMAF percentiles, 2020).

Suppose a clip scores a mean VMAF of 93.2, a number you would happily show a client. Now look at the rest of the distribution: the harmonic mean, which weights low frames more heavily, is 92.9; the 5th percentile is 78; the single worst frame is 71. Those low numbers are a one-second dip at a shot transition that the mean averaged into invisibility. The honest summary is not "VMAF 93"; it is "typically 93, but the worst 5% of frames fall to 78, traced to one transition." VMAF tools ship the pooling options you need for this (mean, harmonic mean, median, min, and low percentiles such as the 5th, 10th, and 20th) precisely so a report can show the floor and not only the average (libvmaf documentation, accessed 2026). The general principle of turning per-frame scores into one number, and where each method lies, is the subject of pooling; for a report, the rule is simpler: always pair a central number with a low-percentile floor.

| Pooling choice | What it tells the reader | Where it lies |
| --- | --- | --- |
| Arithmetic mean | The typical frame's quality | Hides short, severe drops in a sea of good frames |
| Harmonic mean | A central number that leans toward the bad frames | Still a single number; obscures where the drop is |
| 5th / 1st percentile | The quality floor: how bad the worst frames get | Says nothing about how often or how typical |
| Minimum | The single worst frame | One outlier frame can be unrepresentative noise |

Table 1. No single pooled number is the whole truth. A trustworthy report shows a central number and a floor together, and says what the floor is made of.
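To make the pairing of a central number with a floor concrete, here is a minimal Python sketch. It is not libvmaf's implementation: the function name `pool_scores`, the nearest-rank percentile rule, and the frame values are all assumptions chosen for illustration.

```python
import statistics


def pool_scores(frame_scores):
    """Pool per-frame scores into central numbers and a quality floor."""
    ordered = sorted(frame_scores)
    n = len(ordered)
    # Nearest-rank 5th percentile, using integer math to avoid float rounding:
    # rank = ceil(5 * n / 100), clamped to at least 1.
    rank = max(1, (5 * n + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        # harmonic_mean expects non-negative scores.
        "harmonic_mean": statistics.harmonic_mean(ordered),
        "p5": ordered[rank - 1],
        "min": ordered[0],
    }


# Hypothetical clip: 475 good frames plus a 25-frame dip (illustrative values).
scores = [95.0] * 475 + [75.0] * 25
summary = pool_scores(scores)
print(f"typically {summary['mean']:.1f}, "
      f"worst 5% of frames at or below {summary['p5']:.1f}")
```

In this made-up clip the mean stays at 94 while the floor sits at 75, which is the gap a mean-only report would hide.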

Always show the uncertainty

Every objective metric is a model trained to approximate human opinion, so every score carries error, and a report that prints a bare number pretends to a precision the metric does not have. VMAF makes this concrete: since version 1.3.7 (June 2018) it can attach a 95% confidence interval to each prediction, computed by bootstrapping, that is, training many models on resampled data and measuring how much their predictions disagree (Netflix VMAF documentation, VMAF Confidence Interval, accessed 2026). A VMAF of 75.4 comes out of the tool as 75.4 with a 95% interval of roughly 73.0 to 77.4, a spread of about 2.4 points below the score and 2.0 above it. Higher scores have a tighter interval than lower ones, because the training data is denser at the high end.
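The two interval rules described in the VMAF confidence-interval documentation (score ± 1.96·stddev under normality, or the 2.5th and 97.5th percentiles otherwise) can be sketched in a few lines of Python. This is an illustration only, not the tool's code: the function names and the five bootstrap predictions are hypothetical, and a real run would use many more models.

```python
import statistics


def ci95_normal(score, bootstrap_predictions):
    """score +/- 1.96 * stddev of the bootstrap predictions (normality assumed)."""
    half_width = 1.96 * statistics.stdev(bootstrap_predictions)
    return score - half_width, score + half_width


def ci95_percentile(bootstrap_predictions):
    """2.5th and 97.5th percentiles of the bootstrap predictions."""
    # n=40 yields cut points every 2.5%; the first and last are the bounds.
    cuts = statistics.quantiles(bootstrap_predictions, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Hypothetical predictions from five bootstrap models for one clip.
predictions = [74.0, 74.5, 75.0, 75.5, 76.0]
low, high = ci95_normal(75.0, predictions)
p_low, p_high = ci95_percentile(predictions)
print(f"normal: [{low:.2f}, {high:.2f}]  percentile: [{p_low:.2f}, {p_high:.2f}]")
```

The more the bootstrap models disagree, the wider both intervals become, which is why a bare score understates what the metric actually knows.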

That interval is the difference between an honest comparison and a fabricated one. Here is the arithmetic that should govern every "A beats B" claim in a report:

Encoder A: VMAF 93.0, 95% CI [92.0, 94.0]
Encoder B: VMAF 93.6, 95% CI [92.5, 94.7]

Difference in means: 0.6 VMAF
Do the intervals overlap?  [92.0, 94.0] vs [92.5, 94.7]  -> YES
Conclusion: B is NOT proven better than A.

The 0.6-point lead looks like a win in a bar chart, but the intervals overlap heavily, so the report cannot claim B is better — the difference is inside the metric's own noise. As a working rule of thumb, a VMAF gap under about 1 point sits within typical measurement noise, and the perceptual literature treats roughly 6 VMAF points as one just-noticeable difference, so two encodes within a point or two are, to a viewer, the same. The standard that governs how to evaluate and report a metric's accuracy against human scores — Pearson correlation, root-mean-square error, and the outlier ratio computed against a 95% confidence interval — is ITU-T P.1401 (2020); a report that compares quality should respect its central lesson, which is that a difference smaller than the uncertainty is not a difference (ITU-T Rec. P.1401, 2020).
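The overlap rule is simple enough to automate before any "A beats B" sentence reaches a report. The sketch below assumes you already have each score and its 95% interval; the `Score` class and the function names are illustrative, not part of any VMAF tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    label: str
    value: float
    ci_low: float
    ci_high: float


def intervals_overlap(a, b):
    """True when the two confidence intervals share at least one point."""
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


def verdict(a, b):
    """Refuse to name a winner when the intervals overlap."""
    gap = abs(a.value - b.value)
    if intervals_overlap(a, b):
        return f"{a.label} vs {b.label}: gap of {gap:.1f} is inside the noise"
    better = a if a.value > b.value else b
    return f"{better.label} is better by {gap:.1f}"


# The two encoders from the worked example above.
enc_a = Score("Encoder A", 93.0, 92.0, 94.0)
enc_b = Score("Encoder B", 93.6, 92.5, 94.7)
print(verdict(enc_a, enc_b))
```

Non-overlap is a deliberately conservative test, which suits a report whose purpose is to avoid overclaiming.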

Visualizations that do not mislead

A chart is a visual argument, and the same techniques that clarify can deceive (Cairo, How Charts Lie, 2019). The most common deception in a quality report is the truncated axis. Edward Tufte named the measure of it: the Lie Factor, the size of the effect shown in the graphic divided by the size of the effect in the data, which should equal 1 (Tufte, The Visual Display of Quantitative Information, 2001).

Work it through with the two encoders above. Their real difference is 0.6 VMAF on a 0–100 scale. Draw the bars on a y-axis that starts at 92 and ends at 94, and A's bar rises 1.0 unit while B's rises 1.6, so B stands 60% taller than A for a gap that is only 0.6% of the scale:

Real difference:        0.6 / 100  = 0.6% of the scale
Shown on a 92-94 axis:  0.6 / 2    = 30% of the visible height
Lie Factor = 30 / 0.6 = 50

A Lie Factor of 50 means the chart exaggerates the effect fiftyfold — and it does so while every number on it is technically correct. The fix is not subtle: bar charts that encode a value by length must start at zero, because the length is the data and cutting the baseline destroys it (Correll, Bertini, Franconeri, Truncating the Y-Axis, CHI 2020). For a 0–100 metric, draw the full 0–100, or if that flattens the detail, switch to a dot-with-error-bar plot that shows the confidence interval and never implies a length. Three more rules keep a quality chart honest: plot the per-frame timeline, not only the pooled bar, so the reader sees where quality dropped; never cherry-pick the clips you chart — show the test set you actually ran; and do not lean on color alone to carry the verdict, because a red/green pair is invisible to a colorblind reader and reads as judgment rather than measurement.
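The Lie Factor arithmetic above can be wrapped in a small check to run on a chart's axis limits before the figure ships. This sketch follows this article's simplified calculation (share of visible axis divided by share of the full scale); the function names and the 0–100 default scale are assumptions.

```python
def share_of_range(difference, range_min, range_max):
    """Fraction of a range that a given difference occupies."""
    return difference / (range_max - range_min)


def lie_factor(difference, axis_min, axis_max, scale_min=0.0, scale_max=100.0):
    """Effect shown on the drawn axis divided by the effect on the full scale."""
    shown = share_of_range(difference, axis_min, axis_max)
    real = share_of_range(difference, scale_min, scale_max)
    return shown / real


# The 0.6-point gap on a truncated 92-94 axis versus the full 0-100 axis.
truncated = lie_factor(0.6, 92.0, 94.0)
honest = lie_factor(0.6, 0.0, 100.0)
print(f"truncated axis: {truncated:.0f}x, full axis: {honest:.0f}x")
```

With this definition the difference cancels out, so the factor is just the full scale range divided by the visible axis range: the tighter the crop, the bigger the lie.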

[Image: The same VMAF data drawn two ways: a truncated axis exaggerating a tiny gap, and an honest full-scale plot with error bars]

Figure 2. The same two scores, charted twice. On a 92–94 axis the 0.6-point gap looks decisive; on the full 0–100 scale with confidence intervals it is visibly within the noise. Same data, opposite conclusion, which is exactly why the axis is a trust decision, not an aesthetic one.

Name the caveats, or the report is marketing

A metric is only as honest as the conditions you read it under, and a report that omits those conditions is selling, not measuring. Every metric has a blind spot, and the report should name the one that matters for the content in front of it. Full-reference VMAF is computed against the pristine source, so it says nothing about what the viewer actually received after delivery — a separate question answered by production monitoring. VMAF and PSNR can read "excellent" on content that has visible banding in a smooth sky, because the artifact is structurally small. A score computed with the phone model is not comparable to one from the 4K model. And whenever the metric and a careful human viewing disagree, the human is right — the objective number is a proxy that failed on that content, and the subjective test defined in ITU-R BT.500-15 and ITU-T P.910 (2023) is the ground truth it is validated against (ITU-R Rec. BT.500-15, 2023; ITU-T Rec. P.910, 2023).

The practical form of this is a short, standing "caveats and scope" block in every report: which metric and model, what it measures, what it is blind to here, and the one sentence a reader should not over-read. Counterintuitively, naming the limitation makes the report more trusted, not less — a stakeholder who sees you flag your own blind spot believes the rest of the number.

Provenance: the lines that make a number citable

A quality number with no provenance is a rumor. The same discipline that makes a benchmark methodology citable and a golden reference reproducible applies to any report someone might act on: state the content set (which clips, what resolution, how long), the metric and exact model (VMAF default v0.6.1 / phone / 4K), the pooling method, the encoder and its settings, the tool version, and the date. The broadcast industry formalizes the "what was checked" half of this in the EBU QC test-item collection — a shared database of precisely defined quality checks so that two different QC tools report the same result for the same test (EBU QC, qc.ebu.io, accessed 2026). Your report does not need their full taxonomy, but it needs their principle: a check is only trustworthy if someone else, given your provenance, would get your number. And every comparison must be apples-to-apples — same resolution, same frames, same reference, same model, same pooling — or the comparison is void no matter how clean the chart looks.
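One way to enforce the apples-to-apples rule is to carry the provenance as a structured record and refuse to compare two results whose must-match fields differ. The sketch below is illustrative: the field names, the `MUST_MATCH` tuple, and every sample value are hypothetical placeholders, not a standard schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Provenance:
    content_set: str
    resolution: str
    frames: str
    reference: str
    metric_model: str
    pooling: str
    encoder_settings: str
    tool_version: str
    date: str


# Fields that must be identical for a comparison to be valid.
MUST_MATCH = ("resolution", "frames", "reference", "metric_model", "pooling")


def comparison_mismatches(a, b):
    """Return the must-match fields on which two results differ."""
    return [name for name in MUST_MATCH if getattr(a, name) != getattr(b, name)]


# Hypothetical records; every value here is a placeholder.
run_a = Provenance(
    content_set="10 clips, 20 s each", resolution="1920x1080",
    frames="0-499", reference="source_master", metric_model="VMAF default v0.6.1",
    pooling="mean + 5th percentile", encoder_settings="encoder A, preset X",
    tool_version="<tool version>", date="<date>",
)
run_b = replace(run_a, metric_model="VMAF phone", encoder_settings="encoder B, preset X")

problems = comparison_mismatches(run_a, run_b)
print("comparison void, differs on:" if problems else "comparable", problems)
```

The encoder and its settings are allowed to differ because they are the thing under test; everything in `MUST_MATCH` is the measuring stick and has to stay fixed.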

Common mistakes that destroy trust

The failure modes repeat across every team, and naming them is half the defense. Leading with the mean hides the broken frames; pair it with a low-percentile floor. A bare number with no confidence interval claims a precision the metric does not have; print the interval and never declare a winner on overlapping intervals. A truncated axis turns a 0.6-point difference into a visual cliff; start value-encoding charts at the scale's base and quote the Lie Factor to yourself before you ship the figure. No model named makes a VMAF score uninterpretable; default, phone, and 4K are different numbers. Cherry-picked clips flatter the encoder; report the full test set. And the dashboard-scale version of the same sin — trusting a green global average that hides a localized cliff — is covered in monitoring at scale; a report aggregated over everything can be as misleading as a single bad chart.

[Image: A five-row table of report mistakes, each paired with the honest fix and the reader-trust it protects]

Figure 3. The five report sins and their fixes. Each mistake is a way of saying more than the measurement supports; each fix is a way of saying exactly as much as it supports, no more, no less.

Where Fora Soft fits in

Fora Soft has built video streaming, OTT, conferencing, e-learning, surveillance, and telemedicine systems since 2005, and a recurring part of that work is handing a client a quality verdict they can act on without a video-engineering background. We build the report so it reads in layers — a one-line verdict for the sponsor, the summary numbers with their confidence intervals for the product owner, and the per-frame evidence and provenance for the engineer — and we hold ourselves to the honest-chart rules above, including naming the metric's blind spot for the content in question. For our own codec and encoder benchmarks, every figure carries its content set, model, settings, and date so a skeptic can check it. The goal is the one this whole section serves: a number the reader can trust enough to bet a release on.

References

  1. Recommendation ITU-T P.1401. "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models." International Telecommunication Union, 2020. Tier 1 (official standard). Defines how to evaluate and report a metric's accuracy against subjective scores — Pearson correlation, RMSE, the epsilon-insensitive RMSE*, and the outlier ratio against a 95% confidence interval. Basis for "a difference smaller than the uncertainty is not a difference." https://www.itu.int/rec/T-REC-P.1401-202001-I
  2. Netflix / VMAF project. "VMAF Confidence Interval" (Netflix/vmaf, resource/doc/conf_interval.md), accessed 2026-06-24. Tier 1 (metric-author primary). Since v1.3.7 (2018), each VMAF score can carry a 95% CI via bootstrapping; CI = score ± 1.96·stddev under normality, or the 2.5/97.5 percentiles otherwise; low scores have wider intervals. Basis for the "always show the uncertainty" section and the overlap rule. https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md
  3. Recommendation ITU-R BT.500-15. "Methodologies for the subjective assessment of the quality of television pictures." International Telecommunication Union, 2023. Tier 1 (official standard). The subjective ground truth every objective number is validated against; when the metric and a careful viewing disagree, the viewing wins. Basis for the caveats section and the measurement-honest framing. https://www.itu.int/rec/R-REC-BT.500
  4. Recommendation ITU-T P.910. "Subjective video quality assessment methods for multimedia applications." International Telecommunication Union, 2023 (10/2023). Tier 1 (official standard). Defines ACR, ACR-HR, DCR, and pair-comparison subjective methods and their reporting; the human-opinion benchmark a QC report ultimately answers to. Basis for "the eye is the ground truth." https://www.itu.int/rec/T-REC-P.910
  5. EBU QC. "Quality Control test-item collection" (qc.ebu.io), European Broadcasting Union, accessed 2026-06-24. Tier 3 (standards-body tooling/reference). A shared database of precisely defined QC tests so different tools report comparable results for the same check; a public API and templates. Basis for the provenance section's "a check is only trustworthy if someone else gets your number." https://qc.ebu.io/
  6. E. R. Tufte. "The Visual Display of Quantitative Information," 2nd ed. Graphics Press, 2001. Tier 6 (foundational reference). Defines the Lie Factor = (effect shown in graphic) / (effect in data), which should equal 1, and the principles of graphical integrity. Basis for the truncated-axis arithmetic. https://www.edwardtufte.com/book/the-visual-display-of-quantitative-information/
  7. M. Correll, E. Bertini, S. Franconeri. "Truncating the Y-Axis: Threat or Menace?" Proceedings of CHI 2020 (arXiv:1907.02035). Tier 5 (peer-reviewed). Empirical study of axis truncation; value-encoding charts (bars) must start at zero because length encodes the value. Basis for the "start value-encoding charts at the base" rule. https://arxiv.org/abs/1907.02035
  8. A. Cairo. "How Charts Lie: Getting Smarter about Visual Information." W. W. Norton, 2019. Tier 6 (reference). Charts mislead by using the wrong data, showing too little or too much, concealing uncertainty, or suggesting patterns; a chart is a visual argument that needs reliable data underneath. Basis for the visualization-integrity framing. https://www.goodreads.com/book/show/43726576-how-charts-lie
  9. Twitter/X Engineering. "Introducing VMAF percentiles for video quality measurements." 2020. Tier 4 (vendor engineering, credible deployer). The average VMAF can mislead into believing overall quality is high when the worst-performing frames have issues; report low percentiles alongside a central number. Basis for the "pick a summary number that does not lie" section. https://blog.x.com/engineering/en_us/topics/infrastructure/2020/introducing-vmaf-percentiles-for-video-quality-measurements
  10. Netflix / VMAF project. "VMAF FAQ and pooling options" (Netflix/vmaf, libvmaf docs), accessed 2026-06-24. Tier 1 (metric-author primary). libvmaf exposes pooling methods — mean, harmonic_mean, median, min, perc5/perc10/perc20 — so a report can show the floor and not only the average. Basis for Table 1. https://github.com/Netflix/vmaf/blob/master/resource/doc/faq.md