Validating Video Quality Metrics Against Human Scores

Why this matters

If you read a metric's headline correlation and assume it holds for your content, you will trust a number that was never tested on anything like what you ship. A metric is only "accurate" relative to a specific set of clips, distortions, screens, and viewers — change any of those and the agreement with human opinion can fall hard. This article is for the video engineer, encoding lead, or QA engineer who reports VMAF or SSIM numbers and needs to know how much to trust them, how the metric's makers proved it works, and how to sanity-check that proof against their own use case. It assumes you have met the metrics already; the short encoder-side overview lives in our Video Encoding section's quality-metrics overview, and this article is the deep treatment of the validation behind every one of those numbers.

The ground truth is a panel of humans

Start with the thing every metric is trying to imitate. The only direct measure of video quality is to show clips to people and ask them to rate each one, then average the ratings into a single number — the Mean Opinion Score, or MOS, the average rating a panel of viewers gave a clip, usually on a 1-to-5 scale where 5 is excellent. A close cousin, the Differential MOS (DMOS), rates how much worse the compressed clip looked than its pristine original. Both come from a carefully run subjective test, the subject of why subjective testing is the ground truth and detailed in MOS, DMOS, and the rating scales.
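As a minimal sketch of how one clip's ratings become a MOS, the snippet below averages a panel's scores and attaches an approximate 95% confidence interval. The ratings are invented for illustration, and the 1.96 multiplier is a normal-approximation assumption rather than a rule taken from this article; real subjective tests also screen out unreliable raters first, which is not shown here.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    n = len(ratings)
    mos = statistics.mean(ratings)
    # Sample standard deviation of the panel's ratings. The z = 1.96 normal
    # approximation is an assumption and gets shakier for very small panels.
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mos, half_width

# Hypothetical ratings from eight viewers for one clip (1 = bad, 5 = excellent).
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci95 = mos_with_ci(ratings)
print(f"MOS = {mos:.3f} +/- {ci95:.3f}")
```

The half-width matters later: it is the panel's own uncertainty, and it sets the bar a metric's error is judged against.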

An objective metric is a piece of software that tries to predict that MOS without the panel. That is its entire job. PSNR — Peak Signal-to-Noise Ratio, the decibel measure of pixel error explained in PSNR explained — predicts it poorly. VMAF — Video Multimethod Assessment Fusion, Netflix's machine-learned score in VMAF explained — predicts it well, because it was trained to. The difference between "poorly" and "well" is not an opinion. It is a measurement, and measuring it is what validation means.

So the question "is this metric any good?" is really the question this whole article answers: how closely do the metric's numbers track the scores real humans gave the same clips? Think of it as grading a weather forecaster. You do not grade the forecast by how confident it sounds; you wait for the actual weather and check how often the forecast matched. The metric is the forecaster, the human MOS is the actual weather, and validation is keeping score.

Figure 1. The validation pipeline. The same clips get a human MOS and a metric score; a monotonic fit aligns the scales; then four statistics grade the agreement.

A cautionary tale: the year nobody beat PSNR

Before the statistics, a story that explains why the field takes validation so seriously. In 1999, the Video Quality Experts Group (VQEG) — a body of researchers and industry engineers who run formal metric validations — collected the best full-reference quality models of the day and tested them against fresh human ratings on television content. The final report, approved in June 2000, delivered a humbling result: of the nine models submitted, seven or eight performed about the same as one another, and all of them performed about the same as plain PSNR (VQEG FRTV Phase I Final Report, 2000).

Sit with that. The most sophisticated perceptual algorithms of the era were, statistically, no better at predicting human opinion than a formula that just counts pixel errors. The International Telecommunication Union concluded the accuracy was not good enough to standardize any of them.

There is a second lesson buried in that report, and it matters more than the first. VQEG later identified why the test could not separate the models: the test clips spanned too narrow a range of quality, so there was not enough spread in the human scores to discriminate a good metric from a bad one (VQEG FRTV Phase I, project summary). A validation is only as discriminating as the content you validate on — a point that comes back hard when we ask whether a correlation transfers to your footage. The discipline of metric validation, and the modern metrics that finally beat PSNR, grew directly out of learning to run that test properly.

The three questions every validation asks

A good validation grades a metric on three separate questions. The VQEG framework and the controlling standard, ITU-T P.1401 (2020) — "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models" — name them as monotonicity, accuracy, and consistency. Keep them separate; a metric can pass one and fail another.

Monotonicity — does it put clips in the right order? This is the most forgiving question. Forget the exact numbers; just ask whether a clip the metric scored higher really did look better to people. If the metric ranks ten clips in the same order people did, it is perfectly monotonic, even if its numbers are on a wildly different scale. The statistic here is the Spearman Rank-Order Correlation Coefficient (SROCC) — a correlation computed on the ranks of the scores rather than the scores themselves. SROCC runs from −1 to 1; 1 means identical ordering.

Accuracy — how close are the numbers? A stricter question. Here you care whether a metric score of, say, 80 lands where a MOS of 4.0 should be, not just whether bigger means better. Two statistics share this job. The Pearson Linear Correlation Coefficient (PCC) — the everyday correlation, measuring how well the points hug a straight line — runs −1 to 1, with 1 meaning a perfect linear relationship. RMSE (Root-Mean-Square Error) — the typical size of the gap between predicted and actual score, in the score's own units — is 0 when the prediction is perfect and grows as errors grow.

Consistency — how often does it badly disagree? The most practical question. A metric can have a fine average error and still blow one clip in twenty so badly that the mistake is dangerous. The Outlier Ratio (OR) captures this: the fraction of clips where the metric's error was larger than the humans' own uncertainty on that clip — specifically larger than the 95% confidence interval of the MOS (ITU-T P.1401, 2020). A low outlier ratio means the metric rarely surprises you.
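The outlier ratio is simple enough to sketch in a few lines. The example assumes the metric scores have already been mapped onto the MOS scale and that each clip comes with the half-width of its 95% confidence interval; all three lists are hypothetical values chosen for illustration.

```python
def outlier_ratio(predicted_mos, mos, ci95_half_width):
    """Fraction of clips whose prediction error exceeds the MOS 95% CI."""
    outliers = sum(
        1
        for p, m, ci in zip(predicted_mos, mos, ci95_half_width)
        if abs(p - m) > ci
    )
    return outliers / len(mos)

# Hypothetical values: metric scores already mapped onto the MOS scale,
# the panel's MOS, and the half-width of each clip's 95% confidence interval.
predicted = [1.99, 2.72, 3.44, 4.02, 4.53]
mos = [2.0, 2.8, 3.3, 4.0, 4.6]
ci95 = [0.12, 0.10, 0.10, 0.15, 0.11]

print(outlier_ratio(predicted, mos, ci95))  # 0.2: only the third clip misses
```

Note that the same prediction errors would give a different ratio against a noisier panel with wider intervals, which is why the statistic says as much about the subjective test as about the metric.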

Figure 2. The three questions a validation asks, and the statistic that answers each. A metric can ace one and fail another, so all three are reported.

The step everyone forgets: the fitting curve

Here is the subtlety that separates a careful validation from a sloppy one. You cannot meaningfully compare a VMAF score (0–100) with a MOS (1–5) as raw numbers: an RMSE between them is meaningless because the scales do not line up, and although Pearson correlation shrugs off a purely linear change of scale, it is dragged down by any curvature in the relationship. A metric could predict human opinion perfectly while living on a completely different number line. So before computing accuracy statistics, you first stretch and bend the metric's scores onto the MOS scale with a smooth, order-preserving curve.

That curve is a monotonic mapping, and the standard choice is a five-parameter logistic function recommended by the validation literature and ITU-R BT.500 (the subjective-assessment standard). It looks like this:

Q(s) = γ1 · logistic(γ2 · (s − γ3)) + γ4 · s + γ5

You fit the five γ parameters so the mapped metric scores Q(s) sit as close as possible to the human MOS, then compute PCC and RMSE on the mapped values. The fit only stretches the scale; a monotonic fit never reorders clips. That is why SROCC needs no fitting at all, since ranks survive any monotonic stretch, while PCC and RMSE are reported after it. If you ever see a Pearson number reported without mention of a fit, treat it with suspicion: the author may have correlated raw VMAF against raw MOS, which understates the agreement whenever the relationship is curved.
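To make the mapping step concrete without a nonlinear solver, the sketch below swaps in the simplest order-preserving fit, a least-squares straight line, in place of the five-parameter logistic. That substitution is a simplification for illustration, not the procedure described above; fitting the logistic itself would need a nonlinear least-squares routine. The scores and MOS values are hypothetical, and the RMSE here divides by N with no degrees-of-freedom correction.

```python
import math

def fit_linear_mapping(scores, mos):
    """Least-squares line mos ~ a * score + b (order-preserving when a > 0)."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(mos) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, mos))
    sxx = sum((x - mean_x) ** 2 for x in scores)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def rmse(predicted, actual):
    """Root-mean-square error, dividing by N (no degrees-of-freedom correction)."""
    squared = sum((p - t) ** 2 for p, t in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical data: metric scores on a 0-100 scale, panel MOS on a 1-5 scale.
scores = [60, 70, 80, 88, 95]
mos = [2.0, 2.8, 3.3, 4.0, 4.6]

a, b = fit_linear_mapping(scores, mos)
mapped = [a * s + b for s in scores]  # metric scores moved onto the MOS scale
print(f"mapping: MOS ~ {a:.4f} * score + {b:.3f}")
print(f"RMSE after mapping: {rmse(mapped, mos):.3f} MOS points")
```

Only after the mapping is the RMSE expressed in MOS points and therefore readable; computed on the raw scores it would mostly measure the difference between the two scales.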

A practical consequence: SROCC and PCC answer genuinely different questions, and a metric can score high on one and lower on the other. High SROCC with lower PCC means "the ordering is right but the spacing is off" — common, and usually fine for ranking encoders. The worked example below shows exactly that pattern.

Computing the agreement by hand

Numbers make this concrete. Suppose you measured five compressed clips, each with a VMAF score and a human MOS from a panel:

Clip	VMAF (0–100)	Human MOS (1–5)
A	60	2.0
B	70	2.8
C	80	3.3
D	88	4.0
E	95	4.6

Pearson PCC. The formula is the covariance of the two columns divided by the product of their standard deviations:

PCC = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]

First the means: x̄ = (60+70+80+88+95) / 5 = 78.6, and ȳ = (2.0+2.8+3.3+4.0+4.6) / 5 = 3.34. Now each clip's deviations and their products:

Clip   dx=x−78.6   dy=y−3.34   dx·dy     dx²       dy²
A       −18.6       −1.34       24.92     345.96    1.796
B        −8.6       −0.54        4.64      73.96    0.292
C         1.4       −0.04       −0.06       1.96    0.002
D         9.4        0.66        6.20      88.36    0.436
E        16.4        1.26       20.66     268.96    1.588
sum                              56.38     779.20    4.112

Plug the three sums in:

PCC = 56.38 / √(779.20 × 4.112)
    = 56.38 / √3204.07
    = 56.38 / 56.60
    ≈ 0.996

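A PCC of 0.996, very close to 1, says the metric and the panel move together almost linearly. This hand calculation uses the raw scores and skips the fitting step for brevity; a purely linear mapping would leave the PCC unchanged, because Pearson correlation is unaffected by a linear change of scale.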
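The hand calculation is easy to double-check in code. This short sketch recomputes the same three sums from the five clips in the table using only the standard library; the function name is an illustrative choice.

```python
import math

def pearson(x, y):
    """Pearson linear correlation from the sums of deviations and products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The five clips A-E from the worked example.
vmaf = [60, 70, 80, 88, 95]
mos = [2.0, 2.8, 3.3, 4.0, 4.6]

pcc = pearson(vmaf, mos)
print(f"PCC = {pcc:.3f}")  # PCC = 0.996
```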

Spearman SROCC. Replace each value by its rank. VMAF ranks A–E are 1, 2, 3, 4, 5; the MOS ranks are also 1, 2, 3, 4, 5, because the clip the metric scored highest is the one people scored highest, all the way down. The rankings are identical, so SROCC = 1.0 exactly.
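The rank step can be sketched the same way: convert each column to ranks, then take the Pearson correlation of the ranks. The tie handling shown here, where tied values share the average of their positions, is a common convention assumed for the sketch; the five example clips contain no ties, so it does not affect the result.

```python
import math

def average_ranks(values):
    """Rank values from 1 (lowest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

vmaf = [60, 70, 80, 88, 95]
mos = [2.0, 2.8, 3.3, 4.0, 4.6]
srocc = spearman(vmaf, mos)
print(srocc)  # 1.0
```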

Read the two results together. SROCC is a perfect 1.0, because the metric ordered every clip the way humans did, while PCC is 0.996, a hair under 1, because the relationship is not a perfectly straight line: A to B and B to C are both 10 VMAF points apart, yet the MOS steps are 0.8 and 0.5. That gap between a perfect rank correlation and a near-perfect linear one is the everyday signature of monotonicity passing while linearity is slightly imperfect. With five tidy points the difference is tiny; on a real database of hundreds of clips it is where the interesting disagreements live.

The standard test databases

You cannot validate a metric on five clips you made up. You need a database: a large set of source videos, each degraded in many controlled ways, every degraded clip carrying a MOS from a real subjective test. A few public databases are the field's shared yardsticks.

The LIVE Video Quality Database, built at the University of Texas at Austin, is the classic reference: ten source scenes, 150 distorted videos, each rated by around 29 viewers, with DMOS computed after screening out unreliable raters (Seshadrinathan, Soundararajan, Bovik, Cormack, IEEE Transactions on Image Processing, 2010). Netflix publishes the NFLX dataset used to develop VMAF, and other labs maintain CSIQ, the TID series, and the 4K AVT-VQDB-UHD-1 set used to validate the bitstream models in beyond VMAF. Each pairs human scores with the clips that earned them, which is exactly what a validation needs.

On its home databases, VMAF posts strong numbers. Netflix reported a Pearson correlation of 0.963 on its own NFLX-TEST set and 0.939 on the VQEGHD3 set (reported in Rassool, RealNetworks, 2017). An independent reproduction by RealNetworks, encoding a separate 4K set with a different codec and gathering fresh subjective scores under ITU-R BT.500 double-stimulus conditions, measured a Pearson correlation of 0.948 between VMAF and DMOS — confirmation, from outside Netflix, that the metric tracks human opinion well (Rassool, RealNetworks, 2017). These are the kinds of numbers you see quoted as a metric's "accuracy."

One rule governs all of them: a metric must be validated on content it was never trained on. VMAF, P.1204.3, and every learned metric tune their parameters on some clips; testing them on those same clips inflates the score. VQEG made this a formal condition — models trained on a dataset must not be compared against models that were not, because the comparison is meaningless (VQEG FRTV Phase I, project summary). When you read a correlation, the first question is which clips produced it, and whether the metric had seen them before.
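One practical way to respect the held-out rule is to split a database by source scene rather than by individual clip, so that no scene contributes to both tuning and testing. The sketch below assumes a hypothetical record layout with `clip`, `source`, and `mos` fields; none of these names come from a real database format.

```python
def split_by_source(clips, held_out_sources):
    """Split clips so that no source scene appears in both train and test."""
    train = [c for c in clips if c["source"] not in held_out_sources]
    test = [c for c in clips if c["source"] in held_out_sources]
    return train, test

# Hypothetical database: each distorted clip records its source scene.
clips = [
    {"clip": "park_2mbps", "source": "park", "mos": 3.1},
    {"clip": "park_6mbps", "source": "park", "mos": 4.2},
    {"clip": "crowd_2mbps", "source": "crowd", "mos": 2.4},
    {"clip": "crowd_6mbps", "source": "crowd", "mos": 3.9},
    {"clip": "river_2mbps", "source": "river", "mos": 2.9},
    {"clip": "river_6mbps", "source": "river", "mos": 4.4},
]

train, test = split_by_source(clips, held_out_sources={"crowd"})
print([c["clip"] for c in test])  # ['crowd_2mbps', 'crowd_6mbps']
```

Splitting clip by clip instead would let the same scene leak into both halves, and the test correlation would flatter the metric.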

Why a headline correlation does not transfer to your content

Now the honest part, and the reason this article exists. "VMAF correlates 0.95 with human opinion" is true, conditional, and frequently misused. That 0.95 was measured on a particular database — particular scenes, particular distortions, a particular viewing setup, a particular pool of viewers. Your content is none of those things. The correlation is a property of the metric and the test set together, not of the metric alone.

Several forces pull the agreement down on unfamiliar content. The clearest is content type: a metric trained on professionally shot, compression-degraded streaming video has never learned the artifacts of screen recordings, animation, heavy film grain, or shaky user-generated footage, and its predictions on those drift. The blind-spot catalogue in where objective metrics lie is the long version of this paragraph.

A second force is range restriction — the same trap that defeated VQEG's first test. Correlation needs spread. If every clip you measure sits between MOS 4.0 and 4.6, the human scores barely vary, and even a fine metric will post a weak correlation simply because there is almost nothing to track. A low correlation on a narrow-quality set is not always the metric's fault; it can be your test design.
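Range restriction is easy to reproduce with simulated data. In the sketch below, a made-up metric tracks MOS linearly with Gaussian noise; the correlation is computed once over the full 1-to-5 range and once over only the clips between MOS 4.0 and 4.6. Every number here is synthetic, and the noise level is an arbitrary assumption chosen to make the effect visible.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated panel: 2000 clips with MOS spread over the full 1-5 scale, and a
# metric that follows MOS linearly plus Gaussian noise (all values synthetic).
mos = [random.uniform(1.0, 5.0) for _ in range(2000)]
metric = [20.0 * m + random.gauss(0.0, 8.0) for m in mos]

# Keep only the clips a "high quality only" test set would contain.
narrow = [(s, m) for s, m in zip(metric, mos) if 4.0 <= m <= 4.6]
narrow_metric = [s for s, _ in narrow]
narrow_mos = [m for _, m in narrow]

r_full = pearson(metric, mos)
r_narrow = pearson(narrow_metric, narrow_mos)
print(f"full range:   r = {r_full:.2f} over {len(mos)} clips")
print(f"MOS 4.0-4.6:  r = {r_narrow:.2f} over {len(narrow)} clips")
```

The metric and its noise are identical in both cases; only the spread of the human scores changed, and the correlation drops with it.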

A third is the viewing condition and the panel themselves. A correlation measured on a calibrated reference monitor at a fixed distance with screened expert viewers may not survive a living-room television, a phone on a train, or a crowdsourced panel rating clips on whatever device they own. And the RealNetworks reproduction surfaced a directional bias worth remembering: across their 4K set, VMAF over-estimated subjective quality about 85% of the time, with an RMSE of 12.7 VMAF points (Rassool, RealNetworks, 2017) — the metric was usually optimistic, not symmetric. A single correlation number hides that kind of lean entirely; the confidence intervals discussed in VMAF in depth are how you put it back.
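A directional lean like this is cheap to check alongside any correlation. The sketch below reports the mean signed error and the share of clips the metric over-estimated, assuming predictions already mapped onto the MOS scale; the numbers are hypothetical and are not the RealNetworks data.

```python
def signed_bias(predicted_mos, mos):
    """Return (mean signed error, fraction of clips over-estimated)."""
    errors = [p - m for p, m in zip(predicted_mos, mos)]
    mean_error = sum(errors) / len(errors)
    over_fraction = sum(1 for e in errors if e > 0) / len(errors)
    return mean_error, over_fraction

# Hypothetical mapped predictions and panel MOS for five clips.
predicted = [3.1, 4.2, 2.9, 4.8, 3.6]
mos = [2.8, 4.0, 3.0, 4.5, 3.3]

mean_error, over_fraction = signed_bias(predicted, mos)
print(f"mean signed error: {mean_error:+.2f} MOS")
print(f"over-estimated on {over_fraction:.0%} of clips")
```

A mean signed error near zero with an over-estimation share near 50% suggests symmetric errors; anything far from that is a lean the correlation alone will not show.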

Common mistake: quoting one correlation as if it were the metric's verdict everywhere. A line like "we use VMAF because it correlates 0.95 with MOS" is half a sentence. Correlation with which database, under what fitting, against what viewers, over what quality range — and is the gap between 0.95 and a rival's 0.93 even statistically significant? ITU-T P.1401 specifies a significance test (a Fisher z-transformation of the correlations) precisely because two close correlations are often indistinguishable given the sample size. Report the database, the fit, and the confidence interval, or report nothing.
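To see why 0.95 against 0.93 may not mean much, here is a rough sketch of a Fisher z comparison. It treats the two correlations as if they came from independent samples, which is only an approximation when both metrics were scored on the same clips, and it is written in the spirit of the test rather than as the P.1401 procedure verbatim. The sample size of 150 clips is a hypothetical choice.

```python
import math

def fisher_z_statistic(r1, n1, r2, n2):
    """z statistic for the difference of two correlations (independent samples)."""
    z1 = math.atanh(r1)  # Fisher z-transformation of each correlation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error

# Hypothetical case: two metrics, each validated on 150 clips.
z = fisher_z_statistic(0.95, 150, 0.93, 150)
significant = abs(z) > 1.96  # two-sided test at the 5% level
print(f"z = {z:.2f}, significant at 5%: {significant}")
```

With these assumed numbers the statistic stays under 1.96, so the two correlations cannot be told apart; a larger database would be needed before the 0.02 gap meant anything.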

The validation statistics at a glance

Put the four core statistics in one place. The columns that matter most are the last two: what each one actually measures, and where it can mislead you if read alone.

Statistic	Range	What it measures	Where it lies / blind spot
SROCC (Spearman rank)	−1 to 1	Monotonicity — does the metric order clips like humans?	Ignores spacing; a metric can rank right yet be off-scale; insensitive to constant bias
PCC (Pearson linear)	−1 to 1	Accuracy of the linear fit after a monotonic mapping	Needs the fitting step; sensitive to a few outliers; range-restriction deflates it
RMSE	0 upward (MOS units)	Typical size of the prediction error	Averages away rare large misses; depends on the score scale, so not comparable across metrics
Outlier Ratio	0 to 1	Consistency — fraction of errors beyond the human 95% CI	Depends on how tight the subjective test's confidence intervals are

Read it as a panel of judges, not a single verdict. A trustworthy validation reports several of these together, with the database and the fitting method named, because each one alone has a way to flatter or punish a metric unfairly.

Figure 3. The same metric, three test sets. The headline correlation comes from the home database; on unfamiliar content the agreement falls, which is why "your content" is the bar that matters.

How to read a metric's validation claim

Putting it together, here is the skeptic's checklist for any "this metric correlates X with human opinion" claim. First, which database produced the number, and is that content anything like yours? Second, was the metric trained on it? If so, the number is inflated. Third, which statistic is it? A rank correlation forgives scale errors that a Pearson number would expose. Fourth, was a fit applied before the Pearson or RMSE figure? Fifth, what was the quality range? A high correlation over a wide range is far more convincing than the same number over a narrow one. Sixth, is the difference from a rival significant, or just noise? A claim that survives those six questions is worth trusting; one that ducks them is marketing. The decision of which metric to adopt once you trust the numbers is the subject of choosing the right metric for the job.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the reason we treat validation as a first-class step is that our content rarely matches the databases the public metrics were tuned on. Surveillance footage, conferencing video, and user-generated e-learning clips carry artifacts and quality ranges that a streaming-trained metric was never validated against, so a borrowed correlation tells us little. When a number matters, we confirm the metric against a small subjective panel on the client's own content before we trust it for a release gate, and we record the database, the fitting method, and the confidence interval behind every figure in our benchmark methodology. That is how a quality number stays a measurement rather than a hopeful guess.

References

Recommendation ITU-T P.1401, "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union, January 2020 (approved 2020-01-13). Tier 1 (official standard). The controlling source for how a metric is graded: Pearson correlation (linearity), RMSE and epsilon-insensitive RMSE* (accuracy), the outlier ratio (consistency), and the Fisher z-transformation significance test for comparing two correlations. https://www.itu.int/rec/T-REC-P.1401-202001-I/en
Video Quality Experts Group, "Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment (FRTV Phase I)," VQEG, March/June 2000. Tier 1 (validation-authority primary). The first formal metric validation: most of the nine full-reference models submitted were found statistically equivalent to one another and to PSNR, and the ITU judged the accuracy insufficient to standardize. https://www.vqeg.org/projects/frtv-phase-i/
Video Quality Experts Group, "Full Reference Television Phase I (FRTV-I)" project summary, VQEG. Tier 1 (validation-authority primary). Documents the held-out-content rule (models trained on the data must not be compared to those that were not) and the lesson that the test lacked discrimination power because its Hypothetical Reference Circuits spanned too narrow a quality range. https://www.vqeg.org/projects/frtv-phase-i/
Recommendation ITU-R BT.500 (BT.500-13 and later editions), "Methodologies for the subjective assessment of the quality of television pictures," International Telecommunication Union Radiocommunication Sector. Tier 1 (official standard). The subjective-assessment methodology that produces MOS/DMOS ground truth, and the source of the monotonic (five-parameter logistic) mapping applied before computing Pearson correlation and RMSE. https://www.itu.int/rec/R-REC-BT.500
Recommendation ITU-T P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, 2023. Tier 1 (official standard). The companion subjective-test methodology (ACR, DCR, pair comparison) whose Mean Opinion Scores are the human ground truth every objective metric is validated against. https://www.itu.int/rec/T-REC-P.910
Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, M. Manohara, "Toward A Practical Perceptual Video Quality Metric," Netflix Technology Blog, June 2016. Tier 1 (metric-author defining work). VMAF's design: an SVM fusion of VIF, DLM, and a motion feature trained on subjective MOS, and the methodology Netflix used to validate it against held-out human scores. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
K. Seshadrinathan, R. Soundararajan, A. C. Bovik, L. K. Cormack, "Study of Subjective and Objective Quality Assessment of Video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, June 2010. Tier 5 (peer-reviewed/institutional). The LIVE Video Quality Database: 10 sources, 150 distorted videos, ~29 subjects, DMOS computed after subject rejection — a canonical validation yardstick and its methodology. https://live.ece.utexas.edu/research/quality/live_video.html
R. Rassool, "VMAF Reproducibility: Validating a Perceptual Practical Video Quality Metric," RealNetworks (IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 2017). Tier 5 (institutional, independent). An outside reproduction reporting VMAF–DMOS Pearson 0.948 on a 4K set (RMSE 12.7), the Netflix-reported 0.963 (NFLX-TEST) and 0.939 (VQEGHD3), and the finding that VMAF over-estimated subjective quality ~85% of the time. https://realnetworks.com/sites/default/files/vmaf_reproducibility_ieee.pdf
L. Krasula, K. Fliegel, P. Le Callet, M. Klíma, "On the accuracy of objective image and video quality models: New methodology for performance evaluation," IEEE QoMEX 2016. Tier 5 (peer-reviewed/institutional). Argues that raw correlation can mislead and proposes a significance-aware comparison based on a metric's ability to distinguish pairs of stimuli — the modern complement to the P.1401 statistics. https://nantes-universite.hal.science/hal-01395440/document
Netflix, "VMAF datasets and reproducibility" (resource/doc/datasets.md), Netflix/vmaf GitHub repository. Tier 3 (first-party tooling). The public NFLX dataset and the repository's documented procedure for training, testing, and verifying VMAF against subjective scores. https://github.com/Netflix/vmaf/blob/master/resource/doc/datasets.md
