Why this matters
If you ship video, sooner or later a number decides something: which encoder to buy, whether a new ladder looks worse, whether a quality regression is real. Most of the time you read that number off an objective metric, and most of the time that is fine — but the metric is only trustworthy because someone, somewhere, checked it against human opinion on content like yours. This article is for the video engineer, encoding lead, QA engineer, or product owner who needs to know where the ground truth comes from, when a metric is standing on solid validation and when it is guessing, and how to run the human test that settles a question a metric cannot. It is the opening article of our subjective-testing block; the encoder-operator's one-screen version lives in the Video Encoding section's subjective-testing overview, and everything here is the deep treatment behind it.
Every metric is a guess about people
Start with the thing that is easy to forget when you are staring at a VMAF score. Video quality is not a physical property of a file the way its bitrate or resolution is. Quality is a judgment that happens in a human head — whether the picture looks good to the person watching it. There is no instrument that measures "looks good" directly, the way a thermometer measures temperature. The only direct reading is to ask people.
So the field built proxies. PSNR (Peak Signal-to-Noise Ratio), the decibel measure of how much a compressed frame differs from the original pixel by pixel, is a proxy; see PSNR explained. SSIM, the Structural Similarity index that compares the structure of two frames, is a better proxy; see SSIM explained. VMAF (Video Multimethod Assessment Fusion), Netflix's machine-learned score, is a proxy trained directly to imitate human ratings; see VMAF explained. Each one is a piece of software whose entire job is to predict what a room full of people would have said, without the room.
That makes the room the ground truth. "Ground truth" is the borrowed surveying term for the real measurement that everything else is checked against — the reading you trust when the estimates disagree. In video quality, the ground truth is a carefully run panel of human viewers, and every metric is judged by how closely it reproduces what that panel said. Think of an objective metric as a weather forecaster and the human panel as the actual weather: you do not grade the forecast by how confident it sounds, you wait for the real sky and keep score.
Figure 1. Every objective metric is a proxy. The human panel sits underneath all of them as the ground truth each one is trained and validated against.
What "quality" means when you measure it properly
Before going further, pin down the number the panel produces. When viewers rate a clip, you collect one score per viewer and average them. That average is the Mean Opinion Score, or MOS — the mean rating a panel of viewers gave a clip, usually on a five-point scale where 5 is excellent and 1 is bad. A MOS of 4.3 means the average viewer placed the clip between "good" and "excellent." It is the central output of almost every subjective test, and the subject of its own article, MOS, DMOS, and the rating scales.
A MOS is never reported alone. People disagree, so the ratings spread out, and that spread tells you how much to trust the average. A MOS of 4.3 from viewers who all rated 4 or 5 is solid; a MOS of 4.3 from viewers split between 2 and 5 is barely meaningful. The spread is captured as a confidence interval, the small ± range attached to every MOS, and the arithmetic of it — worked below — is what separates a real measurement from a show of hands.
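To make the point about spread concrete, here is a minimal sketch using two hypothetical 30-viewer panels (the ratings are invented for illustration, not taken from any real test). Both panels average 4.3, but one agrees and the other is split between 2 and 5.

```python
from statistics import mean, stdev

# Hypothetical ratings on the five-point scale, one rating per viewer.
agreeing_panel = [4] * 21 + [5] * 9   # everyone rated 4 or 5
split_panel = [5] * 23 + [2] * 7      # viewers split between 2 and 5


def summarize(ratings):
    """Return (MOS, sample standard deviation) for one clip's ratings."""
    return mean(ratings), stdev(ratings)


mos_agree, sd_agree = summarize(agreeing_panel)
mos_split, sd_split = summarize(split_panel)

# Same MOS, very different spread.
print(f"agreeing panel: MOS={mos_agree:.2f}, SD={sd_agree:.2f}")
print(f"split panel:    MOS={mos_split:.2f}, SD={sd_split:.2f}")
```

The two averages are identical, yet the split panel's standard deviation is more than twice as large, which is why the MOS alone cannot tell you how much to trust it.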
Why every metric ultimately bows to a panel
Here is the relationship that defines this whole section. An objective metric does not have independent authority. It earns trust in exactly one way: by being shown to agree with human ratings on a database of clips, a process called validation and covered in depth in how objective metrics are validated against human scores. VMAF correlates well with people because Netflix trained it on subjective scores and then checked it against held-out ones. PSNR correlates poorly because it was never built to match perception in the first place. The difference between "well" and "poorly" is not opinion — it is measured against panels of humans.
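As a rough illustration of what "agreeing with human ratings" means in practice, the sketch below scores two hypothetical metrics against panel MOS using Pearson (linear) and Spearman (rank-order) correlation. The names `metric_a` and `metric_b` and all the numbers are assumptions made up for this example; a real validation uses a full database of clips and the additional statistics laid out in ITU-T P.1401.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest). Assumes no tied values, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: one panel MOS and two metric scores per clip.
panel_mos = [1.8, 2.5, 3.1, 3.6, 4.2, 4.6]
metric_a = [35, 52, 60, 74, 85, 93]   # orders the clips exactly as the panel did
metric_b = [30, 28, 36, 33, 38, 41]   # swaps two pairs of clips

for name, scores in (("metric_a", metric_a), ("metric_b", metric_b)):
    print(name, round(pearson(panel_mos, scores), 3),
          round(spearman(panel_mos, scores), 3))
```

In this toy data the metric that ranks the clips the way the panel did gets a rank correlation of 1.0, and the one that swaps clips scores lower. The panel's numbers are the fixed reference in both cases.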
This has a sharp consequence that surprises engineers new to measurement: when a metric and a properly run subjective test disagree, the test is right and the metric is wrong. Not "differently valid" — wrong, on that content. The metric is a model, and every model has inputs where it fails. The classic case is a smooth sky that compression breaks into visible steps, called banding: viewers see it instantly and rate the clip down, while PSNR and even VMAF can barely register it, because the pixel-level change is tiny. The metric did not find a second opinion; it missed the artifact. The panel is the appeal court with no higher court above it. Which metrics fail on which content is catalogued in where objective metrics lie, and the deeper split between the two kinds of question is in subjective vs objective quality.
What makes it a test, and not a vibe check
If asking humans is the ground truth, why not just ask your team? Because the value of a subjective test is not in the asking — it is in the controls that make the answer hold up. Strip the controls away and you are left with an anecdote. Six things turn a show of hands into a measurement, and international standards specify each one. The two that govern video are ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," and ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images."
The first control is the right viewers. A subjective test uses non-expert observers — people who do not work on video quality and will not hunt for compression artifacts out of habit. Your encoding team is the worst possible panel: they know what to look for, they know which encode is the new one, and they want it to win. ITU-T P.910 §10 calls for naive subjects screened for normal vision, precisely to keep the trained eye and the rooting interest out of the data.
The second is enough of them. One person's opinion has no error bar. The standards put the usable range at roughly 4 to 40 observers, with a practical floor around 15 for a stable result (ITU-T P.910, 2023; ITU-R BT.500-15). Five colleagues are below the floor by design — the arithmetic below shows exactly how much that costs you.
The third is a controlled environment. Viewing distance, display calibration, and room lighting all change what people see; BT.500-15 specifies them so that a score means the same thing from one session to the next.
The fourth is a defined method and scale: a fixed procedure for showing clips and a labeled rating scale (the five-point Absolute Category Rating scale, or a degradation scale), so every viewer answers the same question the same way.
The fifth is randomization: the order of clips is shuffled per viewer so that fatigue and learning do not pile onto whichever encode happened to go last.
The sixth is screening and statistics: you remove viewers whose ratings are internally inconsistent, then report each MOS with its confidence interval.
Miss any one of these and the result stops being a measurement.
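The randomization control is the easiest of the six to automate. The sketch below is one possible way to do it, assuming hypothetical clip and viewer identifiers: each viewer gets an independently shuffled playlist, and a fixed seed keeps the plan reproducible for the test report. Real test plans often add constraints on top of a plain shuffle, such as not showing the same source twice in a row, which this sketch does not attempt.

```python
import random


def presentation_orders(clips, viewer_ids, seed=2024):
    """Return {viewer_id: shuffled clip list}, one independent order per viewer.

    The seed is an arbitrary illustrative value; fixing it makes the
    session plan reproducible.
    """
    rng = random.Random(seed)
    orders = {}
    for viewer in viewer_ids:
        order = list(clips)   # copy so the master list is left untouched
        rng.shuffle(order)    # in-place shuffle of this viewer's playlist
        orders[viewer] = order
    return orders


# Hypothetical clip names: two sources, each encoded two ways.
clips = ["src1_encA", "src1_encB", "src2_encA", "src2_encB"]
viewers = [f"viewer_{i:02d}" for i in range(1, 25)]

orders = presentation_orders(clips, viewers)
for viewer in viewers[:3]:
    print(viewer, orders[viewer])
```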
Common mistake: "I showed it to the team and they liked it." This is the single most common fake subjective test. It fails at least four of the six controls at once — expert viewers instead of naive ones, far too few of them, no controlled environment, and no scale or statistics — and on top of that the team usually knows which clip is the new encode, which biases the answer before anyone looks. It is fine as a smoke test for "is anything obviously broken." It is worthless as evidence that encode B is better than encode A, because, as the next section shows with numbers, five biased viewers cannot separate two conditions that are truly close.
Figure 2. The same question — "is B better than A?" — asked two ways. Only the right-hand column produces an answer you can defend.
The arithmetic: why five viewers cannot settle it
Numbers make the point unarguable. Suppose you are choosing between two encoder settings, A and B, and you want to know if B really looks better. You run a proper Absolute Category Rating test: each viewer watches each clip and rates it 1 to 5, and you average to a MOS. Say the results come out like this:
| Condition | MOS (1–5) | Std. dev. of ratings | Viewers (N) |
|---|---|---|---|
| Encode A | 3.40 | 0.90 | 24 |
| Encode B | 3.80 | 0.90 | 24 |
The means differ by 0.4, but means alone prove nothing — you need the confidence interval. ITU-R BT.500 gives the 95% confidence interval for a MOS as the mean plus or minus a margin:
ε = 1.96 × ( S / √N )
where S is the standard deviation of the ratings and N is the number of viewers. The 1.96 is the multiplier that brackets 95% of a normal distribution. Plug in encode B's numbers:
ε = 1.96 × ( 0.90 / √24 )
= 1.96 × ( 0.90 / 4.899 )
= 1.96 × 0.1837
= 0.36
So encode B scores 3.80 ± 0.36, a 95% interval of [3.44, 4.16]. Encode A, with the same spread and count, also gives ε = 0.36, an interval of [3.04, 3.76]. The two intervals overlap in the band from 3.44 to 3.76, but overlapping confidence intervals do not by themselves show that the encodes are equal. What they show is that each average is pinned down to about a third of a point, tight enough that the real question is now answerable. Because the same viewers rated both encodes, a paired significance test, which compares each viewer's two scores and cancels their personal harshness, is more sensitive than the two separate intervals suggest. Whether it confirms that B is the better encode depends on how consistently individual viewers preferred B, but a 0.4-point gap measured this precisely is within reach of such a test. The point is the precision: ±0.36 is a sharp enough estimate to resolve a small difference.
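The gain from pairing can be shown with a small sketch. The ratings below are hypothetical (twelve invented viewers, not the data behind the table above), and the critical value is the standard two-sided 95% Student's t value for 11 degrees of freedom. The same scores are compared twice: once as two independent groups, once viewer by viewer.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical ACR ratings: the same 12 viewers rated both encodes.
ratings_a = [3, 4, 2, 4, 3, 5, 3, 4, 2, 3, 4, 3]
ratings_b = [4, 4, 3, 4, 4, 5, 3, 5, 3, 3, 4, 4]
n = len(ratings_a)

# Unpaired view: treat the two columns as independent groups.
unpaired_se = sqrt(variance(ratings_a) / n + variance(ratings_b) / n)
unpaired_t = (mean(ratings_b) - mean(ratings_a)) / unpaired_se

# Paired view: work on each viewer's own difference, which cancels
# how harsh or generous that viewer is overall.
diffs = [b - a for a, b in zip(ratings_a, ratings_b)]
paired_t = mean(diffs) / (stdev(diffs) / sqrt(n))

T_CRIT_DF11 = 2.201  # two-sided 95% critical value, n - 1 = 11 degrees of freedom

print(f"unpaired t = {unpaired_t:.2f}")
print(f"paired t   = {paired_t:.2f} (critical value {T_CRIT_DF11})")
print("paired test significant:", paired_t > T_CRIT_DF11)
```

With this invented data the unpaired statistic stays well under 2 while the paired statistic clears the critical value, because most of the variance in the raw scores comes from viewers differing from each other, not from the encodes. Real panels will not always be this consistent.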
Now run the same comparison the "I showed it to the team" way, with five viewers instead of twenty-four. Keep everything else identical:
ε = 1.96 × ( 0.90 / √5 )
= 1.96 × ( 0.90 / 2.236 )
= 1.96 × 0.4025
= 0.79
Encode B is now 3.80 ± 0.79, an interval of [3.01, 4.59], and encode A is 3.40 ± 0.79, an interval of [2.61, 4.19]. The two intervals overlap almost completely. The honest conclusion from five viewers is we cannot tell A and B apart — not because the encodes are identical, but because the test was too small to resolve the difference. And this is the optimistic version: the 1.96 multiplier is the normal approximation BT.500 uses by convention, but with only five viewers the stricter small-sample multiplier is the Student's t-value of about 2.78, which widens encode B's interval to ±1.12 and dissolves the comparison entirely. The full statistics of this — outlier rejection, significance testing, and how many viewers you really need — are the subject of the statistics of subjective data.
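The two worked examples can be reproduced in a few lines. This is a sketch of the arithmetic above, not an implementation of the full BT.500 analysis; the t critical values are hard-coded from a standard Student's t table for the two panel sizes used here, and the function names are illustrative.

```python
from math import sqrt

Z_95 = 1.96                   # normal-approximation multiplier used in the text
T_95 = {4: 2.776, 23: 2.069}  # two-sided 95% Student's t values, keyed by N - 1


def half_width(sd, n, multiplier):
    """Half-width of the confidence interval: multiplier * S / sqrt(N)."""
    return multiplier * sd / sqrt(n)


def interval(mos, eps):
    """Return the (low, high) interval around a MOS, rounded to 2 decimals."""
    return (round(mos - eps, 2), round(mos + eps, 2))


SD = 0.90
MOS = {"A": 3.40, "B": 3.80}

for n in (24, 5):
    z_eps = half_width(SD, n, Z_95)
    t_eps = half_width(SD, n, T_95[n - 1])
    print(f"N={n}: normal ±{z_eps:.2f}, Student's t ±{t_eps:.2f}")
    for name, mos in MOS.items():
        print(f"  encode {name}: {interval(mos, z_eps)}")
```

At 24 viewers the normal and t multipliers give nearly the same margin; at 5 viewers they diverge sharply, which is the small-sample penalty described above.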
The lesson is not "more people is nicer." It is that the half-width of a confidence interval scales with 1/√N, so cutting the panel from 24 viewers to 5 more than doubles the uncertainty (from ±0.36 to ±0.79), until two truly different encodes become statistically indistinguishable. Five viewers is not a small subjective test. It is not a subjective test.
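The same formula can be turned around to plan a panel: solve ε = 1.96 × S / √N for N. The helper below is a rough planning sketch under the normal approximation, with a hypothetical function name; it assumes you can guess the rating spread S in advance, for example from a pilot session, and it will underestimate N when the answer comes out small.

```python
from math import ceil


def viewers_needed(sd, target_half_width, multiplier=1.96):
    """Smallest N for which multiplier * sd / sqrt(N) <= target_half_width."""
    return ceil((multiplier * sd / target_half_width) ** 2)


# With the spread from the worked example (S = 0.90):
for target in (0.50, 0.25):
    print(f"±{target:.2f} needs about {viewers_needed(0.90, target)} viewers")
```

Halving the margin from ±0.50 to ±0.25 takes the requirement from 13 viewers to 50, roughly four times the panel, which is the 1/√N relationship seen from the planning side.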
When you actually need one
Running a panel is expensive in time and coordination, so the goal is not to test everything subjectively — it is to know when the metric is enough and when it is not. Reach for an objective metric, not a panel, when you are doing the everyday work the metric was validated for: comparing two encodes of the same source at the same resolution, watching for regressions in a pipeline, or setting a quality gate, all covered in choosing the right metric for the job. At scale, a metric you trust runs in milliseconds on every asset; a panel does not.
Reach for a subjective test in four situations. First, when you are validating or calibrating a metric on your own content — if your footage is unlike the streaming clips public metrics were trained on, the only way to know whether VMAF tracks your viewers is to measure both and compare. Second, when the content is a known metric blind spot: heavy film grain, banding-prone gradients, screen recordings, animation, or low-light surveillance footage, where the metric's number is least trustworthy. Third, when the decision is expensive or contested — a codec or encoder purchase, a ladder redesign, or a quality dispute with a vendor — where being wrong costs more than the test does. Fourth, when you need a golden reference: a small, well-run set of human scores that future automated checks can be anchored to. Outside those, trust the validated metric and save the panel for when the metric runs out of road.
Figure 3. A quick decision aid. Most work stays on the metric; the panel is for validation, blind-spot content, expensive calls, and building a golden reference.
The standards that make it defensible
You do not have to invent the controls — they are written down, and citing the recommendation is what makes a result defensible to a skeptical reviewer. Two recommendations carry the weight for video. ITU-T P.910 (10/2023) defines the test methods for multimedia video: Absolute Category Rating (ACR), where viewers rate single clips on the five-point scale; Degradation Category Rating (DCR), where they rate a clip against its visible original; and Pair Comparison (PC), where they pick the better of two. ITU-R BT.500-15 (05/2023) is the long-standing companion from the broadcast side, governing viewing conditions, the rating scales, and the confidence-interval arithmetic you just saw. Which method fits which job is the subject of test methodologies: ACR, DCR, PC.
Two recency notes matter, because standards move and citing the current edition is part of your credibility. ITU-T P.913, the older recommendation for testing "in any environment," was deleted in February 2024 and its content folded into the 2023 edition of P.910, so P.910 (10/2023) now also covers tests run outside a formal lab. Crowdsourced testing, where you recruit a large remote panel over the internet, has its own discipline: the reliability mechanisms come from ITU-T P.808 (originally written for crowdsourced speech quality) adapted to video, and the trade-off between crowd speed and lab control is covered in crowdsourced subjective testing. Cite the edition, not just the number, every time.
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the reason we treat subjective testing as the ground truth is that our clients' content rarely matches the streaming clips public metrics were tuned on. Surveillance footage in low light, conferencing video over a lossy link, and user-generated e-learning clips all carry artifacts a streaming-trained metric was never validated against, so a borrowed VMAF number can quietly mislead. When a quality decision matters, we confirm the metric against a small, properly screened panel on the client's own content before we let it gate a release, and we record the method, the panel size, and the confidence interval behind every figure — the discipline documented in our benchmark methodology. That is how a quality number stays a measurement instead of a hopeful guess.
What to read next
- MOS, DMOS, and the rating scales
- Test methodologies: ACR, DCR, PC, and the ITU recommendations
- How objective metrics are validated against human scores
References
- Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, approved 29 October 2023. Tier 1 (official standard). The controlling recommendation for subjective video testing: defines ACR, DCR/DSIS, CCR/DSCS, Pair Comparison, and SAMVIQ; source stimuli; controlled vs uncontrolled environment; number and screening of subjects (§10, §12); and the data analysis including MOS/DMOS and evaluating objective metrics (§13.3). Now also incorporates the former P.913 "any environment" content. https://www.itu.int/rec/T-REC-P.910-202310-I/en
- Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector, May 2023. Tier 1 (official standard). The broadcast-side companion: viewing conditions, the grading scales, and the 95% confidence-interval method (ε = 1.96 · S/√N) applied to a MOS. https://www.itu.int/rec/R-REC-BT.500
- Recommendation ITU-T P.808 (06/2021), "Subjective evaluation of speech quality with a crowdsourcing approach," International Telecommunication Union. Tier 1 (official standard). The crowdsourcing reliability framework (gold standards, trapping/attention checks, ACR/DCR/CCR) originally written for speech and adapted by the field for crowdsourced video testing. https://www.itu.int/rec/T-REC-P.808
- ITU-T P.913 deletion notice — recommendation "Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment," deleted 2 February 2024, content incorporated into ITU-T P.910 (10/2023). Tier 1 (official standard status). The recency anchor for "cite the current edition." https://www.itu.int/rec/T-REC-P.913
- Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, M. Manohara, "Toward A Practical Perceptual Video Quality Metric," Netflix Technology Blog, June 2016. Tier 1 (metric-author defining work). Documents that VMAF is trained and validated on subjective (human) scores — the concrete proof that an objective metric's authority derives from the panel it imitates. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
- Video Quality Experts Group (VQEG), "Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment (FRTV Phase I)," approved June 2000. Tier 1 (validation-authority primary). The field-defining demonstration that objective models are graded by how well they predict subjective ratings — and that without enough spread in the human scores, no metric can be validated. https://www.vqeg.org/projects/frtv-phase-i/
- K. Brunnström et al. (VQEG), "Perceptual video quality assessment: the journey continues!" Frontiers in Signal Processing, 2023. Tier 5 (peer-reviewed/institutional). A current overview from the validation community stating plainly that human-perceived quality is the ground truth for video quality assessment and that subjective scores are both the reference and the training data for objective metrics. https://www.frontiersin.org/articles/10.3389/frsip.2023.1193523/full
- R. R. R. Rao et al., "A crowdsourced implementation of ITU-T P.910," 2022. Tier 5 (peer-reviewed/institutional). Documents the open-source crowdsourcing adaptation of P.910 for video — the practical bridge between lab subjective testing and large remote panels, and the basis for the crowdsourcing-vs-lab discussion. https://arxiv.org/abs/2204.06784
- Recommendation ITU-T P.1401 (01/2020), "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union. Tier 1 (official standard). The standard that formalizes how an objective metric is graded against subjective MOS (correlation, RMSE, outlier ratio) — the procedural backbone of "the panel is the ground truth." https://www.itu.int/rec/T-REC-P.1401-202001-I/en


