Why this matters
If you ship video, you will live by objective metrics, because a number is the only thing that scales to thousands of encodes — but those numbers will fool you unless you know what they stand in for. The score on your dashboard is a prediction of human opinion, not opinion itself, and it was trained and validated on someone else's content. Knowing the difference tells you when to trust the metric, when to stop and run a real subjective test, and why "raise the VMAF score" is not the same instruction as "make it look better." This is the conceptual hinge the rest of the Video Quality & Measurement section turns on: every metric in the objective-metrics block and every method in the subjective-testing block is one side of this split.
Two questions, not two answers to one question
When someone asks "is this video any good?", they are really asking one of two different things, and it pays to keep them apart.
The first question is subjective: would people who watch this say it looks good? The second is objective: when a formula is fed the pixels, does it output a high number? These are not two ways of measuring the same thing. One is a fact about human perception; the other is a fact about an algorithm. They usually agree, which is the entire point — but they are not the same, and the gap between them is where measurement mistakes live.
A kitchen analogy holds the idea in place. A subjective test is a taste panel: the only way to know whether a dish tastes good is to feed it to people and ask them, carefully. An objective metric is a sensor that measures sugar, salt, and acidity in milliseconds. The sensor is fast, cheap, and repeatable, and it is truly useful — but only because it was tuned to agree with the tasters. The tasters are the ground truth; the sensor is a convenient stand-in for them. Forget that, and you will one day optimize a recipe the sensor loves and the diners reject.
Figure 1. One video, two questions. The human answer (MOS) is the ground truth; the algorithm's answer is a proxy that only earns trust by being validated against it.
Subjective quality: the ground truth
Subjective quality is measured the only way perception can be measured directly — by asking humans. In a subjective test, people watch video clips under controlled conditions and rate what they see, and those ratings are averaged into a single number. That number, the Mean Opinion Score (MOS), is the closest thing the field has to truth, because video quality is defined by the people watching it.
The most common rating method is Absolute Category Rating (ACR), specified in ITU-T Recommendation P.910 (10/2023). A viewer sees one clip, then rates it on a five-point scale: 5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, 1 = Bad. ACR is popular because it is quick — a viewer can rate many clips in a session — though it can miss small impairments that a side-by-side method would catch, which is why other methods exist (covered in the subjective-testing methodologies article).
The conditions are not casual. ITU-R Recommendation BT.500-15 (05/2023) fixes the display, the viewing distance, and the room lighting, so that a rating means the same thing from one viewer to the next. ITU-T P.910 recommends at least 15 observers per test, because one person's opinion is noise — the signal only appears once you average enough of them. "I showed it to the team and it looked fine" is not a subjective test; it is an anecdote with a sample size of three.
Because individual opinions scatter, the MOS comes with a confidence interval, and reporting that interval is not optional.
Figure 2. How a subjective score is built. The single number 4.33 hides how much the viewers disagreed; the 95% confidence interval makes that spread visible.
Walk the arithmetic once. Suppose 15 viewers rate one clip and their scores add up to 65. The MOS is the mean:
MOS = 65 ÷ 15 = 4.33
But 4.33 is only half the answer. The viewers did not all say 4.33 — some said 5, some said 3 — and that spread matters. Taking the standard deviation of the 15 ratings (about 0.62 here) and dividing by the square root of the sample size gives a standard error of roughly 0.16. Multiplying by the Student's t value for 14 degrees of freedom (about 2.14) yields a 95% confidence interval of about ±0.34:
95% CI = 2.14 × (0.62 ÷ √15) ≈ ±0.34 → true MOS ≈ 3.99 to 4.67
So the honest result is "MOS 4.33, 95% CI ±0.34," not "MOS 4.33." A second encode scoring 4.5 is not reliably better than this one: 4.5 falls inside this clip's interval, and the second encode's own interval would overlap it. The confidence interval is what keeps a subjective test honest.
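The arithmetic above can be sketched in a few lines of Python. The 15 ratings are hypothetical, chosen only so that they sum to 65 and have a standard deviation near 0.62. The t value 2.145 is the two-sided 95% critical value for 14 degrees of freedom, hard-coded here as an assumption tied to this panel size.

```python
import math
import statistics

# Hypothetical ACR ratings from 15 viewers (six 5s, eight 4s, one 3); sum = 65.
ratings = [5] * 6 + [4] * 8 + [3]

# Two-sided 95% Student's t critical value for n - 1 = 14 degrees of freedom.
# Hard-coded for this sample size; a different panel size needs a different value.
T_CRIT_DF14 = 2.145


def mos_with_ci(scores, t_crit):
    """Return (MOS, half-width of the confidence interval)."""
    n = len(scores)
    mos = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    half_width = t_crit * sd / math.sqrt(n)
    return mos, half_width


mos, ci = mos_with_ci(ratings, T_CRIT_DF14)
print(f"MOS {mos:.2f}, 95% CI ±{ci:.2f}")
```

Returning the half-width alongside the mean makes it harder to report one without the other.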
One refinement is worth naming. When you have the pristine original on hand, you can run ACR with Hidden Reference (ACR-HR): viewers unknowingly rate the original too, and you subtract its score from each clip's score. The result is a Differential Mean Opinion Score (DMOS), which cancels out the fact that some source content is simply more pleasant to look at, isolating the quality lost in processing. MOS and DMOS both come from humans; DMOS just removes content bias.
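A minimal sketch of the ACR-HR subtraction follows. It assumes the per-viewer convention commonly attributed to ITU-T P.910, in which 5 is added to each difference so that a clip rated as good as its hidden reference lands back at 5. The function name, the `offset` parameter, and the ratings are all hypothetical.

```python
import statistics


def dmos(processed, reference, offset=5):
    """Differential MOS from paired per-viewer ratings.

    processed[i] and reference[i] are the same viewer's ratings of the
    processed clip and of the hidden reference. `offset` re-centres the
    differences on the 5-point scale (an assumed convention, see above).
    """
    if len(processed) != len(reference):
        raise ValueError("each viewer must rate both clips")
    diffs = [p - r + offset for p, r in zip(processed, reference)]
    return statistics.mean(diffs)


# Hypothetical ratings from five viewers.
processed_scores = [4, 3, 4, 4, 3]
reference_scores = [5, 4, 5, 4, 4]

print(dmos(processed_scores, reference_scores))
```

Because the subtraction is done per viewer before averaging, a viewer who rates everything harshly affects both terms and largely cancels out.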
The catch with all of this is cost. A subjective test needs people, a controlled room, careful design, and statistics — it is slow and expensive, and you cannot run one on every encode in a pipeline that produces thousands a day. That cost is the entire reason the second kind of measurement exists.
Objective quality: the fast proxy
Objective quality is measured by an algorithm that reads the pixels and outputs a number, with no humans in the loop. PSNR (Peak Signal-to-Noise Ratio) compares a compressed frame to the original pixel by pixel and reports the error in decibels. SSIM (Structural Similarity) looks at the structure of the image instead of raw pixel differences and typically scores from 0 to 1. VMAF (Video Multimethod Assessment Fusion), the fused, machine-learned metric Netflix built, scores from 0 to 100. Each of the three gets its own deep dive in the objective-metrics block.
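To make "pixel error in decibels" concrete, here is a small sketch of PSNR on a flattened 8-bit frame, using the standard definition 10·log10(MAX² / MSE). The pixel values are hypothetical, and a real implementation would run per frame over full-size planes, usually with an array library.

```python
import math


def psnr(reference, distorted, max_value=255):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: there is no error to report
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 luma frame, flattened, and a slightly distorted copy.
original = [52, 55, 61, 59]
compressed = [54, 55, 60, 57]

print(f"{psnr(original, compressed):.2f} dB")
```

Nothing in the function knows where in the picture an error sits or whether a viewer would notice it, which is the limitation the structural and learned metrics try to address.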
The appeal is obvious. An objective metric is fast, deterministic, and repeatable: feed it the same two files and it returns the same number every time, in seconds, for as many encodes as you can throw at it. That is what makes automated quality control possible — quality gates, regression tests, per-title encoding, and live monitoring all run on objective numbers, because nothing else scales.
Most of these metrics are full-reference: they need the pristine original to compare against, the way a proofreader needs the correct text to mark up a copy. Live streams and user-generated clips have no such original, which is why a whole separate family of no-reference metrics exists. But whether or not it needs a reference, every objective metric shares one defining property: it is a model of human opinion, not human opinion. Its entire value rests on how well it agrees with the people it is standing in for.
The bridge: every objective metric is validated against subjective scores
Here is the idea that ties the two questions together, and the single most important point in this article. A metric is never declared good because its math is elegant. It is declared good because, on a set of clips that humans have already rated, its output tracks the MOS. The humans come first; the metric is fit to them and then trusted to stand in for them.
The field has a standard way to measure that agreement, set out in ITU-T Recommendation P.1401 (01/2020) and used by the Video Quality Experts Group (VQEG). Three statistics do most of the work. The Pearson correlation coefficient (PCC) measures accuracy: how linearly the metric's scores follow the MOS. The Spearman rank-order correlation (SROCC) measures monotonicity: whether the metric ranks clips in the same order a human would, even if the absolute numbers differ. The root-mean-square error (RMSE) measures the typical size of the prediction error in MOS units, once the metric's scores have been mapped onto the MOS scale. A metric that scores high PCC and SROCC and low RMSE against human data is a good proxy; one that does not is rejected, no matter how clever it looks.
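The three statistics can be sketched in plain Python as below. This is an illustration, not the P.1401 procedure: the data is hypothetical, the metric is mapped onto the MOS scale with a simple least-squares line (an assumption), and the RMSE divides by the number of clips, so the Recommendation's exact mapping and normalisation may differ.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation: Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse_after_linear_fit(metric, mos):
    """RMSE in MOS units after a least-squares linear map from metric to MOS."""
    mx, my = sum(metric) / len(metric), sum(mos) / len(mos)
    slope = (sum((a - mx) * (b - my) for a, b in zip(metric, mos))
             / sum((a - mx) ** 2 for a in metric))
    intercept = my - slope * mx
    errors = [(slope * a + intercept - b) ** 2 for a, b in zip(metric, mos)]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical data: a 0-100 metric score and the MOS for six clips.
metric_scores = [35.0, 52.0, 60.0, 71.0, 84.0, 93.0]
mos_scores = [1.8, 2.6, 3.4, 3.3, 4.2, 4.6]

print(f"PCC   {pearson(metric_scores, mos_scores):.3f}")
print(f"SROCC {spearman(metric_scores, mos_scores):.3f}")
print(f"RMSE  {rmse_after_linear_fit(metric_scores, mos_scores):.3f}")
```

In this sample the metric swaps the order of two clips that viewers rated 3.4 and 3.3, which is what pulls SROCC below 1 even though the linear fit is close.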
Figure 3. How a metric is graded. Each point is one clip; the closer the cloud hugs the diagonal, the better the metric predicts human opinion. The orange points are where the metric lies — high score, low human rating.
Two concrete examples make this real. When SSIM was introduced by Wang, Bovik, Sheikh, and Simoncelli in 2004, the paper did not just define the math — it demonstrated that SSIM tracks human ratings more closely than the older pixel-error measures, validating the new metric against subjective data. VMAF goes a step further and is trained on subjective scores directly: Netflix collected a large set of human opinion scores, treated them as ground truth, and used machine learning (a support-vector regressor) to learn a function that predicts them, reaching correlations above 0.9 with the human DMOS on its test content. In both cases the metric's authority is entirely borrowed from the humans it was fit to. Strip away the subjective data and there is nothing left to call the metric "good" against.
This is why the section is built the way it is: the objective-metrics block and the subjective-testing block are not rivals. The subjective tests produce the truth; the objective metrics are the scalable approximations of that truth, and they are only as trustworthy as their last validation against it.
When the metric and your eyes disagree — who wins
Because an objective metric is a fit to past human data, it can fail on content unlike that data. Film grain, smooth gradients that break into banding, sharp screen text, very dark scenes, heavy motion, animation — each is a known weak spot where a metric's number can drift away from what people actually see. The catalogue of where objective metrics lie is a whole article of its own.
So when the metric says 90 and a room full of viewers says the clip looks like a 70, who is right? The viewers. A properly run subjective test is the ground truth by definition, and a disagreement means the metric has hit a blind spot on that content — not that the audience is wrong. The number is a proxy that failed; the people are the thing it was trying to predict. This single rule resolves most arguments about quality: the metric informs, the humans decide.
Common mistake: optimizing for the metric instead of the viewer. The moment a score becomes a target, people tune to inflate it — and many tricks raise an objective number without improving real quality. A touch of sharpening and added contrast can lift a VMAF score while making the picture look worse to a human, which is exactly why Netflix had to publish a "no-enhancement-gain" variant, VMAF-NEG. The lesson is Goodhart's law applied to video: a measure that becomes a target stops being a good measure. Treat the score as a summary, name the model behind it, report its confidence interval, and keep a subjective check in the loop for anything that matters.
Subjective vs objective at a glance
Read this table as the map for the whole question. The two approaches are not competitors to choose between once; they are two instruments with different jobs.
| Approach | What it measures | Cost & speed | Where it lies / its limits | Best for |
|---|---|---|---|---|
| Subjective (MOS / DMOS) | What humans actually perceive — the ground truth | Slow, expensive; needs ≥15 people and a controlled setup | Hard to scale; results carry a confidence interval and can vary by panel and conditions | Calibrating metrics, settling disputes, validating a new pipeline |
| Objective (PSNR, SSIM, VMAF) | A model's prediction of human opinion from pixels | Fast, cheap, deterministic; runs on every encode | Only as good as its last validation; fails on grain, banding, screen text, dark scenes, heavy motion, and "enhancement" tricks | Quality gates, regression tests, per-title encoding, monitoring at scale |
Table 1. The two instruments. Objective metrics scale; subjective tests tell the truth. A mature quality program uses both, with the second anchoring the first.
How they actually work together
You do not pick one of these forever. A working quality program runs objective metrics constantly and subjective tests deliberately, and the relationship between them is the whole craft.
Objective metrics are your everyday instrument. They watch every encode, fail a build when a regression drops the score below a threshold, drive per-title bitrate decisions, and monitor live quality — all the jobs that demand a number now, automatically, at scale. Subjective tests are your reference instrument. You run them when the stakes are high or the content is unusual: to validate a new encoding pipeline before you trust its numbers, to settle a "which one looks better" argument that the metric calls too close, and — most importantly — to re-anchor your metrics on your own content, since a metric validated on someone else's clips may not transfer to yours.
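As a rough illustration of that division of labour, the sketch below shows a quality gate that fails a build on a clear regression and flags near-threshold results for human review. Every name and threshold here is hypothetical; real values depend on the metric, the content, and how the metric was validated.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    needs_subjective_check: bool
    reason: str


def quality_gate(score, baseline, min_score=80.0, max_drop=2.0, margin=0.5):
    """Decide whether an encode passes, fails, or is too close to call.

    `score` and `baseline` are objective scores for the new encode and the
    last accepted one. All thresholds are hypothetical placeholders.
    """
    if score < min_score:
        return GateResult(False, False, "score below the absolute floor")
    drop = baseline - score
    if drop > max_drop:
        return GateResult(False, False, "regression larger than the allowed drop")
    if drop > max_drop - margin:
        # Passing, but too close to the limit for the metric to settle alone.
        return GateResult(True, True, "near the regression limit")
    return GateResult(True, False, "ok")


print(quality_gate(score=93.2, baseline=95.0))
```

The third branch is the design choice that matters: instead of pretending the metric can resolve a marginal case, the gate passes the build and hands the question to a subjective check.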
Figure 4. Which question are you asking? The higher the stakes and the more unusual the content, the more a subjective test earns its cost — and the answer in practice is almost always "both."
The rule of thumb is simple: the more a decision costs and the further your content sits from what a metric was trained on, the more a subjective test is worth its price. For routine encodes of ordinary content, the metric is enough. For a launch, a new codec, or a piece of content the metric keeps getting wrong, ask the humans.
Where Fora Soft fits in
Fora Soft has built video products since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and across all of them we treat objective metrics and subjective judgement as two instruments, not one. We use objective scores to run quality gates and catch regressions automatically, because that is the only thing that scales across a real pipeline. But we anchor those numbers to human perception on the content that actually matters, so a passing score on a dashboard means the picture is actually good and not merely flattering the metric. When a metric and a viewer disagree on a client's content, we trust the viewer and go find the blind spot — which is the honest way to measure quality, and the basis for the original benchmark data in our benchmark methodology.
What to read next
- QoE vs QoS — experience is not the same as the network
- Full-reference, reduced-reference, no-reference — the three measurement setups
- Why subjective testing is the ground truth
References
- Recommendation ITU-T P.910 (10/2023), Subjective video quality assessment methods for multimedia applications. International Telecommunication Union. Tier 1. Defines the ACR method, the 5-point rating scale (5 Excellent – 1 Bad), the ≥15-observer guidance, and the ACR-HR / DMOS procedure. https://www.itu.int/rec/T-REC-P.910-202310-I/en
- Recommendation ITU-R BT.500-15 (05/2023), Methodologies for the subjective assessment of the quality of television images. International Telecommunication Union. Tier 1. Grading scales and the controlled display, viewing-distance, and lighting conditions that make a subjective rating valid. https://www.itu.int/rec/R-REC-BT.500
- Recommendation ITU-T P.1401 (01/2020), Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunication Union. Tier 1. The standard procedure for validating an objective metric against subjective scores — PCC, SROCC, RMSE, and outlier ratio. https://www.itu.int/rec/T-REC-P.1401
- Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. Tier 1 (metric-author). Introduces SSIM, validates it against human ratings, and compares it with the older error-visibility (MSE) framework. https://www.cns.nyu.edu/pub/lcv/wang03-reprint.pdf
- Netflix, VMAF — Video Multimethod Assessment Fusion (documentation and reference implementation), GitHub, and "Toward A Practical Perceptual Video Quality Metric," Netflix Technology Blog, 2016. Tier 1 (metric-author) / Tier 4 (blog). VMAF is full-reference and trained on subjective MOS/DMOS as ground truth via support-vector regression; 0–100 scale. https://github.com/Netflix/vmaf
- Recommendation ITU-T P.808 (06/2021), Subjective evaluation of speech quality with a crowdsourcing approach. International Telecommunication Union. Tier 1. The crowdsourced ACR/DCR/CCR subjective-testing methodology — defined for speech and widely adapted to scale video subjective testing when a controlled lab is not available. https://www.itu.int/rec/T-REC-P.808
- Z. Wang, A. C. Bovik, "Mean Squared Error: Love It or Leave It? A New Look at Signal Fidelity Measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009. Tier 5 (peer-reviewed). Why MSE / PSNR correlate weakly with perceived quality and why structural and perceptual metrics replaced them. https://ece.uwaterloo.ca/~z70wang/publications/SPM09.pdf
- Video Quality Experts Group (VQEG), Final Reports and validation test plans (FRTV, Multimedia, and successors). Tier 5 (institutional). The independent body that validates objective models against subjective databases using the P.1401-style statistics. https://vqeg.org
- M. H. Pinson et al., Confidence Intervals for Subjective Tests and Objective Metrics That Assess Image, Video, Speech, or Audiovisual Quality, NTIA/ITS Technical Report TR-20-550, 2020. Tier 5 (institutional). The statistics of MOS confidence intervals and how to compare subjective and objective results. https://its.ntia.gov/publications/download/TR-20-550.pdf


