Why this matters

If you ship video — streaming, conferencing, surveillance, e-learning, or telemedicine — someone on your team is already making quality decisions, usually by eye and usually without a number. That works until you need to choose between two encoders, justify a bitrate, explain why last week's release looked worse, or prove to a customer that what you delivered was good. This article is the foundation for the rest of the Video Quality & Measurement section: it gives you the vocabulary and the mental model so the later deep dives on PSNR, SSIM, and VMAF land on solid ground. Read it and you will understand why a number beats an opinion — and why the number still needs to be read with care.

"Looks fine to me" is not a measurement

Start with the sentence every video team has said out loud: "the picture looks fine to me." It feels like an answer. It is not a measurement, and the gap between those two things is the whole reason this discipline exists.

A measurement has three properties that an opinion lacks. It is reproducible — run it again and you get the same answer. It is scalable — you can apply it to one clip or to a million streams without getting tired. And it is comparable — two people, two teams, or two encoders can be put on the same scale and ranked. "Looks fine to me" fails all three. It depends on who is watching, on the screen they are using, on the lighting in the room, and on whether they just watched something better or worse a minute ago.

Think of it like tasting soup for salt. One cook's "needs more salt" is a real reaction, but it does not transfer: it cannot be emailed, automated across a thousand bowls, or used to settle an argument between two kitchens. To run a kitchen at scale you need a number — grams of sodium per serving — even though the number is not the same thing as the taste. Video quality measurement is the work of building that number for "how good does this look," knowing all along that the number and the experience are related but not identical.

The reason we cannot skip straight to the number is that the human visual system does not see the way a computer counts. The eye is more sensitive to brightness than to color, more forgiving of error in busy textured regions than in smooth ones, drawn to faces and motion, and almost blind to detail it is not currently looking at. A change that doubles the raw pixel error can be invisible, while a tiny change in the wrong place — a band across a clear sky, a block on a cheek — leaps out. Any honest measurement has to respect that the target is perceived quality, not arithmetic on pixels.

What "quality" even means for compressed video

Before you can measure quality, you have to decide which quality you mean, because the word hides two different ideas.

The first is fidelity: how close is the video you delivered to some original? Compression throws information away to make files smaller, and fidelity asks how much the result drifted from the source. This is the natural question when you have a pristine master to compare against — a movie, a finished render, a recorded stream.

The second is perceptual quality: how good does the video look to a human, regardless of any original? This is the only question that matters for live broadcast, video calls, and user-generated content, where there is no pristine master sitting next to the impaired version — the camera feed is the source, already compressed, already imperfect.

These two ideas usually agree, which is why they get blurred together, but they can split apart. A modern AI upscaler can produce a frame that looks better to viewers than the low-resolution source it was built from — high perceptual quality, low fidelity, because it invented plausible detail that was never in the original. (How quality is improved with machine-learning models belongs to the AI for Video Engineering section; here we only measure the result.) Keep the distinction in your pocket: most of this section measures fidelity against a reference, but the hardest real-world cases have no reference at all.

A second subtlety is that "quality" is not one dimension. A video can be sharp but stutter, smooth but banded, perfectly encoded but delivered with a three-second freeze. Picture quality (how each frame looks) and Quality of Experience — the whole end-to-end feeling of watching, including stalls and start-up delay — are different measurements with different tools. We will separate them carefully in QoE vs QoS and in the streaming-experience block; for now, hold that "quality" spans both the frame and the playback.

The ground truth: ask humans, carefully

Because quality lives in human perception, the only true measurement is to ask humans — not casually, but under controlled conditions designed so the answer means something. This is subjective testing, and it is the bedrock the entire field rests on.

In a subjective test, a panel of viewers watches a set of video clips under specified conditions — a calibrated display, a fixed viewing distance, controlled lighting — and rates each one. The most common scale is the five-point Absolute Category Rating, where 5 is "excellent," 4 "good," 3 "fair," 2 "poor," and 1 "bad." Average everyone's scores for a given clip and you get its Mean Opinion Score (MOS) — the single number that represents how that clip looked to people. A close cousin, Differential Mean Opinion Score (DMOS), rates how degraded an impaired clip looks compared to its source, which sharpens sensitivity to small impairments.
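The averaging step can be sketched in a few lines of Python. The ratings below are invented for illustration, and the normal-approximation confidence interval is an assumption of this sketch, not the data-cleaning procedure the standards prescribe.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ACR ratings must be between 1 and 5")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors (an assumption of this sketch).
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from a 15-viewer panel for one clip.
ratings = [4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval next to the MOS is a reminder that the score comes from a finite panel: fewer viewers or more disagreement widens it.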

These methods are not improvised. They are defined in international standards — chiefly ITU-T Recommendation P.910 (2023) for multimedia and ITU-R Recommendation BT.500-15 for television pictures — that specify the rating scales, the viewing conditions, the number of viewers, and how to clean the raw data. We treat subjective testing as its own block later; what matters now is the principle: when a metric and a properly run subjective test disagree, the subjective test wins. The eye is the ground truth. The metric is a model of the eye that failed on that content.

A subjective test is a carefully run taste panel. Done right, it is the most trustworthy quality answer you can get. Done wrong — too few viewers, an uncontrolled screen, leading instructions — it produces confident numbers that mean nothing, which is why running one well is a skill in its own right.

The fast proxy: objective metrics

Subjective testing has one fatal limitation for daily engineering: it is slow and expensive. You cannot convene a panel every time you tweak an encoder setting, and you certainly cannot run one on every one of the millions of video files a service encodes. So the field built objective metrics — algorithms that look at the video and compute a quality score in software, in seconds, with no humans in the loop.

An objective metric is a stand-in for the subjective score. Its entire reason to exist is to predict what a panel of humans would have said, without the cost of asking them. That framing is the most important idea in this article, so hold onto it: every objective metric is a proxy validated against subjective scores. A metric is "good" only to the degree that its numbers track real human MOS across a wide range of content. When they stop tracking, the metric is wrong and the eye is right.

Figure 1. The measurement loop. Humans produce the ground-truth score (MOS); an algorithm produces a fast proxy; the proxy earns trust only by correlating with the humans.

Figure description: a reference is encoded into an impaired video, humans rate it to produce a Mean Opinion Score that is the ground truth, an algorithm computes an objective score as a proxy, and the proxy is validated by how well it correlates with the human score.
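To make "validated by correlation" concrete, here is a minimal sketch that computes the Pearson (linear) and Spearman (rank-order) correlation between a metric's scores and panel MOS. The six data points are hypothetical, and the helper names are this sketch's own; a real validation would use many more clips and a documented statistical procedure.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: panel MOS and one metric's score for six clips.
mos_scores = [1.8, 2.6, 3.1, 3.9, 4.4, 4.7]
metric_scores = [41.0, 58.0, 55.0, 79.0, 90.0, 96.0]

print(f"Pearson  = {pearson(mos_scores, metric_scores):.3f}")
print(f"Spearman = {spearman(mos_scores, metric_scores):.3f}")
```

In this toy data the metric swaps the order of two clips, so the rank correlation falls below 1 even though the linear correlation stays high. That is the kind of disagreement a validation study is meant to surface.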

Three objective metrics dominate practice, and you will meet each in depth in Block 2. PSNR (Peak Signal-to-Noise Ratio) measures raw pixel error on a decibel scale: it is fast, universal, and a weak predictor of what the eye sees. SSIM (Structural Similarity Index Measure), introduced by Wang, Bovik, Sheikh, and Simoncelli in IEEE Transactions on Image Processing in 2004, compares the structure of two images and tracks perception better than PSNR; its scores for real video typically fall between 0 and 1, with 1 meaning the two images are identical. VMAF (Video Multimethod Assessment Fusion), Netflix's open-source metric, fuses several elementary measures and was trained directly on subjective scores to output a 0-to-100 number; its default model predicts quality for a 1080p living-room TV. Each is a different model of the same eye, with different strengths and different blind spots.

A metric is just arithmetic — here is the arithmetic

It helps to see, once, that a metric is not magic. PSNR is the clearest example because you can compute it by hand.

First you measure the average squared difference between the original and the compressed frame, pixel by pixel — the Mean Squared Error (MSE). Then you convert it to decibels with one formula:

PSNR = 10 · log10( MAX² / MSE )   [decibels]

MAX is the largest possible pixel value: 255 for standard 8-bit video. Suppose a compressed frame has an MSE of 25 against its source, which is a small error (a root-mean-square difference of 5 levels out of 255). Plug it in:

PSNR = 10 · log10( 255² / 25 )
     = 10 · log10( 65025 / 25 )
     = 10 · log10( 2601 )
     = 10 · 3.415
     = 34.15 dB

That is the whole metric: count the pixel differences, square and average them, take a logarithm. Most "good" PSNR values land between 30 and 50 dB, and higher is better. But notice what the formula never looks at: where the errors are, whether they fall on a face or in grass, whether they form a visible band or scatter harmlessly into texture. PSNR counts wrong pixels the way a spell-checker counts wrong letters — useful, but blind to whether the sentence still reads well. That blindness is exactly why SSIM and VMAF exist, and why no single number is ever the last word.
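The same arithmetic can be sketched in Python. The four-pixel "frames" are made up so that their MSE comes out at exactly 25, matching the worked example; real frames would be full arrays of luma or RGB samples.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)

def psnr(mse_value, max_value=255):
    """PSNR in decibels; infinite when the two frames are identical."""
    if mse_value == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse_value)

# Toy 8-bit "frames": every pixel is off by 5 levels, so MSE = 25.
reference = [100, 100, 100, 100]
distorted = [105, 95, 105, 95]

error = mse(reference, distorted)
print(error)                   # 25.0
print(round(psnr(error), 2))   # 34.15
```

Note that shuffling which pixels carry the error leaves the result unchanged, which is the blindness to error location described above.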

The three measurement setups: full-, reduced-, and no-reference

There is one more split that decides which metric you can even use, and it trips up newcomers constantly. It is the question of what you have to compare against.

A full-reference metric needs the pristine original sitting next to the impaired video, frame for frame. PSNR, SSIM, and VMAF are all full-reference: they measure the difference between source and result, so they only work when you possess the source. This is the comfortable case — a video-on-demand library, an encoder test, a regression suite — where the master file is on disk.

A reduced-reference metric does not need the whole original, only a compact set of features extracted from it — a few numbers that travel alongside the stream. It is a middle ground used when sending the full reference is impractical but you can afford to carry a fingerprint of it.

A no-reference metric, also called blind, gets only the impaired video and must judge quality with nothing to compare against, the way a person can glance at a photo and say "that looks compressed" without ever seeing the original. This is the hard case, and it is also the most common one in the real world: live broadcast, a video call, a security camera, a clip uploaded by a user. There is no master, because the camera feed is the only version that exists.

Figure 2. Full-, reduced-, and no-reference. The metric you can use is decided by how much of the original you still have.

Figure description: three measurement setups side by side. Full-reference compares the impaired video against the complete pristine original, reduced-reference compares against a small set of extracted features, and no-reference judges the impaired video alone with no original available.

This matters before you quote any score. Asking for a VMAF number on a live stream with no master is a category error — VMAF is full-reference, and there is nothing for it to reference. We give each setup its own treatment in full-, reduced-, and no-reference; the rule to carry now is simple: name your setup before you name your metric.
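One way to enforce "name your setup before you name your metric" in tooling is a small guard like the sketch below. The enum, function names, and error message are hypothetical, invented for this illustration; only the classification of PSNR, SSIM, and VMAF as full-reference comes from the text above.

```python
from enum import Enum

class Setup(Enum):
    FULL_REFERENCE = "full-reference"
    REDUCED_REFERENCE = "reduced-reference"
    NO_REFERENCE = "no-reference"

# The three metrics named in this article are all full-reference.
FULL_REFERENCE_METRICS = {"PSNR", "SSIM", "VMAF"}

def available_setup(has_full_original, has_reference_features):
    """Pick the measurement setup from what you can compare against."""
    if has_full_original:
        return Setup.FULL_REFERENCE
    if has_reference_features:
        return Setup.REDUCED_REFERENCE
    return Setup.NO_REFERENCE

def check_metric(metric, setup):
    """Raise if a full-reference metric is requested without a full original."""
    if metric in FULL_REFERENCE_METRICS and setup is not Setup.FULL_REFERENCE:
        raise ValueError(
            f"{metric} is full-reference, but the setup is {setup.value}"
        )

# A live stream with no master and no side-channel features.
live_setup = available_setup(has_full_original=False, has_reference_features=False)
print(live_setup.value)  # no-reference
```

Calling `check_metric("VMAF", live_setup)` raises a `ValueError`, which turns the category error into a failure at configuration time instead of a meaningless score later.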

Why a single number is always a summary

Suppose you have picked a metric and run it. You now have a score. The last hard truth of this article is that the score is a summary, and summaries hide things.

A video is thousands of frames. A metric produces a score for each frame, and then squashes all of them into one headline number — a step called pooling. The most obvious way to pool is the arithmetic mean: add up the per-frame scores and divide. But the average is exactly where bad moments go to hide.

Picture two encodes of the same ten-scene clip, scored 0-to-100 per scene:

Encode A scenes: 94 93 95 92 94 93 95 93 94 93   →  mean = 93.6
Encode B scenes: 96 97 96 70 97 96 97 96 97 96   →  mean = 93.8

By the average, Encode B wins, 93.8 to 93.6. But Encode B has one scene at 70 — a visibly ugly stretch a viewer will notice and remember — while Encode A never drops below 92. The average rewarded the encode with the worse worst-case. This is why experienced teams pool with the harmonic mean or a low percentile (say, the 5th percentile, the score the worst 5% of frames fall below), both of which punish bad frames instead of letting them average away. Pooling gets its own article in Block 2; the lesson here is structural.

Figure 3. Nearly the same average, different experience. The pooled number hid Encode B's one terrible scene, which is the scene the viewer remembers.

Figure description: a per-frame quality timeline comparing two encodes with nearly identical average scores. One stays flat and high while the other has a single deep dip that the average conceals, showing why one pooled number hides bad moments.

Common mistake: trusting a number without reading its label. A quality score is meaningless without four pieces of context: the metric (PSNR, SSIM, VMAF), the model (VMAF default vs phone vs 4K), the pooling (mean, harmonic, percentile), and the content it was run on. "VMAF 95" tells you nothing on its own — VMAF 95 on the phone model is not VMAF 95 on the default model, and a mean of 95 can hide a scene at 70. Treat any bare score as unfinished until you know all four.

The four jobs of a measurement program

With the concepts in place, here is what measuring quality actually buys you. A measurement program does four jobs, and almost every reason to measure maps to one of them.

The first job is to compare encoders and settings. When two codecs or two configurations claim to be better, a metric run on the same content, at the same resolution, with the same reference and model, ends the argument with a number instead of a hunch. This is the apples-to-apples discipline the whole section keeps returning to.

The second job is to set a quality target and a budget. Once you can measure quality, you can decide how much is enough — "every title ships at VMAF 93 or above" — and then spend the fewest bits to get there. A higher target costs bitrate and money; a measurement lets you choose the trade-off deliberately instead of guessing.
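As a minimal sketch of "spend the fewest bits to get there": given trial encodes of one title at several bitrates, pick the cheapest one that meets the target. The bitrates and scores are invented for illustration, and `cheapest_rung` is a hypothetical helper, not part of any encoder or VMAF tooling.

```python
def cheapest_rung(trial_encodes, target):
    """Return the lowest-bitrate (kbps, score) pair that meets the target, or None."""
    passing = [rung for rung in trial_encodes if rung[1] >= target]
    return min(passing, key=lambda rung: rung[0]) if passing else None

# Hypothetical trial encodes of one title: (bitrate in kbps, pooled VMAF).
trial_encodes = [(1500, 84.2), (2500, 90.1), (3500, 93.4), (5000, 95.8), (8000, 97.1)]

print(cheapest_rung(trial_encodes, target=93))  # (3500, 93.4)
print(cheapest_rung(trial_encodes, target=99))  # None
```

Raising the target from 93 to 95 in this toy data moves the choice from 3500 to 5000 kbps, which is the quality-versus-cost trade-off stated as a number.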

The third job is to catch regressions. Wire a metric into your build pipeline as a quality gate, and a change that quietly makes video worse gets flagged before it reaches users, not after they complain. This turns quality from a thing you notice in production into a thing you test in CI/CD.
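A quality gate can be as small as the sketch below. The floor of 93 echoes the example target used earlier in this article; the allowed drop of 0.5 points, the function name, and the messages are assumptions made for illustration. How the pooled scores are produced is outside this sketch.

```python
def quality_gate(candidate, baseline, floor=93.0, max_drop=0.5):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if candidate < floor:
        failures.append(f"score {candidate:.1f} is below the floor of {floor:.1f}")
    if baseline - candidate > max_drop:
        failures.append(
            f"score dropped {baseline - candidate:.1f} points from baseline {baseline:.1f}"
        )
    return failures

# Hypothetical pooled scores for the previous release and the new build.
problems = quality_gate(candidate=92.1, baseline=94.0)
for message in problems:
    print("QUALITY GATE FAILED:", message)
```

A CI step could exit with a non-zero status whenever the returned list is non-empty, so the build stops before the regression ships. Both checks are useful: the floor catches absolute failures, and the drop check catches a slow slide that stays above the floor.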

The fourth job is to prove delivered quality. When a customer, a partner, or a regulator asks "was the video you delivered actually good?", a defensible measurement is the answer — a documented number with its method attached, not a screenshot and a promise.

Figure 4. The four jobs. Nearly every reason to measure quality is one of these, and each wants the comparison done apples-to-apples.

Figure description: the four jobs of a video quality measurement program shown as four labelled panels: compare encoders and settings, set a quality target and budget, catch regressions in the pipeline, and prove delivered quality to a stakeholder.

The business case for each of these is concrete: where delivery cost scales with the bits sent, a measured 5% bitrate saving at equal quality is roughly a 5% cut in that cost across every stream, and a quality gate that catches one bad release can pay for the whole program.

The metric families at a glance

Here is the landscape in one table. Read it as a map, not a verdict — each later article fills in a row.

| Quality signal | Scale & units | What it measures | Where it lies (blind spot) | Reference needed |
| --- | --- | --- | --- | --- |
| MOS (subjective) | 1–5 (ACR) | What humans actually perceive (the ground truth) | Slow, costly; needs a controlled panel to be valid | A panel of viewers |
| PSNR | decibels (dB) | Raw pixel error vs the original | Ignores where errors fall; weak match to the eye | Full original |
| SSIM | 0–1 | Structural similarity (luminance, contrast, structure) | Misses some temporal and color artifacts; single-scale | Full original |
| VMAF | 0–100 | Fused, perception-trained prediction of MOS | Tied to its training content; gameable by sharpening | Full original |
| No-reference | varies | Quality of the impaired video with no original | Hardest to trust; accuracy varies widely by content | None |

Table 1. The quality-signal families. MOS is the truth; the rest are proxies for it, each with a named blind spot. Always pair a metric with its reference setup and model.

The table makes the section's stance visible. There is no single "best" metric — there is the right metric for the job, read with its blind spot in mind, validated against the eye. That stance is what separates what a metric can and cannot tell you from the marketing claim that one number settles everything.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and every one of those products lives or dies on delivered quality. Measurement is how we keep the promises the encoding and streaming work makes: we pick encoders by comparing them apples-to-apples, set per-title quality targets, gate releases so a regression never reaches a viewer, and report delivered quality to clients with the method attached. The question we answer is not "does it look fine?" but "what is the number, under what conditions, and where might it mislead?" Our own benchmark data, published in Block 7 of this section, exists so you can check our method rather than take our word.

References

  1. ITU-T Recommendation P.910 (10/2023), Subjective video quality assessment methods for multimedia applications. International Telecommunication Union. Tier 1. Defines ACR, DCR, and CCR test methods, rating scales, and viewing conditions. https://www.itu.int/rec/T-REC-P.910-202310-I/en
  2. ITU-R Recommendation BT.500-15, Methodologies for the subjective assessment of the quality of television images. International Telecommunication Union. Tier 1. Defines subjective assessment methodology, grading scales, and viewing conditions for television pictures. https://www.itu.int/rec/R-REC-BT.500
  3. ITU-T Recommendation P.10/G.100 (2017), Amd. 2 (05/2024), Vocabulary for performance, quality of service and quality of experience. International Telecommunication Union. Tier 1. Source of the official definition of Quality of Experience (QoE). https://www.itu.int/rec/T-REC-P.10
  4. ITU-T Recommendation E.800 (09/2008), Definitions of terms related to quality of service. International Telecommunication Union. Tier 1. Source of the official definition of Quality of Service (QoS). https://www.itu.int/rec/T-REC-E.800-200809-I/en
  5. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004. Tier 1 (metric-author). The defining SSIM paper. https://ece.uwaterloo.ca/~z70wang/publications/ssim.html
  6. ITU-T Recommendation P.1401 (01/2020), Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunication Union. Tier 1. Defines PCC, SROCC, RMSE and the evaluation procedure for validating metrics against subjective scores. https://www.itu.int/rec/T-REC-P.1401
  7. Netflix, VMAF Models documentation (resource/doc/models.md), Netflix/vmaf GitHub repository, accessed 2026-06-22. Tier 3 (first-party / metric-author). Default model vmaf_v0.6.1 (1080p HDTV, 3H), phone and 4K models, and NEG mode. https://github.com/Netflix/vmaf/blob/master/resource/doc/models.md
  8. R. Rassool, "VMAF reproducibility: Validating a perceptual practical video quality metric," and Netflix Technology Blog, "Toward A Practical Perceptual Video Quality Metric" (2016). Tier 4 (deployer engineering). Origin, training, and use of VMAF at scale. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
  9. ITU-T Recommendation P.1204 series (2020), Video quality assessment of streaming services over reliable transport for resolutions up to 4K. International Telecommunication Union. Tier 1. Bitstream, pixel, and hybrid models for adaptive streaming. https://www.itu.int/rec/T-REC-P.1204