Why this matters

Every objective metric you trust (PSNR, SSIM, VMAF) was trained on or validated against MOS or DMOS, so if you do not understand what those numbers are, you do not really understand what your metric is predicting. This article is for the video engineer, encoding lead, QA engineer, or product owner who has to run or commission a subjective test, read a quality paper, or defend a result to a skeptical reviewer. It is the second article in our subjective-testing block; the first, why subjective testing is the ground truth, explains why the human panel sits above every metric, and this one explains the numbers that panel produces. The encoder-operator's one-screen version of this material lives in the Video Encoding section's subjective-testing overview; everything here is the deep treatment behind it.

MOS: the one number a test produces

Start with the number you will see most often. When a panel of viewers rates a video clip, you collect one rating per viewer and average them. That average is the Mean Opinion Score, or MOS — the mean rating a group of viewers gave a single clip. The term comes from telephony, where ITU-T P.800.1 defined it for voice quality decades before video borrowed it, and it now means the same thing for a picture: the arithmetic mean of human opinion.

The formula is as plain as it sounds. If N viewers give a clip the ratings r₁, r₂, … rₙ, the MOS is their sum divided by their count:

MOS = (r₁ + r₂ + … + rₙ) / N

Say eight viewers rate a clip 4, 5, 4, 3, 5, 4, 4, 5. The sum is 34, so the MOS is 34 ÷ 8 = 4.25. The average viewer placed this clip a quarter of the way between "good" and "excellent." That single number is the headline output of almost every subjective test, and it is what almost every objective metric is ultimately trying to predict.
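
The same arithmetic as a minimal Python sketch. The helper name `mean_opinion_score` is illustrative only and does not come from any standard or library.

```python
from statistics import mean

# Ratings from the worked example: eight viewers on the 5-point scale.
ratings = [4, 5, 4, 3, 5, 4, 4, 5]

def mean_opinion_score(scores):
    """Arithmetic mean of one clip's ratings (hypothetical helper name)."""
    if not scores:
        raise ValueError("need at least one rating")
    return mean(scores)

mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}")  # MOS = 4.25
```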

A MOS reads like a school grade, and that familiarity is exactly where the first mistake hides — covered in the scale section below. For now, hold onto the plain definition: MOS is the average opinion of a panel, on a fixed scale, for one clip.

The scale the score rides on: Absolute Category Rating

A MOS is only as meaningful as the scale the viewers used, so pin the scale down before the number. The standard scale for rating video quality is the five-point Absolute Category Rating scale, or ACR — a method where each clip is shown on its own and rated independently, defined in ITU-T P.910 (10/2023), the controlling recommendation for subjective video testing. "Absolute" means the viewer judges the clip on its own merits, with no reference clip to compare against — the way a real viewer meets a video in the wild.

The five points are labelled, not just numbered (ITU-T P.910, §8.1):

| Rating | Label |
|---|---|
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |

This is the scale behind the overwhelming majority of MOS numbers you will ever read. The viewer sees a clip, picks one of the five words, and the software stores the matching number. Average those numbers across the panel and you have the clip's MOS.

Figure 1. The ACR five-point scale. Viewers pick a labelled category; the numbers behind the labels are averaged into the clip's MOS.

There is a subtlety here that the school-grade analogy hides, and it matters for how you are allowed to do the arithmetic. The five points are an ordinal scale — the categories are ordered (excellent beats good beats fair), but the gaps between them are not guaranteed to be equal. The perceptual distance from "bad" to "poor" may not equal the distance from "good" to "excellent." Treating the labels as evenly spaced numbers is a convention the whole field accepts so it can compute a mean at all, but it is an approximation, not a law of nature. It is why a MOS of 4.0 is not "twice as good" as a MOS of 2.0, and why MOS values are best read as a ranking with spacing, never as a ratio. More on that trap below.

Two different questions: absolute quality vs degradation

Here is the conceptual fork that the rest of this article hangs on. There are two genuinely different questions you can ask a viewer, and they produce two different numbers.

The first question is "how good is this clip?" — absolute quality, judged on its own. That is what ACR measures, and its average is a MOS. The second question is "how much worse is this clip than the original?" — degradation, judged against a reference the viewer can see. Asking it changes the method and the number.

When you show the pristine source first and the processed clip second, and ask the viewer to rate the impairment of the second relative to the first, you are running Degradation Category Rating, or DCR — known in the broadcast standard ITU-R BT.500-15 as the Double Stimulus Impairment Scale (DSIS). The scale changes too, from quality words to impairment words (ITU-T P.910, §8.2):

| Rating | Label |
|---|---|
| 5 | Imperceptible |
| 4 | Perceptible, but not annoying |
| 3 | Slightly annoying |
| 2 | Annoying |
| 1 | Very annoying |

The split is not academic. P.910 is explicit that ACR "may be insensitive to some impairments that are easily detected by" DCR — a slight dulling of colour, for instance, can pass unnoticed when there is nothing to compare against, but jumps out the moment the viewer sees the original beside it. DCR is the method of choice "when evaluating the fidelity of transmission with respect to the source signal," which is "frequently an important factor in the evaluation of high-quality systems" (ITU-T P.910, §8.2.1). The cost is throughput: showing two stimuli per rating produces roughly half as many ratings in the same session time.

Figure 2. Two questions, two scales. ACR rates a single clip's quality (Excellent–Bad); DCR/DSIS rates the impairment of a processed clip against its visible reference (Imperceptible–Very annoying).

DMOS: rating the gap, not the picture

When the question is about degradation, the average of those impairment ratings is a Differential Mean Opinion Score — DMOS, the mean of how much quality was lost, not how much remains. DMOS exists to remove a stubborn confound: in plain ACR, a viewer's rating mixes their opinion of the impairment with their opinion of the content itself, and with their personal habit of being harsh or generous. DMOS cancels those out by always measuring against a reference the same viewer rated.

There are two ways the field computes a DMOS, and confusing them is a real source of error.

The ACR-HR method: hidden-reference removal

The cleanest definition comes from ITU-T P.910's ACR with Hidden Reference (ACR-HR). You run an ordinary ACR test, but you quietly slip the pristine source clips into the set as if they were ordinary test items — the viewer rates them without knowing they are references. Then, in analysis, you subtract. For each viewer and each processed video sequence (PVS), the differential viewer score (DV) is (ITU-T P.910, §8.6.2):

DV = V(PVS) − V(REF) + 5

where V(PVS) is that viewer's ACR score for the processed clip, V(REF) is the same viewer's score for the matching hidden reference, and the + 5 shifts the result back onto a familiar 1-to-5 scale where 5 means "as good as the reference."

Walk it through. A viewer rates the hidden reference 5 (Excellent) and the processed clip 3 (Fair). Their DV is 3 − 5 + 5 = 3: the processing cost two points. Now take a harsher viewer who marks everything down — they give the same reference a 4 and the same processed clip a 2. Their DV is 2 − 4 + 5 = 3, identical. The two viewers used the scale completely differently in absolute terms, but hidden-reference removal recovers the same degradation, because it measures each viewer against their own reading of the source. That cancellation is the entire reason DMOS exists.
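
A short sketch of that cancellation in Python, using the two viewers from the walk-through. The function name and the viewer labels are illustrative assumptions; only the formula itself comes from the text above.

```python
def differential_viewer_score(v_pvs, v_ref):
    """DV = V(PVS) - V(REF) + 5 for one viewer and one processed clip."""
    return v_pvs - v_ref + 5

# (hidden-reference score, processed-clip score) per viewer, from the example above.
viewers = {"generous": (5, 3), "harsh": (4, 2)}

dv = {
    name: differential_viewer_score(v_pvs=pvs, v_ref=ref)
    for name, (ref, pvs) in viewers.items()
}
print(dv)  # {'generous': 3, 'harsh': 3}
```

The absolute scores differ by a full point per viewer, yet both rows reduce to the same DV.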

Average every viewer's DV for a clip and you have its DMOS. One edge case is worth knowing: a viewer can rate a processed clip higher than its reference, giving a DV above 5 — say a reference scored 4 and the processed clip 5, for a DV of 6. P.910 treats these as valid, but offers an optional "crushing" function so a few generous scores cannot drag the average up:

crushed_DV = (7 × DV) / (2 + DV),  for DV > 5

A DV of 6 crushes to (7 × 6) ÷ (2 + 6) = 42 ÷ 8 = 5.25, pulled gently back toward the top of the scale instead of floating free. Notice the direction here: in ACR-HR, higher DMOS means better quality, because the + 5 keeps it aligned with the ACR scale.
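
To see crushing and averaging together, here is a minimal sketch. The four-viewer panel is invented for illustration, and the function names are hypothetical; the formulas are the ones quoted above.

```python
from statistics import mean

def crush(dv):
    """Optional crushing function: applied only when DV exceeds 5."""
    return (7 * dv) / (2 + dv) if dv > 5 else dv

def acr_hr_dmos(pairs, crushing=True):
    """pairs: one (V(PVS), V(REF)) tuple per viewer, all for the same clip."""
    dvs = [pvs - ref + 5 for pvs, ref in pairs]
    if crushing:
        dvs = [crush(dv) for dv in dvs]
    return mean(dvs)

# Hypothetical panel: the last viewer rated the processed clip above its reference.
panel = [(3, 5), (2, 4), (4, 5), (5, 4)]

print(crush(6))                            # 5.25
print(acr_hr_dmos(panel))                  # 3.8125
print(acr_hr_dmos(panel, crushing=False))  # 4
```

With crushing switched on, the single DV of 6 contributes 5.25 instead, so the clip's DMOS drops from 4 to 3.8125.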

The database method: difference scores, Z-scored

The other DMOS you will meet lives in the academic quality databases that metrics are trained on — the LIVE Video Quality Database and the datasets behind VMAF. Here the difference is computed the opposite way round: a per-subject difference score subtracts the processed rating from the reference rating, so a bigger number means a bigger drop. Those raw differences are then standardised into Z-scores per viewer (subtract the viewer's mean, divide by their standard deviation) and linearly rescaled, commonly onto a 0–100 range. The reference clips collapse to zero difference and drop out.
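
The steps in that paragraph can be sketched as follows. This is an illustration under stated assumptions, not any specific database's procedure: the linear mapping of Z-scores from an assumed range of -3 to +3 onto 0–100, the use of the population standard deviation, and the tiny two-viewer data set are all choices made for this example.

```python
from statistics import mean, pstdev

def database_dmos(ref_scores, pvs_scores, z_lo=-3.0, z_hi=3.0):
    """
    ref_scores[v][c], pvs_scores[v][c]: viewer v's rating of clip c's
    reference and processed version. Returns one value per clip where
    higher means more degradation.
    """
    rescaled = []
    for ref, pvs in zip(ref_scores, pvs_scores):
        # Reference minus processed: a bigger number is a bigger drop.
        diffs = [r - p for r, p in zip(ref, pvs)]
        mu, sigma = mean(diffs), pstdev(diffs)
        if sigma == 0:
            raise ValueError("viewer has zero spread; cannot Z-score")
        z = [(d - mu) / sigma for d in diffs]
        # Assumed rescaling: z in [z_lo, z_hi] maps linearly onto 0-100.
        rescaled.append([100 * (zi - z_lo) / (z_hi - z_lo) for zi in z])
    n_clips = len(rescaled[0])
    return [mean([viewer[c] for viewer in rescaled]) for c in range(n_clips)]

# Hypothetical data: 2 viewers x 3 clips, each clip more impaired than the last.
ref = [[5, 5, 5], [4, 4, 4]]
pvs = [[4, 3, 1], [4, 2, 1]]
dmos = database_dmos(ref, pvs)
print([round(d, 1) for d in dmos])
```

The output rises from the first clip to the third, which is the "higher means a bigger drop" direction of this convention.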

The crucial consequence: in this convention, higher DMOS usually means worse quality, the opposite of the ACR-HR direction. Same three letters, inverted meaning. Netflix trained VMAF on differential subjective scores, which is why "a VMAF point predicts a DMOS point" is only coherent once you know which DMOS convention and which direction are in play. The content where a trained metric and the panel part ways is catalogued in where objective metrics lie.

From raw ratings to a defensible score

Whether you end up with a MOS or a DMOS, the average is the easy part. A bare average is a vibe with a decimal point; the steps around it are what make it a measurement. Three of them matter most.

Screen the viewers first. Some people rate inconsistently, misunderstand the task, or click through without watching. P.910 and ITU-R BT.500-15 both define post-experiment screening — typically correlating each viewer's ratings against the panel average and removing those who fall too far out of line — so that unreliable raters do not pollute the mean. You compute the MOS after screening, never before. The full machinery of screening and outlier rejection is the subject of the statistics of subjective data.
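
A minimal sketch of correlation-based screening, assuming a Pearson correlation against the panel's per-clip average and a cut-off of 0.75. Both the cut-off and the data are assumptions for illustration; the exact procedure and thresholds should be taken from the recommendation you are testing under.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant ratings carry no ranking information
    return cov / sqrt(var_x * var_y)

def screen_viewers(ratings, threshold=0.75):
    """ratings: {viewer_id: [score per clip]}. Returns the viewer ids to keep."""
    n_clips = len(next(iter(ratings.values())))
    panel_mean = [mean([r[c] for r in ratings.values()]) for c in range(n_clips)]
    return [v for v, r in ratings.items() if pearson(r, panel_mean) >= threshold]

# Hypothetical panel: v4 rates almost at random relative to the others.
ratings = {
    "v1": [5, 4, 3, 2, 1],
    "v2": [5, 5, 3, 2, 2],
    "v3": [4, 4, 3, 1, 1],
    "v4": [1, 3, 5, 2, 4],
}
kept = screen_viewers(ratings)
print(kept)  # ['v1', 'v2', 'v3']
```

Note that this sketch includes each viewer in the panel average they are compared against, which is a simplification; the MOS is then computed from the kept viewers only.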

Attach a confidence interval. People disagree, and the spread of their ratings tells you how much to trust the average. The spread is summarised by the standard deviation of the opinion scores, and ITU-R BT.500-15 turns it into a 95% confidence interval — the small ± range every MOS must carry:

ε = 1.96 × ( S / √N )

where S is the standard deviation of the ratings, N is the number of viewers, and 1.96 is the multiplier that brackets 95% of a normal distribution. For a clip rated by 24 viewers with a typical spread of S = 0.7, the margin is ε = 1.96 × (0.7 ÷ √24) = 1.96 × 0.143 = 0.28, so a MOS of 4.10 is reported as 4.10 ± 0.28. Report the MOS without that interval and you have hidden exactly the information a reader needs to know whether two clips really differ. (Why the panel must be large enough for that interval to be useful — and why five colleagues never are — is worked out in why subjective testing is the ground truth.)
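
The interval is a few lines of Python. This sketch assumes S is the sample standard deviation (the N − 1 form, which is what `statistics.stdev` computes); the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def ci95_margin(s, n, z=1.96):
    """Half-width of the 95% interval: z * S / sqrt(N)."""
    return z * s / sqrt(n)

def mos_with_ci(scores, z=1.96):
    """Return (MOS, margin) for one clip's screened ratings."""
    if len(scores) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    return mean(scores), ci95_margin(stdev(scores), len(scores), z)

# The worked example above: S = 0.7, N = 24.
print(round(ci95_margin(0.7, 24), 2))  # 0.28

# From raw ratings (hypothetical data).
mos, eps = mos_with_ci([4, 5, 4, 3, 5, 4, 4, 5])
print(f"{mos:.2f} ± {eps:.2f}")
```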

State the conditions. A MOS is meaningless in isolation. The same clip can score differently on a phone and a calibrated monitor, on a 5-point and an 11-point scale, under ACR and DCR. A defensible result names the method (ACR, DCR, ACR-HR), the scale, the number of screened viewers, and the viewing conditions, alongside the number itself.
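
One way to make that habit hard to skip is to store the score and its conditions in a single record. The sketch below is only an illustration: the class name, field names, and example values are hypothetical and not taken from any standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubjectiveResult:
    clip: str
    score: float             # MOS or DMOS
    ci95: float              # half-width of the 95% confidence interval
    method: str              # "ACR", "DCR", or "ACR-HR"
    scale: str               # e.g. "5-point"
    screened_viewers: int    # panel size after screening
    viewing_conditions: str  # device, distance, lighting

    def summary(self):
        return (
            f"{self.clip}: {self.score:.2f} ± {self.ci95:.2f} "
            f"({self.method}, {self.scale}, N={self.screened_viewers}, "
            f"{self.viewing_conditions})"
        )

# Hypothetical values for illustration.
result = SubjectiveResult(
    clip="clip_A", score=4.10, ci95=0.28, method="ACR", scale="5-point",
    screened_viewers=24, viewing_conditions="calibrated monitor",
)
print(result.summary())
```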

Figure 3. The path from raw ratings to a result you can defend: screen, average, then attach the confidence interval. Skip a step and the number stops being a measurement.

How precise can a well-run ACR test actually get? P.910 reports an empirical answer worth memorising. It defines the smallest MOS gap a test can reliably resolve and notes that ACR "rarely yields" a resolvable difference below 0.5 points for 24 subjects, 0.7 for 15, 1.1 for 9, and 1.5 for 6 (ITU-T P.910, §8.1.1). Read that as a floor on ambition: with two dozen viewers, do not expect to defend a MOS difference much smaller than half a point. It is the most concrete sensitivity guidance the standard gives, and most articles never mention it.
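
Those four thresholds can be turned into a quick sanity check. In this sketch, panel sizes between the quoted values fall back to the next smaller quoted size, which is a conservative choice made for this example and not something the recommendation specifies; the function names are illustrative.

```python
# Panel size -> smallest MOS gap that ACR rarely resolves below (figures quoted above).
RESOLVABLE_GAP = {24: 0.5, 15: 0.7, 9: 1.1, 6: 1.5}

def resolvable_floor(n_viewers):
    """Threshold for the largest quoted panel size not exceeding n_viewers."""
    eligible = [n for n in RESOLVABLE_GAP if n <= n_viewers]
    if not eligible:
        return None  # fewer than 6 viewers: no figure quoted
    return RESOLVABLE_GAP[max(eligible)]

def gap_is_defensible(mos_a, mos_b, n_viewers):
    """True if the MOS gap is at least the quoted floor for this panel size."""
    floor = resolvable_floor(n_viewers)
    return floor is not None and abs(mos_a - mos_b) >= floor

print(resolvable_floor(24))              # 0.5
print(resolvable_floor(12))              # 1.1
print(gap_is_defensible(4.1, 3.8, 24))   # False
print(gap_is_defensible(4.5, 3.0, 9))    # True
```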

MOS or DMOS — which number, and what it hides

The two numbers answer different questions, ride on different scales, and even run in different directions. Pick deliberately.

Figure 4. The same three letters, two directions. ACR-HR DMOS climbs with quality; the database convention climbs with degradation. The table below gives the full breakdown.

| | MOS (via ACR) | DMOS (via ACR-HR) | DMOS (database / DSIS) |
|---|---|---|---|
| Question asked | How good is this clip? | How much worse than the reference? | How much worse than the reference? |
| Method | Single stimulus (ACR) | Hidden reference, then subtract | Difference score, Z-scored |
| Scale | 1–5 (Excellent–Bad) | ~1–5, reference-anchored | often 0–100 |
| Direction | Higher = better | Higher = better | Higher = worse |
| What it measures | Absolute perceived quality | Quality lost, viewer bias removed | Quality lost, viewer bias removed |
| Where it lies | Confounds content liking with impairment | Larger confidence intervals than ACR; needs an excellent reference | Direction and rescaling vary by dataset |

Two honest caveats sit in that "where it lies" column. ACR's absolute number is contaminated by how much the viewer liked the content, so a beloved clip can out-score a better-encoded but duller one. ACR-HR removes that contamination but, P.910 warns, "may result in larger confidence intervals than ACR" — you buy bias removal with a little precision, and the method only works if the hidden reference is genuinely excellent or good, since a flawed reference distorts every difference computed from it.

The common mistakes

Rating-scale errors are quiet — they produce numbers that look fine and mean the wrong thing. Five recur often enough to name.

Mistake 1 — treating MOS as a ratio. A MOS of 4 is not "twice as good" as a 2, and a jump from 4.0 to 4.5 is not the same perceptual step as 2.0 to 2.5. The scale is ordinal; the numbers rank with rough spacing, they do not multiply. Compare MOS values by difference and significance, never by ratio.

Mistake 2 — reporting a MOS with no confidence interval. A MOS of 4.10 alone is unfalsifiable. A MOS of 4.10 ± 0.28 can be checked against another clip. The interval is not decoration; it is the part that makes the number evidence.

Mistake 3 — mixing MOS and DMOS, or two DMOS directions. A 4.2 MOS and a 4.2 ACR-HR DMOS are different quantities, and a "DMOS of 80" might mean excellent (some scales) or terrible (database convention). Always state the method and the direction before comparing.

Mistake 4 — comparing scores across different scales or methods. A MOS from a 5-point ACR test and a MOS from an 11-point or continuous-scale test are not interchangeable, and an ACR MOS cannot be compared to a DCR result clip-for-clip. Apples to apples means same scale, same method, same conditions.

Mistake 5 — too few viewers to resolve the difference. Per P.910's own numbers, six viewers cannot defend a gap below about 1.5 points. If your panel is small, your "winner" may be noise — check the gap against the resolvable floor for your sample size.

How many points should the scale have?

A natural instinct is that more scale points mean more precision, so an 11-point or continuous slider must beat the 5-point scale. The evidence says otherwise. P.910 discusses 9-point and 11-point variants that keep the same five anchor words (excellent, good, fair, poor, bad), and the SAMVIQ method uses a continuous 0–100 slider — but the recommendation is blunt that for continuous scales "the accuracy of the resulting MOS does not improve" while "the method becomes more difficult for subjects," who anchor to the labels and effectively use a handful of levels anyway (ITU-T P.910, §8.7.1). The standard cites support for scales of five to nine levels. The practical reading: the familiar 5-point ACR scale is not a limitation to engineer around — it is close to the sweet spot of what viewers can actually discriminate.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and on the projects where quality is contested, the choice between MOS and DMOS is the first decision we make, not an afterthought. When a client asks "is the new encoder better," we usually reach for DMOS with a hidden reference, because it strips out how much each viewer happened to like the test footage and measures only what the encoder changed. When the question is "is this good enough to ship to viewers," absolute MOS on the target device is the honest measure. Either way, we report the method, the screened panel size, and the confidence interval behind every figure — the discipline documented in our benchmark methodology — so the number survives the meeting it is quoted in.

References

  1. Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, approved 29 October 2023. Tier 1 (official standard). The controlling recommendation for subjective video testing: defines the ACR 5-point scale (§8.1), the DCR/DSIS impairment scale (§8.2), CCR (§8.3), and ACR-HR with the differential-viewer-score formula DV = V(PVS) − V(REF) + 5 and the crushing function (§8.6.2); the ACR sensitivity thresholds (ΔSCI 0.5/0.7/1.1/1.5 for 24/15/9/6 subjects, §8.1.1); the 9/11-point and continuous-scale guidance (§8.7.1); and subject counts (§10.1). Incorporates the former P.911 and P.913. https://www.itu.int/rec/T-REC-P.910-202310-I/en
  2. Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector, May 2023. Tier 1 (official standard). The broadcast-side companion: the Double Stimulus Impairment Scale (DSIS, the broadcast name for DCR), the grading scales, post-experiment subject screening, and the 95% confidence-interval method (ε = 1.96 · S/√N) applied to a MOS. https://www.itu.int/rec/R-REC-BT.500
  3. Recommendation ITU-T P.800.1 (07/2016), "Mean opinion score (MOS) terminology," International Telecommunication Union. Tier 1 (official standard). The recommendation that standardises MOS terminology and notation (its origin in telephony, the LQS/CQS qualifiers), establishing that MOS is the arithmetic mean of subjective opinion on a defined scale — the definitional source for the term video borrowed. https://www.itu.int/rec/T-REC-P.800.1
  4. K. Seshadrinathan, R. Soundararajan, A. C. Bovik, L. K. Cormack, "Study of Subjective and Objective Quality Assessment of Video" (the LIVE Video Quality Database), IEEE Transactions on Image Processing, vol. 19, no. 6, 2010. Tier 5 (peer-reviewed). The reference methodology for database-style DMOS: per-subject difference scores between reference and processed clips, Z-score normalisation, and linear rescaling — the convention in which higher DMOS means more degradation. https://live.ece.utexas.edu/research/quality/live_video.html
  5. Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, M. Manohara, "Toward A Practical Perceptual Video Quality Metric," Netflix Technology Blog, June 2016. Tier 1 (metric-author defining work). Documents that VMAF is trained on subjective scores (DMOS-style differential opinion) — the concrete link between a rating scale and the objective metric it produces, and the reason DMOS direction matters when reading a VMAF claim. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
  6. Recommendation ITU-T P.913 (deletion notice), International Telecommunication Union; recommendation deleted 2 February 2024, content incorporated into ITU-T P.910 (10/2023). Tier 1 (official standard status). The recency anchor: the "any environment" subjective-testing methods, plus the former P.911, are now consolidated into P.910 (10/2023) — cite the current edition. https://www.itu.int/rec/T-REC-P.913
  7. S. Winkler, "Mean Opinion Score (MOS) Revisited: Methods and Applications, Limitations and Alternatives," Multimedia Systems, 2016. Tier 5 (peer-reviewed). A practitioner-grade treatment of what MOS does and does not support — the ordinal-scale caution, the standard-deviation-of-opinion-scores (SOS) behaviour, and why MOS must be read with its spread, not as a precise point. https://stefan.winklerbros.net/Publications/mmsj2016.pdf
  8. Recommendation ITU-T P.1401 (01/2020), "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union. Tier 1 (official standard). Formalises how an objective metric is graded against subjective MOS/DMOS (Pearson correlation, RMSE, outlier ratio), and the statistical treatment — confidence intervals and significance — that a reported MOS must carry to be comparable. https://www.itu.int/rec/T-REC-P.1401-202001-I/en
  9. Recommendation ITU-T P.808 (06/2021), "Subjective evaluation of speech quality with a crowdsourcing approach," International Telecommunication Union. Tier 1 (official standard). The crowdsourcing reliability framework (gold standards, trapping/attention checks, ACR/DCR/CCR) originally written for speech and adapted by the field for crowdsourced video MOS collection; relevant to how raw ratings are screened before averaging. https://www.itu.int/rec/T-REC-P.808