Published 2026-05-17 · 17 min read · By Nikolay Sapunov, CEO at Fora Soft

Why this matters

If your team has ever argued about whether the new codec "looks good enough", whether a bitrate ladder rung can be dropped to save 20% of egress cost, or whether the encoder vendor's marketing graph is real, you are arguing about subjective quality whether you call it that or not. Objective metrics give you a fast, cheap proxy for human judgment, but every one of them — PSNR, SSIM, MS-SSIM, VMAF, XPSNR — was trained, validated, or rejected against a database of human MOS scores collected with one of the methods in this article. A product manager who can read a subjective study and tell the difference between a 24-viewer DSCQS lab test and a 100-rater crowdsourced ACR session is the one who does not get bluffed by a vendor PDF.

This article is for the product manager, video operations lead, or founder who needs to commission, read, or argue about subjective tests. We will start from what "subjective quality" really measures, walk through the four standard methods in plain language with the timing math shown the first time it appears, cover the modern crowdsourcing recommendations (ITU-T P.808/P.913), and finish with the practical rules of thumb that decide which method to use in which situation — plus the four traps that ruin most studies in industry practice.

What "subjective quality" actually measures

The term subjective quality in this article means one thing only: the opinion a real human viewer forms after watching a video clip, expressed on a numerical scale. Not how mathematically close the compressed copy is to the original, not whether the bitrate stayed under a budget, not whether an encoder hit its target VMAF. Just: did the viewer think it looked good, and if not, how bad?

To turn that opinion into a number an engineer can act on, the industry runs a subjective test. The procedure is older than digital video itself — early television networks used it to compare picture tubes — and the modern version is standardised by two ITU recommendations that every serious lab follows.

ITU-R BT.500-14 (October 2019, the fourteenth edition since 1974) is the older of the two. 1 It was written for broadcast television and assumes you can put viewers in a controlled lab, on a calibrated CRT or LCD, at a fixed viewing distance, in a room with neutral grey walls and controlled lighting. The recommendation is in three parts: general requirements (Part 1), the family of assessment methods (Part 2), and applications to specific image formats (Part 3). BT.500 is the document people are pointing at when they say "we ran the standard test".

ITU-T P.910 (October 2023, the third major revision) is the newer recommendation and the one most modern video studies cite. 2 It was written for multimedia — internet video, phones, tablets, laptops, varied displays — and it deliberately relaxes BT.500's lab constraints. Its companion ITU-T P.913 (June 2021) goes further still: it allows tests to be run in any environment, including the viewer's home, as long as every detail of that environment is reported so the study can be reproduced. 3 Together P.910 and P.913 are how the industry runs subjective studies on streaming video in 2026.

Both recommendations exist to produce the same output number — MOS, the Mean Opinion Score — but they describe different scoring methods, different sample sizes, different rooms, and different ways of presenting clips to viewers. The choice of method changes the cost, the time, the sensitivity to small impairments, and the kind of damage you can detect. The rest of this article is a tour of those methods.

[Figure: timeline diagram of subjective video quality test standards, from BT.500 (1974, first edition) through BT.500-14 (2019), ITU-T P.910 (1996, then 2008, 2023), ITU-T P.913 (2014, 2021), and ITU-T P.808 crowdsourcing (2021).]

Figure 1. Five decades of subjective quality recommendations. BT.500 is the broadcast-television root; P.910 and P.913 are the multimedia descendants that govern modern streaming studies; P.808 is the crowdsourcing branch born from speech research and now applied to video.

The MOS scale and the ACR labels

Before the methods themselves, you need the scoring scale. The reference point for ITU subjective tests is the five-point Absolute Category Rating scale, called ACR when it appears as a scoring system in its own right; the other methods either reuse its labels as anchors on a continuous line or swap in an impairment-worded equivalent with the same five steps.

The five labels are Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1). 5 The viewer sees a clip, ticks one of the five boxes, and the experimenter averages all the ticks for that clip across all viewers to compute the MOS for that clip. A clip rated as 5 by every viewer scores 5.00; a clip that exactly half the viewers called "Fair" and half called "Good" scores 3.50. The MOS sits between 1.00 and 5.00 — never higher, never lower — and the published number always comes with a confidence interval, typically the 95% interval that says how much the true average might differ from the observed one if you ran the study again with new viewers.
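The averaging step is small enough to show in full. The sketch below computes the MOS and a 95% confidence interval for one clip, using the normal approximation (1.96 times the standard error). The function name and the sample votes are our own illustrative assumptions, not taken from any ITU text, and a small panel may warrant a Student-t multiplier instead of 1.96.

```python
import math
import statistics

def mos_with_ci(votes, z=1.96):
    """Return (MOS, half-width of the ~95% CI) for one clip's ACR votes (1..5)."""
    if len(votes) < 2:
        raise ValueError("need at least two votes to estimate a confidence interval")
    if any(v not in (1, 2, 3, 4, 5) for v in votes):
        raise ValueError("ACR votes must be integers from 1 to 5")
    mean = statistics.mean(votes)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(votes)
    half_width = z * sd / math.sqrt(len(votes))
    return mean, half_width

# Hypothetical 24-viewer panel: half vote "Fair" (3), half vote "Good" (4).
votes = [3] * 12 + [4] * 12
mos, ci = mos_with_ci(votes)
print(f"MOS = {mos:.2f} ± {ci:.2f}")  # prints: MOS = 3.50 ± 0.20
```

The interval narrows with the square root of the panel size, so halving it takes roughly four times as many viewers.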

Two practical caveats sit underneath this neat scale, and ignoring either of them is how product teams end up arguing about meaningless numbers.

First, the labels are not equally spaced in the viewer's head. 5 The perceived "distance" from Bad to Poor is not the same as the perceived distance from Good to Excellent; subjective tests across multiple languages consistently find the gap between "Fair" and "Good" is the widest of the four. That means an MOS of 3.0 is not the perceptual midpoint between 1 and 5 — it is closer to "noticeably worse than acceptable". When you read a vendor chart that says "our encoder hits an average MOS of 3.4", remember that 3.4 is much further from "good" than the spacing on the scale suggests.

Second, MOS values from different studies are not directly comparable. The same clip shown in two different rooms, with two different viewer panels, on two different displays, can score differently by half a MOS point or more for purely contextual reasons — the lighting, the brightness range of the other clips in the same session, the calibration of the screen. A serious study reports the room, the display, the viewing distance, the panel composition, the clip order, and the session length, so other labs can repeat it. A vendor pitch that quotes an MOS without that context is using the number for marketing, not for science.

With the scale in hand, we can walk through the four methods that produce it.

DSCQS — the double-stimulus continuous scale

DSCQS, short for Double Stimulus Continuous Quality Scale, is the historical workhorse of BT.500 and the method most often cited as the gold standard for codec evaluation when small differences matter. 1 7

The procedure is built around pairs. For each test condition the viewer sees two clips back to back — one is the original, unimpaired reference, the other is the impaired version produced by your encoder, your transmission system, or whatever you are evaluating. The order inside the pair is randomised so the viewer does not know which clip is which. The viewer can replay the pair as many times as they want, then rates both clips on a continuous 0-to-100 line that is overlaid with the five ACR labels — Bad, Poor, Fair, Good, Excellent — but allows any value in between. The score for that condition is the difference between the rating given to the reference and the rating given to the impaired clip; the average difference across viewers is the Differential MOS (DMOS) for that condition. 9

Why the difference instead of the absolute rating? Because some viewers are systematically generous and some are systematically harsh, but most viewers are internally consistent: a harsh viewer may rate the reference at 85 where a generous one gives 95, and will rate the impaired copy at 70 where the generous one gives 80, so both report the same 15-point gap. Taking the difference cancels out the viewer's personal scale and isolates the impairment. That is why DSCQS is also the method industry papers reach for when they want to compare codecs at near-transparent quality, where the impairment is small and viewer-to-viewer variance would otherwise drown the signal.
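A few lines of Python make the bias cancellation visible. This is a minimal sketch: the function name and the three-viewer panel are hypothetical, and a real study would also apply viewer screening before averaging.

```python
import statistics

def dscqs_dmos(ratings):
    """ratings: list of (reference_score, impaired_score) pairs, one per viewer,
    each on the continuous 0-100 scale. Returns the mean difference (DMOS)."""
    diffs = [ref - imp for ref, imp in ratings]
    return statistics.mean(diffs)

# Hypothetical panel: a generous viewer, a harsh viewer, and one in between.
panel = [(95, 80), (85, 70), (90, 76)]

impaired_only = [imp for _, imp in panel]
per_viewer_diff = [ref - imp for ref, imp in panel]

print("impaired scores:", impaired_only)     # 10 points apart between viewers
print("differences:    ", per_viewer_diff)   # at most 1 point apart
print("DMOS:", round(dscqs_dmos(panel), 2))
```

The raw impaired scores disagree by 10 points, while the per-viewer differences disagree by 1: the personal offset of each viewer has dropped out.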

The drawback is time. A typical DSCQS session pairs each impaired clip with its reference, plays each pair twice (the viewer can request more), and gives the viewer time to rate both halves on the continuous scale. A 2003 NTIA / ITS study that compared eight methods head-to-head measured the average per-condition time at roughly 41 seconds for DSCQS against 12 seconds for the simplest ACR variant. 7 At those rates, 60 minutes of pure rating time covers roughly 88 DSCQS conditions against 300 for ACR. For a 24-viewer panel evaluating 300 conditions, DSCQS is a half-day commitment for every viewer (about 3.4 hours of rating) plus screening time; ACR is a one-hour commitment. Across hundreds of conditions the cost difference is the difference between a research project and a production pipeline.

Use DSCQS when the impairments are small, when you need to defend a near-transparent codec result against a peer reviewer, or when the absolute ordering of two very close encoder configurations matters more than throughput. Reach for something faster when you have hundreds of conditions to grind through.

[Figure: diagram comparing the four subjective test methods (DSCQS, DSIS, ACR, ACR-HR), showing for each the presentation order, scale used, time per condition, and what the method is best for.]

Figure 2. The four standard subjective test methods compared on presentation, scale, throughput, and best-use. DSCQS and DSIS are double-stimulus (reference shown to the viewer); ACR and ACR-HR are single-stimulus (only the impaired clip is shown).

DSIS — the double-stimulus impairment scale

DSIS, short for Double Stimulus Impairment Scale, is DSCQS's faster cousin and also part of BT.500. The two share the double-stimulus idea — the viewer sees a reference and an impaired clip in each trial — but DSIS streamlines the rest of the procedure in three ways. 10

First, the pair is presented in a known, fixed order: reference always first, impaired always second. The viewer knows which is which. Second, the pair is shown only once; there is no replay loop. Third, the viewer rates the impairment — how much worse the second clip looked compared to the first — on a five-point impairment scale: Imperceptible (5), Perceptible but not annoying (4), Slightly annoying (3), Annoying (2), Very annoying (1).

That impairment scale is the same five-point structure as ACR with relabelled words, so the average score is computed the same way and the result is reported as MOS or DMOS depending on the lab's convention. The advantage of asking about impairment specifically is that the viewer's attention is locked onto the difference between the two clips — they are not being asked to judge the absolute quality of either clip, only to grade how much the second one was hurt.

DSIS is sharper than DSCQS when the impairment is large and obvious — transmission errors, severe blocking, frame drops — because the impairment scale's anchors are designed around damage rather than around "is the picture nice". It is also faster than DSCQS because the pair plays once and the scale has five steps instead of a continuous slider. The 2003 NTIA / ITS comparison measured DSIS at roughly 20 seconds per condition, half DSCQS's time. 7

Use DSIS for transmission-quality studies, error robustness tests, packet-loss simulations, and any scenario where the impairments you care about are large enough that "how bad is the damage" is the right question to ask viewers. It is also the method behind most BT.500-based research published before 2010 in the broadcast and surveillance worlds.

ACR and ACR-HR — the single-stimulus methods

ACR, short for Absolute Category Rating, drops the double-stimulus framework altogether. The viewer sees only the impaired clip — no reference, no comparison — and rates its quality on the five-point ACR scale (Excellent, Good, Fair, Poor, Bad) right after the clip ends. 2 5

That sounds like an obviously worse design — how can a viewer judge quality without a reference? — but in practice it works well for two reasons. First, viewers carry an implicit reference in their heads from a lifetime of watching television and streaming video; they know what "Good 1080p" looks like and they will rate accordingly. Second, by avoiding the reference clip, ACR doubles your throughput. The 2003 NTIA / ITS study measured the simplest 5-step ACR at 12 seconds per condition — the fastest of all the methods compared, including DSCQS at 41 seconds and DSIS at 20 seconds. 7

ACR is the default method of ITU-T P.910 and the one most multimedia studies use. The 2023 revision of P.910 broadened it further: it allows alternative rating scales (a 9-point or 11-point continuous variant exists for studies that need finer resolution), it relaxes the lab constraints from BT.500, and it accepts the modern reality that viewers are watching on phones, tablets, and laptops in non-laboratory rooms. 2

ACR-HR ("ACR with Hidden Reference") is the variant that fixes ACR's biggest weakness, the lack of any correction for viewer bias. In an ACR-HR session, the unimpaired reference clip is mixed in among the impaired clips and the viewer rates it on the same scale as everything else, without being told it is the reference. Because no viewer is perfectly consistent, the reference does not always score 5.00, and the per-condition score can be corrected by subtracting the impaired rating from the reference rating, giving you a DMOS like the one in DSCQS but without the double-stimulus throughput penalty. ACR-HR has roughly the throughput of plain ACR and the bias-cancelling property of DSCQS, which is why P.910 recommends it whenever a reference clip exists. 2
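The hidden-reference correction is a per-viewer lookup followed by a subtraction. The sketch below assumes a flat list of votes tagged with viewer, source clip, and condition; the tuple layout, the "ref" label, and the sample data are hypothetical choices for illustration, not a format defined by P.910.

```python
from collections import defaultdict
import statistics

def acr_hr_dmos(votes, reference_label="ref"):
    """votes: iterable of (viewer, source_clip, condition, score), ACR scores 1..5.
    Returns {(source_clip, condition): DMOS}. Each viewer's score for an impaired
    clip is subtracted from that same viewer's score for the hidden reference."""
    ref_score = {}
    impaired = []
    for viewer, clip, condition, score in votes:
        if condition == reference_label:
            ref_score[(viewer, clip)] = score
        else:
            impaired.append((viewer, clip, condition, score))

    diffs = defaultdict(list)
    for viewer, clip, condition, score in impaired:
        diffs[(clip, condition)].append(ref_score[(viewer, clip)] - score)
    return {key: statistics.mean(values) for key, values in diffs.items()}

# Hypothetical votes: viewer v2 is one point harsher than v1 across the board.
votes = [
    ("v1", "lecture", "ref", 5), ("v1", "lecture", "1.2M", 3),
    ("v2", "lecture", "ref", 4), ("v2", "lecture", "1.2M", 2),
]
print(acr_hr_dmos(votes))
```

Both viewers report the same two-point drop even though their absolute scores differ. Some labs report the difference as impaired minus reference plus 5, which keeps the number on the familiar 1-to-5 range (our understanding is that P.910 writes it this way); the sign and offset are a reporting convention, so a study should state which one it used.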

The trade-off with ACR-family methods is sensitivity at high quality. Several published comparisons report DSCQS as more sensitive than ACR when impairments are very small — the kind of "is one encoder a hair better than another" question that matters in codec research papers. 7 8 For most production decisions in 2026 — bitrate ladder design, codec selection, encoder vendor evaluation — ACR-HR's speed wins.

A worked example anchors the throughput trade-off. Suppose you have 30 conditions to evaluate, 24 viewers per condition, and a 60-minute session limit per viewer (after which fatigue degrades the data). Time math for each method:

DSCQS:  30 conditions × 41 s/condition = 1,230 s = 20.5 min of pure rating
DSIS:   30 × 20 s = 600 s = 10 min
ACR:    30 × 12 s = 360 s =  6 min

Add screening, training, instructions, breaks, and inter-clip pauses, and a 60-minute session fits comfortably with ACR, fits with DSIS, and starts to crowd with DSCQS. Push the condition count to 60 and DSCQS no longer fits in one session at all — you split the panel or extend to two days, doubling viewer recruitment cost.
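The same budget check can be scripted so it is easy to rerun when the condition count changes. This sketch uses the per-condition times quoted above; the 25-minute overhead allowance is an assumption of ours for screening, training, and breaks, not a figure from the study.

```python
# Per-condition rating times quoted above from the 2003 NTIA / ITS comparison.
SECONDS_PER_CONDITION = {"DSCQS": 41, "DSIS": 20, "ACR": 12}

def rating_minutes(method, conditions):
    """Pure rating time in minutes, excluding training, breaks, and pauses."""
    return SECONDS_PER_CONDITION[method] * conditions / 60

def fits_session(method, conditions, session_min=60, overhead_min=25):
    """overhead_min is an assumed allowance for screening, training, and breaks."""
    return rating_minutes(method, conditions) + overhead_min <= session_min

for n in (30, 60):
    for method in SECONDS_PER_CONDITION:
        print(f"{method:5s} {n:3d} conditions: "
              f"{rating_minutes(method, n):5.1f} min rating, "
              f"fits one session: {fits_session(method, n)}")
```

With that assumed overhead, all three methods fit at 30 conditions, and at 60 conditions only DSCQS (41 minutes of pure rating) spills over the 60-minute limit.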

How many viewers? Lab vs crowdsourcing

A subjective test is a statistics exercise as much as a video exercise, and "how many viewers do I need" is the question every commissioner asks first.

The lab answer comes from BT.500 and P.910. ITU recommends at least 15 viewers for a valid result, with 24 viewers as the common target for a controlled-environment study and 35 viewers for a public or relaxed-environment study. 6 11 Below 4 viewers the math is no longer defensible; above 40 viewers the marginal precision is not worth the cost. A 24-viewer panel with proper screening and outlier removal typically produces a 95% confidence interval of plus-or-minus 0.5 to 0.7 MOS points on a five-point scale, which is enough to separate codecs that differ by half a quality step. 6

Outlier removal is part of the procedure. BT.500 Annex 2 defines a screening method built on a kurtosis test: the kurtosis of the panel's scores for each clip decides how wide the acceptance band around the panel mean should be, and a viewer whose scores fall outside that band too often is tagged as unreliable and removed from the MOS calculation. 12 ITU-T P.910's 2023 revision adds an iterative method based on the linear Pearson correlation coefficient (LPCC) between each viewer's scores and the running MOS, which removes systematically uncorrelated viewers more aggressively. Either way the rule is the same: report whom you removed and why.
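To make the correlation-based idea concrete, here is a simplified sketch of iterative screening. It is not the procedure as written in P.910 or BT.500: the 0.75 threshold, the stopping rule, and the sample scores are assumptions chosen for illustration, and the panel mean here includes the viewer being tested.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A viewer who gives every clip the same score carries no ranking information.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def screen_viewers(scores, threshold=0.75):
    """scores: {viewer: [score per condition]}, all lists the same length.
    Repeatedly drops the viewer least correlated with the current panel mean
    until every remaining viewer reaches the (assumed) threshold.
    Returns (kept_viewers, removed_viewers)."""
    kept = dict(scores)
    removed = []
    while len(kept) > 2:
        panel_mean = [sum(col) / len(col) for col in zip(*kept.values())]
        corr = {v: pearson(s, panel_mean) for v, s in kept.items()}
        worst = min(corr, key=corr.get)
        if corr[worst] >= threshold:
            break
        removed.append(worst)
        del kept[worst]
    return sorted(kept), removed

# Hypothetical panel of four viewers rating five conditions; v4 is erratic.
scores = {
    "v1": [5, 4, 3, 2, 1],
    "v2": [5, 4, 4, 2, 1],
    "v3": [4, 4, 3, 2, 2],
    "v4": [2, 5, 1, 4, 3],
}
kept, removed = screen_viewers(scores)
print("kept:", kept, "removed:", removed)
```

Recomputing the panel mean after each removal matters: one erratic viewer drags the mean toward their own scores, which can hide a second unreliable viewer until the first is gone.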

The crowdsourcing answer comes from a different recommendation. ITU-T P.808 (June 2021) was originally written for speech quality testing with crowdsourcing — paying online raters to listen to clips and score them — and the same machinery has been adapted for video. 4 Crowdsourcing drops the per-rater cost by 10–100× compared to a lab, broadens the device and environment mix to something closer to real viewers' homes, and gives access to hundreds or thousands of raters instead of dozens. The trade-off is noise: rater attention, screen calibration, network conditions, and viewing distance are all uncontrolled, so each rater is noisier than a lab viewer. The empirical fix is volume — published P.808-derived studies typically require 60 to 100 valid votes per condition to land at the same precision as a 24-viewer lab study. 4 13
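The "volume fixes noise" argument follows from the confidence interval scaling as the standard deviation divided by the square root of the vote count. The sketch below turns that into a vote estimate; the two standard deviations are hypothetical inputs, not measurements from the cited studies.

```python
import math

def votes_for_same_precision(lab_viewers, lab_sd, crowd_sd):
    """Votes per condition needed for the crowd CI to match the lab CI,
    assuming the CI half-width scales as sd / sqrt(n)."""
    return math.ceil(lab_viewers * (crowd_sd / lab_sd) ** 2)

# Assumed per-vote standard deviations: 0.7 MOS in the lab, 1.2 MOS in the crowd.
print(votes_for_same_precision(24, lab_sd=0.7, crowd_sd=1.2))  # prints: 71
```

Under those assumed values a 24-viewer lab panel maps to 71 crowd votes, inside the 60 to 100 range quoted above. The required count grows with the square of the noise ratio, so a crowd twice as noisy needs four times the votes.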

In 2026, the standard production pipeline for a streaming service evaluating a new codec or ladder is hybrid: a small, careful lab study under P.910 / BT.500 for the headline numbers and the validity argument, plus a large crowdsourced study under P.808 for the long tail of conditions, content types, and device categories that the lab cannot afford to cover.

The four traps that ruin most subjective tests

Across hundreds of studies in the industry and academic literature, the same four mistakes recur often enough to deserve their own callout. If you commission or read subjective studies, train your eye to spot these.

Trap 1 — comparing MOS values across studies. Two MOS numbers from two different labs, on different displays, with different viewer panels, are not the same number even if both are 3.7. The contextual factors — display brightness, room lighting, panel age and expertise, clip ordering, session length — shift the absolute MOS scale by half a point or more. A vendor chart that compares its 4.1 MOS to a competitor's 3.7 from a different study is comparing two thermometers that have not been calibrated to each other.

Trap 2 — running ACR on near-transparent impairments. ACR is fast and great for moderate to large impairments, but at the high-quality end of the scale (visually transparent codecs, two ladder rungs that differ by 50 kbps) the absence of a side-by-side reference makes viewers' implicit reference point too noisy. You will measure differences that are dominated by viewer noise rather than codec performance. Use DSCQS or DSIS for those experiments.

Trap 3 — treating MOS as an interval scale and doing arithmetic on it. Because the ACR labels are not equally spaced in the viewer's head, the average of a set of ACR ratings is technically the mean of an ordinal scale, which is mathematically suspect. 5 For most practical purposes the MOS is reported as a mean anyway because the industry has agreed to live with the approximation, but treating an MOS of 4.2 as "exactly 60% of the way from Fair to Excellent" is overinterpretation. Report the median alongside if your data are skewed; cite the 95% CI always.

Trap 4 — under-screening viewers. Real subjective studies remove unreliable viewers using the kurtosis / consistency procedure from BT.500 or the iterative LPCC procedure from P.910. 12 Studies that publish raw MOS without saying anything about outlier removal usually have one or two viewers pulling the average. Always look for the screening method in a study you are reading; if it is not there, the result has unknown reliability.

[Figure: diagram showing the four common subjective test pitfalls as a decision-tree style "avoid this, do this instead" card.]

Figure 3. The four traps that ruin most industry subjective studies, paired with the practical countermeasure for each.

A worked example: choosing the method for a real decision

Suppose your team is building a new e-learning platform and needs to decide whether to ship a 720p H.264 baseline ladder rung at 1.2 Mbps or 1.6 Mbps. The 1.2 Mbps version saves 25% of bandwidth on the slowest rung; you suspect viewers will not notice the difference.

A useful study looks like this. You assemble a panel of 24 viewers, screened for normal vision and recruited from a target audience (university students for an e-learning product). You prepare 10 source clips representative of the content — talking-head lectures, slide-and-voice screencasts, whiteboard captures, occasional 24 fps film inserts. You produce both ladder rungs for each clip, plus a hidden reference at the unimpaired source quality, for a total of 30 conditions.

Method choice: ACR-HR under ITU-T P.910 (2023) with the standard 5-point scale on a calibrated 1080p monitor at 3H viewing distance. Why ACR-HR? Because the differences you are testing (1.2 vs 1.6 Mbps on 720p H.264) are moderate, not near-transparent — viewers will probably see the artefacts on the lower rung. ACR-HR gives you throughput (12 s per condition × 30 conditions = 6 minutes of pure rating, comfortable in a 45-minute session), the bias correction of the hidden reference, and the option to compute both MOS and DMOS in the analysis.

Sample size: 24 viewers — the lab default — should give a 95% CI of plus-or-minus 0.5 MOS on the means. Outlier screening: BT.500 kurtosis method plus the P.910 LPCC iterative removal. Report the post-screening panel size in the result.

The output is a per-condition MOS with a confidence interval. If the 1.2 Mbps rung scores within 0.3 MOS of the 1.6 Mbps rung and both confidence intervals overlap, you ship the lower bitrate and save 25% of egress on that rung. If the gap is larger than 0.5 MOS with non-overlapping CIs, you keep the higher bitrate and rerun at 1.4 Mbps next quarter.
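Writing the decision rule down before the study runs keeps the debate from reopening once the numbers arrive. A minimal sketch follows, with the 0.3 and 0.5 thresholds taken from the paragraph above; the function name, the sample values, and the "inconclusive" middle outcome are our own assumptions for the cases the rule does not cover.

```python
def ladder_decision(mos_low, ci_low, mos_high, ci_high,
                    ship_gap=0.3, keep_gap=0.5):
    """mos_*: MOS of the lower / higher bitrate rung.
    ci_*: half-widths of the corresponding 95% confidence intervals."""
    gap = mos_high - mos_low
    # Two symmetric intervals overlap when the means are no further apart
    # than the sum of their half-widths.
    overlap = abs(gap) <= ci_low + ci_high
    if gap <= ship_gap and overlap:
        return "ship lower bitrate"
    if gap > keep_gap and not overlap:
        return "keep higher bitrate"
    return "inconclusive"

# Hypothetical study outcomes for the 1.2 Mbps and 1.6 Mbps rungs.
print(ladder_decision(3.9, 0.25, 4.1, 0.25))
print(ladder_decision(3.2, 0.20, 4.0, 0.20))
print(ladder_decision(3.6, 0.30, 4.0, 0.30))
```

The three calls return "ship lower bitrate", "keep higher bitrate", and "inconclusive" respectively. An inconclusive result usually means the panel was too small for the gap in question, which is a reason to add viewers rather than to pick a side.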

That is how a subjective study turns into a production decision instead of a debate.

Where Fora Soft fits in

Subjective video quality is a recurring problem in nearly every product we build at Fora Soft — across video conferencing, OTT and Internet TV, e-learning, telemedicine, and video surveillance. A telemedicine workflow has to convince a clinician the dermoscope feed is sharp enough to diagnose; a surveillance system has to convince an operator a face is recognisable across an entire ladder. The objective metric on the dashboard is the daily monitoring tool; the subjective study is what we run when the metric and the human disagree, when a customer reports "the new build looks worse" without numbers to back it, or when we want to compress a stream further than the metric alone would justify. We have run small, careful in-product panels and large crowdsourced studies, and the rule that has held across every project is the one in this article: be precise about the method, screen the viewers, and never compare MOS numbers across studies that were not calibrated against each other.

References


  1. ITU-R Recommendation BT.500-14, Methodologies for the subjective assessment of the quality of television images, October 2019. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.500-14-201910-S!!PDF-E.pdf 

  2. ITU-T Recommendation P.910, Subjective video quality assessment methods for multimedia applications, October 2023. https://www.itu.int/rec/T-REC-P.910-202310-I/en 

  3. ITU-T Recommendation P.913, Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment, June 2021. https://www.itu.int/rec/T-REC-P.913-202106-P/en 

  4. ITU-T Recommendation P.808, Subjective evaluation of speech quality with a crowdsourcing approach, June 2021. https://www.itu.int/rec/T-REC-P.808-202106-I/en 

  5. "Mean opinion score", Wikipedia, accessed 2026-05-17. https://en.wikipedia.org/wiki/Mean_opinion_score 

  6. "Subjective video quality", Wikipedia, accessed 2026-05-17. https://en.wikipedia.org/wiki/Subjective_video_quality 

  7. M. H. Pinson and S. Wolf, Comparing subjective video quality testing methodologies, NTIA / ITS, SPIE 2003. https://its.ntia.gov/publications/download/spie03subj.pdf 

  8. Anonymous authors, Comparative Study of Subjective Video Quality Assessment Test Methodologies, arXiv:2509.20118, 2025. https://arxiv.org/pdf/2509.20118 

  9. Elecard, "Video quality measurement metrics: SSIM, PSNR, MOS, DMOS, JND", reference page accessed 2026-05-17. https://www.elecard.com/page/article_objective_video_quality_metrics 

  10. ScienceDirect topic page, "Double Stimulus Impairment Scale", accessed 2026-05-17. https://www.sciencedirect.com/topics/engineering/double-stimulus-impairment-scale 

  11. TestDevLab, "TestDevLab's Approach to Subjective MOS Video Quality Evaluation", 2024. https://www.testdevlab.com/blog/testdevlab-approach-to-subjective-mos-video-quality-evaluation 

  12. M. H. Pinson et al., The Influence of Subjects and Environment on Audiovisual Subjective Tests, IEEE Transactions on Multimedia, 2015. https://its.ntia.gov/publications/download/PinsonIEEE_TransMM_Dec2015.pdf 

  13. B. Naderi, R. Cutler et al., A crowdsourcing approach to video quality assessment, arXiv:2204.06784, 2022. https://arxiv.org/abs/2204.06784