Why this matters
You run a subjective test when you need the real answer — does this encode actually look better to people — and not just a metric's guess. That answer becomes the reference your VMAF or SSIM numbers are tuned and trusted against, so if the test is wrong, the metric inherits the error and so does every encoding decision you make afterward. The trouble is that a broken subjective test does not announce itself: it returns a clean-looking table of Mean Opinion Scores with tidy averages, and only a careful reader can tell that the averages mean nothing. This article is for the streaming, encoding, or QA engineer who is about to run a quality study — in a lab or on a crowd platform — and wants to avoid the mistakes that have quietly wrecked subjective tests for decades. It is the failure-gallery capstone to the subjective-testing block: the ground-truth argument, the scales, the methods, the design, the run, and the statistics each give you the right way; this article collects the wrong ways so you can recognize them on sight.
The one idea behind every mistake
Before the gallery, the single principle that ties it together. A subjective test exists to isolate one variable — the video — so that any difference in the scores can only have come from the video. Every mistake below is a leak: some other variable (the panel, the scale, the order, the room, the analysis) sneaks into the score and you can no longer tell how much of the number is the video and how much is the leak. The fixes are not bureaucratic box-ticking. Each one plugs a specific leak that the standards bodies found, the hard way, in tests that failed. ITU-T P.910 says this in plain words: its scene-selection rules "represent years of experience and lessons learned from subjective tests that failed" (P.910 10/2023, clause 7.2.1). Read the gallery as that accumulated experience, organized.
Figure 1. The failure gallery at a glance. Eight recurring mistakes, grouped by where they corrupt the result — the panel, the scale, the session, the content, and the analysis. Each leaks a variable other than the video into the score.
Mistake 1 — Too few subjects
The most common mistake is also the most expensive to ignore: running the test on a handful of people because recruiting is slow. A subjective test estimates the average opinion of a population from a sample, and like any sample estimate it carries a margin of error — the 95% confidence interval, the band around your Mean Opinion Score (MOS) inside which the true average plausibly sits. The fewer subjects, the wider that band, and a wide band swallows the very differences you are trying to detect.
ITU-T P.910 puts hard floors under the sample size. After screening, every stimulus must be rated by at least 24 subjects in a controlled environment, and at least 35 in an uncontrolled one (P.910 10/2023, clause 10.1). The number 15, which older guidance and folklore still treat as "enough", is now explicitly the pilot-study number — for "trending or to explore modified protocols", clearly labelled as a pilot, not a result (clause 10.1). If your test panel is under 24, you do not have a result; you have a pilot.
The cost of going low is not abstract. P.910 reports the smallest mean difference the standard ACR method can reliably resolve (written ΔSCI, the difference-of-MOS confidence interval) at different panel sizes: roughly 0.5 at 24 subjects, 0.7 at 15, 1.1 at 9, and 1.5 at 6 subjects (clause 8.1.1, citing Pinson 2020). Work an example. Suppose your new encoder ladder scores MOS 3.9 and the old one 3.4, a real, meaningful gap of 0.5 on the five-point scale. Run it past 9 subjects and your resolvable difference is about 1.1; your 0.5 gap is less than half your noise floor, so the test cannot tell the two ladders apart even though the difference is real. Run the same comparison past 24 subjects and your resolvable difference drops to about 0.5: now the gap is right at the edge of detectability, and 30-plus subjects would put it comfortably within reach.
Resolvable MOS difference (ΔSCI) vs panel size, from ITU-T P.910 clause 8.1.1

| Panel size | Resolvable difference (ΔSCI) | What happens to a 0.5 MOS gap |
|---|---|---|
| 6 subjects | ~1.5 | Invisible (gap ≪ noise) |
| 9 subjects | ~1.1 | Still invisible |
| 15 subjects | ~0.7 | Borderline; gap < noise |
| 24 subjects | ~0.5 | Just resolvable (P.910 controlled floor) |
| 35 subjects | < 0.5 | Comfortably resolvable (uncontrolled floor) |
The fix: recruit to the floor, not below it — at least 24 valid (post-screening) subjects for a lab test, at least 35 for an uncontrolled or crowd test, and more when you are slicing the data by demographic or environment. Violated standard: ITU-T P.910 (10/2023) clauses 10.1 and 8.1.1; ITU-R BT.500-15 (05/2023) clause 2.5.1 ("at least 15 observers … fewer than 15 … should be identified as 'informal'"). The full power-and-sample-size argument is in the statistics article.
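As a planning aid, the panel-size figures quoted above can be turned into a quick check. The sketch below is illustrative only: the ΔSCI values are the ones this article cites from P.910 clause 8.1.1, while the function names and the choice to round a panel size down to the nearest tabulated entry are assumptions made here, not part of the standard.

```python
# Resolvable MOS differences as quoted in this article (P.910 clause 8.1.1).
DELTA_SCI = {6: 1.5, 9: 1.1, 15: 0.7, 24: 0.5}


def resolvable_difference(n_subjects):
    """Return the tabulated resolvable difference for a panel size.

    Assumption: sizes between table entries are rounded DOWN to the nearest
    tabulated size, which errs on the conservative side.
    """
    sizes = [n for n in sorted(DELTA_SCI) if n <= n_subjects]
    if not sizes:
        raise ValueError("panel is smaller than the smallest tabulated size (6)")
    return DELTA_SCI[sizes[-1]]


def can_resolve(mos_gap, n_subjects):
    """True if the expected MOS gap is at least the resolvable difference."""
    # A tiny tolerance avoids a false negative from floating-point rounding.
    return abs(mos_gap) >= resolvable_difference(n_subjects) - 1e-9


if __name__ == "__main__":
    expected_gap = 0.5  # the worked example: MOS 3.9 vs 3.4
    for n in (6, 9, 15, 24, 35):
        print(n, resolvable_difference(n), can_resolve(expected_gap, n))
```

Run before recruiting, a check like this tells you whether the difference you care about is even visible at the panel size you can afford. It is a lookup, not a power analysis.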
Figure 2. Why six subjects cannot see a real difference. The resolvable difference (ΔSCI, from P.910 clause 8.1.1) shrinks as the panel grows; a genuine 0.5-MOS gap stays buried in the noise until the panel reaches the 24/35 floor.
Mistake 2 — No stabilization, no anchoring
Drop a fresh viewer straight into the first real trial and the first several ratings are unreliable. People need a few clips to calibrate their personal use of the scale — to see roughly how bad "bad" gets and how good "good" gets in this test — before their scores stabilize. Skip that and the early ratings wander, dragging the average around.
BT.500-15 builds the fix into the session structure. About five "dummy presentations" go at the start of the first session to stabilize the observers' opinion, and their data is discarded ("must not be considered in the results of the test"); about three dummies open each subsequent session (BT.500-15, clause 2.6). Separately, the test must also anchor the scale to its range. Because most assessment methods "are sensitive to variations in the range and distribution of conditions seen", the session should include the full range of conditions — or approximate it by including extreme examples, either flagged as extremes (direct anchoring) or scattered through the session unflagged (indirect anchoring) (clause 2.4). Without anchors, two panels that saw different quality ranges will use the scale differently, and their scores will not be comparable.
The fix: open every session with stabilization clips you throw away, and seed the condition set with high- and low-quality anchors so the scale is pinned to a known range. Violated standard: ITU-R BT.500-15 (05/2023) clauses 2.6 (stabilization) and 2.4 (range and anchoring).
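The stabilization rule is easy to get wrong in tooling, usually by letting the dummy ratings slip into the analysis. The sketch below shows one way to tag dummies when the session is built and drop them before scoring. The function names and the `(clip, is_dummy)` representation are hypothetical; only the counts of about five and about three come from the BT.500-15 text cited above.

```python
def build_session(trials, dummies, first_session=True):
    """Prepend stabilization dummies and tag every presentation.

    Returns a list of (clip, is_dummy) pairs. About five dummies open the
    first session and about three open each later one.
    """
    needed = 5 if first_session else 3
    if len(dummies) < needed:
        raise ValueError(f"need at least {needed} dummy clips")
    opening = [(clip, True) for clip in dummies[:needed]]
    return opening + [(clip, False) for clip in trials]


def scored_ratings(session, ratings):
    """Keep only ratings for real trials; dummy ratings are discarded."""
    if len(session) != len(ratings):
        raise ValueError("one rating is required per presentation")
    return [r for (_clip, is_dummy), r in zip(session, ratings) if not is_dummy]


if __name__ == "__main__":
    session = build_session(["clip_a", "clip_b"], ["d1", "d2", "d3", "d4", "d5"])
    print(scored_ratings(session, [1, 2, 3, 4, 5, 4, 3]))
```

Tagging at build time means the discard does not depend on anyone remembering which clips were dummies once the ratings come back.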
Figure 3. Stabilize, then anchor. The first ~5 clips (grey) calibrate the viewer and are discarded; high and low anchors (green and orange) distributed through the real trials pin the scale to a known range so panels stay comparable.
Mistake 3 — Leading or inconsistent instructions
What you say to subjects before they rate becomes part of the measurement. Tell people "watch for blocking around the edges" and they will hunt for blocking and over-weight it; answer one subject's mid-session question with a hint and that subject is now running a different experiment from the rest. Instructions that nudge the rating, or that differ between people, inject the experimenter's expectation straight into the score.
The standards treat instructions as a controlled part of the protocol, not an ad-lib. ITU-T P.910 dedicates a clause to instructions and training (P.910 10/2023, clause 12.5, with sample wording in Appendix II), and the principle carried across editions is explicit: it "must not be implied that the worst quality seen in the training set necessarily corresponds to the lowest subjective grade on the scale", and questions about procedure "should be answered with care to avoid bias and only before the start of the session" (the long-standing P.910 instruction guidance). BT.500-15 adds that training sequences should demonstrate the range and type of impairments using material "other than those used in the test, but of comparable sensitivity" (BT.500-15, clause 2.5.3) — so the training primes the scale, not the specific clips.
The fix: write the instructions once, in neutral language, give every subject exactly the same script (ideally in writing), demonstrate the scale with non-test clips, and answer clarifying questions only before the session starts. Violated standard: ITU-T P.910 (10/2023) clause 12.5 and Appendix II; ITU-R BT.500-15 (05/2023) clause 2.5.3.
Mistake 4 — Sessions that run to fatigue
A tired or bored viewer is a noisy instrument. Push a panel through two hours of clips and the late ratings drift, cluster toward the middle of the scale, and stop tracking the video. Worse, long individual clips invite their own bias: when a stimulus runs long, viewers remember the start and the end more than the middle — the primacy and recency effects — and score the whole clip on those fragments.
P.910 sets limits on both the session and the clip. Each subject's participation should preferably be limited to 1.5 hours, of which no more than 1.0 hour is spent rating stimuli, with frequent breaks for anything longer (P.910 10/2023, clause 11.1). BT.500-15 is tighter on a single sitting: "a session should not last more than half an hour" (BT.500-15, clause 2.6). On clip length, P.910 intends stimuli of 5 to 20 seconds and prefers 8 to 10 seconds precisely because "test duration limitation also diminishes subjects' fatigue", and warns that for longer durations "it becomes difficult for viewers to take into account all of the quality variations and score properly" (clause 7.6). The detailed session-and-break structure lives in clause 12.6.
The fix: cap a single session near 30 minutes, cap a subject's total rating time near an hour, build in breaks, and keep individual stimuli short (10–30 seconds is typical). If the design needs more ratings than that allows, split it across sessions or subjects — do not extend the sitting. Violated standard: ITU-T P.910 (10/2023) clauses 11.1, 7.6, and 12.6; ITU-R BT.500-15 (05/2023) clause 2.6.
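The session arithmetic is worth doing before the design is frozen. Here is a minimal sketch that uses the 30-minute session cap and the one-hour rating cap quoted above; `vote_seconds` (the time allowed for voting after each clip) and the function name are assumptions for illustration.

```python
import math

MAX_SESSION_MIN = 30  # single-session cap (BT.500-15 clause 2.6)
MAX_RATING_MIN = 60   # per-subject rating cap (P.910 clause 11.1)


def plan_sessions(n_trials, clip_seconds, vote_seconds):
    """Estimate rating time and how many sessions one subject would need."""
    trial_seconds = clip_seconds + vote_seconds
    rating_minutes = n_trials * trial_seconds / 60
    trials_per_session = int(MAX_SESSION_MIN * 60 // trial_seconds)
    sessions = math.ceil(n_trials / trials_per_session)
    return {
        "rating_minutes": rating_minutes,
        "trials_per_session": trials_per_session,
        "sessions": sessions,
        # False means the design should be split across subjects as well.
        "within_subject_limit": rating_minutes <= MAX_RATING_MIN,
    }


if __name__ == "__main__":
    print(plan_sessions(n_trials=120, clip_seconds=10, vote_seconds=5))
    print(plan_sessions(n_trials=300, clip_seconds=10, vote_seconds=8))
```

When `within_subject_limit` comes back false, the design needs more subjects each seeing a subset, not a longer sitting.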
Mistake 5 — A biased or stale source set
A subjective test is only as representative as the clips it uses. Pick six easy-to-encode talking-head scenes and every codec will look great; pick six high-motion sports clips and every codec will look poor — and in both cases the result tells you about your scene choice, not about the codecs. Reusing the same favourite clips across every test compounds the problem: encoders and metrics quietly over-fit to them, and the test stops generalizing.
P.910 is unusually direct here because scene selection is where tests fail. Test scenes must be chosen so that their spatial information (SI, a measure of spatial detail) and temporal information (TI, a measure of motion) are consistent with the target service, and "the set of test scenes should span the full range of SI and TI of interest" (P.910 10/2023, clause 7.8). At least four different types of scene should be used (clause 7.8), four to six scenes are usually enough "if the variety of content is respected" (clause 7.7), and new sources should be introduced each time to avoid over-training on the same material (clause 7.2.7). Crucially, the report "must describe issues where the video scene selection deviated from or contradicted the advice given here" (clause 7.2.1): the standard makes biased content a disclosable defect, not a silent one.
The fix: choose a handful of scenes that span the full spatial-detail and motion range your service actually delivers, cover at least four content types, and rotate in fresh sources rather than reusing the same clips. Violated standard: ITU-T P.910 (10/2023) clauses 7.8, 7.7, 7.2.7, and 7.2.1.
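A source set can be sanity-checked mechanically once SI and TI have been measured for each candidate scene. The sketch below assumes those measurements already exist; the scene records, the target ranges, and the 80% coverage threshold are all hypothetical choices made for illustration, and only the four-content-type minimum comes from the P.910 text cited above.

```python
def check_source_set(scenes, si_range, ti_range, min_types=4, min_coverage=0.8):
    """Return a list of human-readable problems with a candidate source set.

    scenes: dicts with keys "name", "type", "si", "ti" (pre-measured values).
    si_range, ti_range: (low, high) ranges the target service delivers.
    min_coverage: hypothetical threshold for how much of each range the
    scenes must span; P.910 gives no such number.
    """
    issues = []
    types = {scene["type"] for scene in scenes}
    if len(types) < min_types:
        issues.append(f"only {len(types)} content type(s); need {min_types}")
    for key, (low, high) in (("si", si_range), ("ti", ti_range)):
        values = [scene[key] for scene in scenes]
        # Portion of the target range lying between the extreme scenes.
        covered = (min(max(values), high) - max(min(values), low)) / (high - low)
        if covered < min_coverage:
            issues.append(f"{key.upper()} spans {covered:.0%} of the target range")
    return issues


if __name__ == "__main__":
    easy_only = [
        {"name": "news", "type": "talking head", "si": 40, "ti": 5},
        {"name": "interview", "type": "talking head", "si": 50, "ti": 8},
    ]
    for problem in check_source_set(easy_only, si_range=(40, 120), ti_range=(5, 60)):
        print(problem)
```

A check like this only catches the coarse failure of a set that sits in one corner of the SI/TI plane. It does not replace looking at the scenes.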
Mistake 6 — No randomization: letting order and context bias the scores
The order in which clips appear changes how they are rated. Show a badly impaired clip right after a run of pristine ones and viewers rate it lower than they otherwise would; the same clip after a run of awful ones rates higher. This is the contextual effect, and it is a real, measured bias — not a rounding error.
BT.500-15 names it precisely: "Contextual effects occur when the subjective rating of an image is influenced by the order and severity of impairments presented" (BT.500-15, Annex 5). The standard's defence is randomization plus balance: "a random order should be used for the presentations (for example, derived from Graeco-Latin squares)", arranged so that tiredness and adaptation are "balanced out from session to session" (clause 2.6). Two further rules matter: never present the same content in consecutive trials, and never fix the reference as always first in a paired method, because a fixed order builds in a known bias (P.910 10/2023, clauses 12.7.4 and 8.3). BT.500's investigation also found the contextual effect is method-dependent: it was strongest for the double-stimulus impairment-scale variant and effectively absent for the double-stimulus continuous-quality-scale (DSCQS) method (Annex 5), so the method choice is itself part of the defence.
The fix: randomize the presentation order per subject (a different randomized playlist each), balance condition order across sessions, never show the same source back-to-back, and randomize which member of a pair comes first. Violated standard: ITU-R BT.500-15 (05/2023) clause 2.6 and Annex 5; ITU-T P.910 (10/2023) clauses 12.7.4 and 8.3.
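The randomization rules translate into very little code. The sketch below builds a reproducible per-subject playlist that never shows the same source twice in a row, and flips pair order at random. It uses simple rejection sampling (shuffle until the constraint holds) rather than the Graeco-Latin squares the standard mentions, and the function names, seed scheme, and `(source, condition)` tuples are assumptions for illustration.

```python
import random


def randomized_playlist(trials, subject_id, seed=0, max_tries=1000):
    """Return a per-subject order with no source repeated back-to-back.

    trials: list of (source, condition) tuples.
    The seed is derived from subject_id, so each subject gets a different
    but reproducible order.
    """
    rng = random.Random(f"{seed}:{subject_id}")
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(a[0] != b[0] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; too few distinct sources?")


def randomize_pair(first, second, rng):
    """Randomly decide which member of a pair is presented first."""
    return (first, second) if rng.random() < 0.5 else (second, first)


if __name__ == "__main__":
    trials = [(f"src{i}", c) for i in range(6) for c in ("low", "high")]
    for subject in (1, 2):
        print(subject, randomized_playlist(trials, subject))
```

Rejection sampling is adequate when there are several sources. It does not by itself balance condition order across sessions, which still needs a designed scheme.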
Figure 4. Order is a variable. A quality-sorted playlist lets the contextual effect push a strong impairment's score around depending on its neighbours; a per-subject randomized, balanced order removes that leak. DSCQS minimizes the effect by design.
Mistake 7 — Skipping screening, or misapplying observer rejection
Two opposite mistakes live here. The first is letting anyone rate without checking they can see the test: a viewer with uncorrected low vision or colour-blindness is measuring something other than your impairment. BT.500-15 requires that "prior to a session, the observers should be screened for (corrected-to-) normal visual acuity on the Snellen or Landolt chart, and for normal colour vision using specially selected charts (Ishihara, for instance)" (BT.500-15, clause 2.5.2).
The second, subtler mistake is over-using the post-test rejection procedure. After the test, you may remove subjects whose scores are statistically inconsistent with the panel — but BT.500's kurtosis-based rejection has hard limits written into it: it "should not be applied more than once to the results of a given experiment", and "use of the procedure should be restricted to cases in which there are relatively few observers (e.g. fewer than 20), all of whom are non-experts" (BT.500-15, clause A1-2.3.1). Run it twice to "clean up" the data, or apply it to a 60-person crowd panel, and you are no longer screening outliers — you are sculpting the result toward the answer you wanted. For larger or crowd panels, the correct tools are the correlation-based screening and the bias-and-consistency "soft rejection" model BT.500 describes instead (clauses A1-2.3.3 and A1-2.4).
Common mistake — rejecting subjects until the data looks good. The kurtosis rejection in BT.500-15 is a one-shot screen for small non-expert panels (fewer than ~20), not a knob you turn until the confidence interval shrinks. Applying it repeatedly, or to a large or crowd panel, manufactures the result. Use correlation-based or soft (bias/consistency) screening for larger panels, apply any hard screen exactly once, and report the original and adjusted means side by side (BT.500-15 clauses A1-2.3.1, A1-2.4, 2.7).
The fix: screen acuity and colour vision before the session; after it, apply a single, pre-declared rejection rule appropriate to the panel size, and report scores before and after rejection. Violated standard: ITU-R BT.500-15 (05/2023) clauses 2.5.2 and A1-2.3.1; the running-the-test article covers screening in depth, and the crowd-specific version (gold-standard and trapping trials) is in the crowdsourcing article.
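For readers who want to see what a one-shot screen looks like in code, here is a hedged sketch of a kurtosis-based observer screen with a guard for the limits quoted above. The guard (fewer than 20 observers, all non-experts, applied once) comes from the text cited in this article. The numeric thresholds inside `kurtosis_screen` (kurtosis between 2 and 4, cut-offs of 2 or √20 standard deviations, the 5% and 30% ratios) are this sketch's reading of the commonly described procedure and are not stated in this article, so verify them against clause A1-2.3.1 before relying on them.

```python
import math


def kurtosis_screen_allowed(n_observers, all_non_experts, already_applied):
    """Guard for the limits quoted in this article: small non-expert panel, once."""
    return n_observers < 20 and all_non_experts and not already_applied


def kurtosis_screen(ratings):
    """Flag observers whose scores are erratic relative to the panel.

    ratings[i][p] is observer i's score for presentation p.
    ASSUMPTION: thresholds below follow the commonly described beta2 test;
    check them against the standard text.
    """
    n_obs, n_pres = len(ratings), len(ratings[0])
    high = [0] * n_obs  # times each observer was far above the panel mean
    low = [0] * n_obs   # times each observer was far below it
    for p in range(n_pres):
        col = [ratings[i][p] for i in range(n_obs)]
        mean = sum(col) / n_obs
        m2 = sum((u - mean) ** 2 for u in col) / n_obs
        if m2 == 0:
            continue  # everyone agreed; nothing to count
        m4 = sum((u - mean) ** 4 for u in col) / n_obs
        beta2 = m4 / (m2 * m2)
        std = math.sqrt(m2 * n_obs / (n_obs - 1))  # sample standard deviation
        k = 2.0 if 2.0 <= beta2 <= 4.0 else math.sqrt(20.0)
        for i, u in enumerate(col):
            if u >= mean + k * std:
                high[i] += 1
            if u <= mean - k * std:
                low[i] += 1
    rejected = set()
    for i in range(n_obs):
        total = high[i] + low[i]
        # Reject only frequent outliers that deviate in BOTH directions.
        if total and total / n_pres > 0.05 and abs(high[i] - low[i]) / total < 0.3:
            rejected.add(i)
    return rejected


if __name__ == "__main__":
    panel = [[v] * 10 for v in (3, 3, 3, 3, 3, 3, 3, 4, 4, 2, 2)]
    erratic = [5 if p % 2 == 0 else 1 for p in range(10)]
    if kurtosis_screen_allowed(len(panel) + 1, True, already_applied=False):
        print(kurtosis_screen(panel + [erratic]))
```

Note the design of the final test: an observer who is consistently high or consistently low is a biased rater, not an erratic one, and is left in. That is one reason repeated application distorts the result rather than cleaning it.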
Mistake 8 — Reading MOS as an absolute number
The last mistake is the one that survives a perfectly run test: treating the Mean Opinion Score as an absolute, portable quantity. A MOS of 4.1 from your lab is not interchangeable with a MOS of 4.1 from another lab, another panel, or even your own test run with a different range of conditions. The score is a position on a scale that the test's own range and population defined; move the range and the number moves with it.
BT.500-15 states the limit directly: "because they vary with range, it is inappropriate to interpret judgements from most of the assessment methods in absolute terms" (BT.500-15, clause 2.7). There is also a measured scale-boundary effect: viewers avoid the extreme ends of the scale, so a "perfect" clip rarely averages a clean 5 (clause A1-3.3). The practical consequence is that MOS comparisons are valid within one test — the same panel, scale, range, and session — and become unreliable the moment you carry a number across to a different test. And the score is meaningless without its error bar: BT.500 requires that "for each test parameter, the mean and 95% confidence interval … must be given" (clause 2.7), alongside the test configuration, materials, display make and model, and the number and type of assessors. A bare MOS table with no confidence intervals and no method is not a reportable result.
The fix: compare conditions inside one test, never a raw MOS across tests or labs; always report the 95% confidence interval and the full method and provenance; re-anchor every session rather than assuming the scale carried over. Violated standard: ITU-R BT.500-15 (05/2023) clauses 2.7 and A1-3.3; the reporting requirements are in ITU-T P.910 (10/2023) clause 14.
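Reporting the interval alongside the mean takes a few lines. The sketch below computes a MOS with a 95% confidence interval and checks whether two conditions from the same test have overlapping intervals. It assumes the normal-approximation factor of 1.96; for small panels a Student-t critical value gives a wider, more cautious interval. The function names and sample ratings are illustrative.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of the 95% confidence interval.

    ASSUMPTION: normal approximation (z = 1.96); use a Student-t critical
    value instead when the panel is small.
    """
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def intervals_overlap(a, b):
    """True if two (mean, half_width) intervals overlap."""
    return abs(a[0] - b[0]) <= a[1] + b[1]


if __name__ == "__main__":
    # Hypothetical ratings for two conditions from the SAME test and panel.
    new_ladder = mos_with_ci([4, 4, 5, 3, 4, 4, 5, 3])
    old_ladder = mos_with_ci([2, 3, 2, 3, 2, 3, 2, 3])
    for name, (mean, hw) in (("new", new_ladder), ("old", old_ladder)):
        print(f"{name}: MOS {mean:.2f} +/- {hw:.2f}")
    print("overlap:", intervals_overlap(new_ladder, old_ladder))
```

Non-overlapping intervals are a quick visual check, not a substitute for a proper significance test, and the comparison only means something for conditions rated within the same test.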
Common mistake — comparing across sessions without re-anchoring. Each session's scale is re-established by its stabilization clips and anchors; scores do not automatically carry from one session or panel to the next. Re-anchor every session (the ~3 opening dummies on later sessions), repeat a few common conditions across sessions to check coherence, and compare differences within a test — not absolute numbers between tests (BT.500-15 clauses 2.6, 2.7).
Figure 5. MOS is relative, not absolute. Two sessions can place the same conditions at different absolute scores yet agree on the ranking; trust the within-test ordering and the confidence intervals, not a number carried across tests.
The failure gallery, in one table
| # | Mistake | Why it corrupts the score | The fix | Standard violated |
|---|---|---|---|---|
| 1 | Too few subjects | Wide confidence interval swallows real differences | ≥24 valid (controlled), ≥35 (uncontrolled); 15 = pilot only | P.910 §10.1, §8.1.1; BT.500-15 §2.5.1 |
| 2 | No stabilization / anchoring | Early ratings drift; scale not pinned to a range | ~5 discarded dummies in the first session, ~3 in later ones; distribute high/low anchors | BT.500-15 §2.6, §2.4 |
| 3 | Leading / inconsistent instructions | Experimenter expectation injected into ratings | One neutral written script; demo on non-test clips; questions before only | P.910 §12.5; BT.500-15 §2.5.3 |
| 4 | Sessions run to fatigue | Late ratings noisy; long clips trigger primacy/recency | ≤30 min/session, ≤1 h rating; 10–30 s clips | P.910 §11.1, §7.6, §12.6; BT.500-15 §2.6 |
| 5 | Biased / stale source set | Result reflects scene choice, not the system | Span full SI/TI; ≥4 content types; rotate fresh sources | P.910 §7.8, §7.7, §7.2.7, §7.2.1 |
| 6 | No randomization | Order and context bias the scores | Randomize per subject; balance; no same source back-to-back; randomize pair order | BT.500-15 §2.6, Annex 5; P.910 §12.7.4, §8.3 |
| 7 | Skipped / misapplied screening | Unfit viewers, or rejection over-used to sculpt data | Snellen/Ishihara pre-screen; one-shot rejection sized to panel | BT.500-15 §2.5.2, §A1-2.3.1 |
| 8 | MOS read as absolute | Scores are range- and panel-relative, not portable | Compare within a test; always report 95% CI + provenance | BT.500-15 §2.7, §A1-3.3; P.910 §14 |
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and when a quality decision genuinely turns on human opinion, we run the subjective test as carefully as the standards demand rather than improvising it. That means sizing the panel to the P.910 floor before we start, stabilizing and anchoring every session, randomizing per viewer, screening observers, and reporting every MOS with its confidence interval and full provenance — the same discipline we apply to our benchmark methodology. The payoff is a result a client can act on and a reviewer can trust: a number that measures the video and nothing else. When a difference is too small or too display-dependent for a fast test, we say so and book the controlled version instead of pretending a noisy panel settled it.
What to read next
- The Statistics of Subjective Data
- Designing a Subjective Test That Survives Scrutiny
- Crowdsourced Subjective Testing: Speed vs Control
Call to action
- Talk to a video engineer: book a 30-minute scoping call to talk through your subjective-testing plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union (ITU-T Study Group 12), approved 29 October 2023. Tier 1 (official standard). The controlling multimedia subjective-testing standard: minimum subjects (clause 10.1: ≥24 controlled, ≥35 uncontrolled, 15 = pilot), resolvable-difference figures (clause 8.1.1), stimulus duration (clause 7.6), session/fatigue limits (clause 11.1), scene selection and SI/TI (clauses 7.2.1, 7.7, 7.8), methods ACR/DCR/PC/ACR-HR (clause 8), instructions (clause 12.5), randomization (clause 12.7.4), and mandatory reporting (clause 14). Read directly from the ITU text on 2026-06-24. https://www.itu.int/rec/T-REC-P.910-202310-I/en
- Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector. Tier 1 (official standard). The controlling television subjective-assessment standard: observers and the "informal" rule (clause 2.5.1), screening (clause 2.5.2), range and anchoring (clause 2.4), session length and stabilization dummies (clause 2.6), absolute-interpretation warning and reporting list (clause 2.7), the kurtosis observer-rejection procedure and its one-shot / <20-non-expert limits (clause A1-2.3.1), correlation and soft rejection (clauses A1-2.3.3, A1-2.4), scale-boundary effect (clause A1-3.3), and the contextual effect (Annex 5). Read directly from the ITU PDF on 2026-06-24. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.500-15-202305-I!!PDF-E.pdf
- Recommendation ITU-T P.808 (06/2021), "Subjective evaluation of speech quality with a crowdsourcing approach," International Telecommunication Union (ITU-T Study Group 12). Tier 1 (official standard). The crowdsourcing framework whose qualification, gold-standard and trapping questions, and multi-stage screening are the crowd-panel answer to the screening and rejection mistakes (Mistake 7). Verified on the ITU recommendation page 2026-06-24. https://www.itu.int/rec/T-REC-P.808
- Recommendation ITU-T P.913 (06/2021, deleted 2 February 2024), "Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment," International Telecommunication Union (ITU-T Study Group 12). Tier 1 (official standard). The original any-environment (uncontrolled) video methods recommendation, deleted after its content was incorporated into ITU-T P.910 (10/2023); the source of the uncontrolled-environment subject floor used in Mistake 1. Verified on the ITU recommendation page 2026-06-24. https://www.itu.int/rec/T-REC-P.913
- K. Brunnström, M. Barkowsky, "Statistical quality of experience analysis: on planning the sample size and statistical significance testing," Journal of Electronic Imaging, vol. 27, no. 5, 053013, 2018. Tier 5 (peer-reviewed). The sample-size-and-power treatment behind Mistake 1: how the panel size, the effect size, and the desired significance jointly set how many subjects a subjective test needs. https://doi.org/10.1117/1.JEI.27.5.053013
- T. Hoßfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, P. Tran-Gia, "Best Practices for QoE Crowdtesting: QoE Assessment With Crowdsourcing," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 541–558, 2014. Tier 5 (peer-reviewed). The reliability machinery (qualification, gold standards, two-stage design) that prevents the crowd-panel forms of Mistakes 1 and 7, and the SOS-hypothesis framing of rating spread. https://www.keimel.org/publication/hossfeld-tom-2014/Hossfeld-TOM2014.pdf
- B. Naderi, R. Cutler, "A crowdsourced implementation of ITU-T P.910," 2022 (arXiv:2204.06784; ICASSP 2024; open-source microsoft/P.910). Tier 5 (peer-reviewed) / first-party tooling. The open-source crowd implementation whose rater, environment, hardware, and network qualifications plus gold and trapping questions are the crowd-panel fix for the screening mistake. https://arxiv.org/abs/2204.06784
- Video Quality Experts Group (VQEG), test plans and validation methodology (HDTV, Multimedia, and the broader VQEG materials), Institute for Telecommunication Sciences (NTIA/ITS). Tier 5 (institutional). The multi-lab subjective-test methodology and validation work from which much of the P.910/BT.500 "lessons from tests that failed" guidance derives. https://its.ntia.gov/research/qoe/video-quality-research/standards/subjective-testing/
- "Subjective video quality," orientation overview of common subjective-testing pitfalls and ITU methods. Tier 6 (educational, orientation only — not a primary citation for any standard claim). https://en.wikipedia.org/wiki/Subjective_video_quality


