Why this matters
The most expensive failure in subjective testing is a test that runs cleanly, produces tidy numbers, and is then taken apart by the first person who asks "but what content did you test, and on what screen?" — because by then the panel has gone home and the money is spent. This article is for the video engineer, QA lead, or codec evaluator who has to design a test others will cite, or who must read someone else's quality study and judge whether the design supports the claim. It is the build article in our subjective-testing block: the previous piece, test methodologies: ACR, DCR, PC, and the ITU recommendations, explained which question to ask viewers; this one covers everything around that question, and the next, running the test: participants, screening, and environment, covers execution. The encoder-operator's one-screen version lives in the Video Encoding section's subjective-testing overview; everything here is the deep treatment behind it.
What "survives scrutiny" actually means
A subjective test produces a number — a Mean Opinion Score, the average rating a panel of viewers gave a clip, explained in full in MOS, DMOS, and the rating scales. That number will be used to make a decision: ship this encoder, reject that bitrate ladder, trust this metric. And because it drives a decision, someone will eventually challenge it.
"Survives scrutiny" means the design answers every challenge before it is raised. It is not a question of running fancier statistics afterward; bad design cannot be rescued by good arithmetic. If you tested only slow-moving talking-head clips, no confidence interval will tell you how the encoder handles sports. If half your viewers watched on a calibrated monitor and half on a phone in sunlight, no outlier rejection will separate the codec's effect from the room's.
ITU-T P.910 makes this concrete in its final clause. Section 14, "Mandatory information to report on a subjective test," lists what a defensible test must document: the source content, the processing conditions, the environment, the subjects and how they were screened, the method, and the analysis (ITU-T P.910 §14). The clever inversion is to read that list before you design. Every item P.910 says you must report is a design decision you must make deliberately. If you cannot fill in the report, the test is not defensible — so design the test to make the report writable.
The rest of this article is that list, in build order.
The vocabulary: SRC, HRC, PVS
Three abbreviations organize every subjective test, and getting them straight makes everything downstream easier.
The source is the pristine, original clip before any processing — the reference. In the literature it is the SRC (source reference). It is the best the content will ever look; everything you do to it can only subtract quality.
The condition is a specific processing recipe you want to evaluate — a codec at a bitrate, a resolution, a packaging step, a network impairment. The standard name is the HRC (hypothetical reference circuit): "a fixed combination of a video encoder operating at a given bit-rate, a network condition, and a video decoder." An HRC is the thing under test.
The processed clip that a viewer actually watches and scores is the PVS (processed video sequence) — the result of running one SRC through one HRC. Every clip in the test is a PVS.
The whole test is therefore a grid: each source crossed with each condition produces one processed clip. Test six sources against eight conditions and you have a 6 × 8 grid of forty-eight clips to score. This SRC-by-HRC matrix is the skeleton of the design, and almost every design decision is about choosing its rows, its columns, and how viewers move through its cells.
Figure 1. The test is a matrix. Rows are source clips (SRC), columns are processing conditions (HRC), and every cell is one processed clip (PVS) a viewer rates. The full grid is the full-factorial design.
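As a minimal sketch of that grid, the snippet below enumerates every cell with Python's `itertools.product`. The clip and condition labels are hypothetical placeholders, chosen only to match the six-by-eight example above.

```python
from itertools import product

# Hypothetical labels: six sources and eight conditions.
sources = [f"SRC{i:02d}" for i in range(1, 7)]
conditions = [f"HRC{j:02d}" for j in range(1, 9)]

# Full-factorial design: every source passes through every condition,
# and each (SRC, HRC) pair yields exactly one PVS.
pvs_grid = [
    {"pvs": f"{src}_{hrc}", "src": src, "hrc": hrc}
    for src, hrc in product(sources, conditions)
]

print(len(pvs_grid))       # 48
print(pvs_grid[0]["pvs"])  # SRC01_HRC01
```

Keeping the SRC and HRC as separate fields on every PVS record is what later lets a randomized vote be mapped back to its row and column.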
Choosing the source set: the ceiling you can never raise
The source clips set the ceiling on what the test can see, and a biased source set is the most common way a test quietly fails. ITU-T P.910 devotes its entire Section 7 to source stimuli, and the guidance reduces to one idea: the sources must be representative of the content the system will really carry, and varied enough to expose the system's weak spots.
Start with content diversity. P.910 warns against a source set that is too narrow in subject matter (§7.2.3) and explicitly cautions against convenience sampling — grabbing whatever footage is on hand because it is easy (§7.2.7). The trap is concrete: a video conferencing codec tested only on a single talking head will look excellent, then fall apart in production the first time someone shares a fast-scrolling screen or stands in front of a moving background. You did not test the hard case, so you never saw the failure.
Then coding complexity. P.910 calls out that sources differ in how hard they are to compress (§7.2.2). A static interview compresses to nothing; grass blowing in wind, confetti, water, and fast pans are murder for any encoder. A source set that omits the hard content reports an encoder quality that production will never see.
P.910 turns "varied" into something you can measure. Section 7.8 defines two numbers that place any clip on a map of difficulty. Spatial information (SI) measures how much fine detail a frame holds: each frame's luminance (its brightness channel) is run through a Sobel filter, which is an edge detector, and the standard deviation of the filtered result across the frame is the SI. More edges and texture mean higher SI. Temporal information (TI) measures how much the picture changes frame to frame: it is the standard deviation of the pixel-by-pixel difference between consecutive frames, so fast motion and scene cuts push TI up (ITU-T P.910 §7.8, Annex B).
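The two measures can be sketched in pure Python on tiny toy frames. This is an illustration under stated assumptions, not the normative algorithm: it takes the per-frame standard deviation and then the maximum over the clip (a common formulation), it skips border pixels, and it ignores bit-depth and range handling, so check P.910 Annex B before relying on the numbers. The function names and the toy frames are hypothetical.

```python
from math import hypot
from statistics import pstdev

def sobel_magnitude(frame):
    """Gradient magnitude at every interior pixel of a 2-D luminance frame."""
    height, width = len(frame), len(frame[0])
    out = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (frame[y-1][x+1] + 2 * frame[y][x+1] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2 * frame[y][x-1] - frame[y+1][x-1])
            gy = (frame[y+1][x-1] + 2 * frame[y+1][x] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2 * frame[y-1][x] - frame[y-1][x+1])
            out.append(hypot(gx, gy))
    return out

def spatial_information(frames):
    """Largest per-frame spread of the Sobel-filtered luminance."""
    return max(pstdev(sobel_magnitude(f)) for f in frames)

def temporal_information(frames):
    """Largest spread of the frame-to-frame pixel difference (needs >= 2 frames)."""
    spreads = []
    for prev, cur in zip(frames, frames[1:]):
        diffs = [c - p
                 for row_p, row_c in zip(prev, cur)
                 for p, c in zip(row_p, row_c)]
        spreads.append(pstdev(diffs))
    return max(spreads)

# Toy 4x6 frames: a flat grey field, a vertical edge, and the edge moved one pixel.
flat = [[100] * 6 for _ in range(4)]
edge = [[0, 0, 0, 255, 255, 255] for _ in range(4)]
shifted = [[0, 0, 0, 0, 255, 255] for _ in range(4)]

print(spatial_information([flat]))          # no edges, so no spread
print(spatial_information([flat, edge]))    # the edge frame dominates
print(temporal_information([edge, shifted]))
```

In practice the frames would come from a decoded luminance plane rather than hand-written lists; the point here is only the order of operations: filter, take the spread within a frame, then summarize over time.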
Plot every candidate source as a point on a graph with TI on one axis and SI on the other, and you can see your coverage at a glance. A good source set spreads across the plane — low and high detail, slow and fast motion. A bad set clumps in one corner, usually low-SI, low-TI (easy content), because that is what was convenient to shoot. The fix is to choose sources that fill the plane, deliberately including the high-motion, high-detail corner where encoders struggle.
Figure 2. Source coverage on the SI/TI plane. A defensible source set (green) spreads across detail and motion; a convenience-sampled set (orange) clusters in the easy corner and never tests the hard content.
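One rough way to check coverage without a plot is to split the plane into four quadrants and list the ones no source falls into. The sketch below assumes you already have one (SI, TI) pair per clip; the split values and every measurement shown are hypothetical, and a real test would pick the splits from its own content range.

```python
def quadrant(si, ti, si_split, ti_split):
    """Label a clip by which side of each split it falls on."""
    detail = "high-SI" if si >= si_split else "low-SI"
    motion = "high-TI" if ti >= ti_split else "low-TI"
    return f"{detail}/{motion}"

def missing_quadrants(clips, si_split, ti_split):
    """Return the quadrants of the SI/TI plane that no clip covers."""
    all_quadrants = {
        f"{d}/{m}"
        for d in ("low-SI", "high-SI")
        for m in ("low-TI", "high-TI")
    }
    covered = {quadrant(si, ti, si_split, ti_split) for si, ti in clips.values()}
    return sorted(all_quadrants - covered)

# Hypothetical (SI, TI) measurements for two candidate source sets.
convenience_set = {"interview": (35, 6), "news_desk": (40, 8), "lecture": (30, 5)}
spread_set = {
    "interview": (35, 6),
    "confetti": (110, 45),
    "slow_pan_city": (120, 10),
    "handheld_fog": (40, 50),
}

print(missing_quadrants(convenience_set, si_split=75, ti_split=25))
# ['high-SI/high-TI', 'high-SI/low-TI', 'low-SI/high-TI']
print(missing_quadrants(spread_set, si_split=75, ti_split=25))
# []
```

A quadrant count is cruder than looking at the scatter plot, but it turns "clumps in one corner" into something a script can flag before the panel is booked.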
Two more source rules round out the section. Clips should be short — P.910's duration guidance centers on roughly ten seconds per stimulus (§7.6), long enough to judge, short enough to keep the test moving. And the source must be genuinely pristine: any compression, scaling, or noise already baked into the "original" becomes an invisible part of every score, because the viewer is comparing against a reference that was already damaged. The number of distinct sources matters too (§7.7); a handful is enough to generalize only if they are well spread, which is exactly what the SI/TI map is for.
Building the impairment matrix: span the range, balance the cells
With sources chosen, you pick the conditions — the HRCs — and how to cross them with the sources. Two decisions decide whether the matrix is sound.
The first is spanning the quality range. A frequent mistake is to cluster every condition near the top, because the team only cares about high quality. But a panel calibrates its use of the rating scale to the range of quality it sees. If every clip is between "good" and "excellent," viewers compress their votes into the top of the scale, and the differences you care about vanish into rounding. Include conditions that span from clearly bad to near-transparent, even if you only care about the top, so the scale stays stretched and the top-end differences stay visible. Anchor conditions at both ends of the range — a deliberately awful encode and a near-pristine one — give the panel something to calibrate against.
The second is balance. The clean design is full-factorial: every source passes through every condition, so each HRC is judged on the same content as every other, and no condition gets an unfair advantage from easy footage. The matrix in Figure 1 is full-factorial. When the full grid is too large to run — a real risk, as the next section shows — you move to a fractional design, where not every source meets every condition, and you must report that choice and randomize it so it does not bias any one HRC. P.910 also describes an unrepeated scene design (§11.2), where each source appears just once across the whole test, trading the within-source comparison for far broader content coverage; it is a legitimate choice, but a different one, and you must declare it.
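A fractional design can be sketched as follows, assuming the simplest balance goal: every HRC meets the same number of distinct sources, and every source is used equally often. The function name, the cycling scheme, and the seed are hypothetical choices for illustration; this is one way to build such a design, not a procedure taken from P.910.

```python
import random
from collections import Counter

def fractional_design(sources, conditions, per_condition, seed):
    """Give each HRC `per_condition` distinct sources by walking a shuffled
    source list in a cycle, so every source is used equally often."""
    if per_condition > len(sources):
        raise ValueError("an HRC cannot meet more sources than exist")
    if (len(conditions) * per_condition) % len(sources) != 0:
        raise ValueError("cell count must divide evenly across sources")
    rng = random.Random(seed)  # record the seed so the design can be reported
    order = list(sources)
    rng.shuffle(order)         # randomize which sources land on which HRC
    cells, position = [], 0
    for hrc in conditions:
        for _ in range(per_condition):
            cells.append((order[position % len(order)], hrc))
            position += 1
    return cells

sources = [f"SRC{i:02d}" for i in range(1, 7)]
conditions = [f"HRC{j:02d}" for j in range(1, 9)]
cells = fractional_design(sources, conditions, per_condition=3, seed=2024)

print(len(cells))                               # 24 cells instead of 48
print(Counter(src for src, _ in cells))         # each source appears 4 times
print(Counter(hrc for _, hrc in cells))         # each HRC appears 3 times
```

Equal counts are only the first check. Which sources each HRC met still has to be reported, because an HRC that happened to draw the easy clips would look better than it is.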
Worked example: sizing the matrix against the fatigue ceiling
Here is the arithmetic that turns a matrix into a schedule, and it is the single most useful calculation in test design. Suppose you want to compare two encoders at four bitrates each — eight conditions — and you choose six source clips that span the SI/TI plane.
The matrix is six sources times eight conditions:
PVS = SRC × HRC = 6 × 8 = 48 processed clips
You pick Absolute Category Rating (ACR), the single-stimulus method from the methods article: one clip, one vote. Each clip runs about ten seconds, and a viewer needs roughly five seconds to register a vote, so each rating consumes about fifteen seconds:
voting time = 48 clips × 15 s = 720 s = 12 minutes
Twelve minutes of voting fits inside a single session. But the session is not only voting. ITU-R BT.500-15 caps a session at about thirty minutes to control fatigue, and it requires about five "dummy" stabilizing clips at the start of the first session whose votes are discarded (ITU-R BT.500-15). Add a short instruction and training phase, the five stabilizing clips, and a mid-session breather, and the twelve minutes of voting becomes roughly twenty to twenty-two minutes of session — comfortably under the ceiling. One session works.
Now watch the ceiling approach. Suppose the comparison grows: two more codecs and a finer bitrate ladder take you to sixteen conditions on the same six sources.
PVS = 6 × 16 = 96 clips
voting time = 96 × 15 s = 1,440 s = 24 minutes
Twenty-four minutes of voting plus the training and stabilizing overhead, before any mid-session breather, lands near twenty-seven minutes: still one session, but with almost no room left for a break. Push one step further and broaden the content to eight sources:
PVS = 8 × 16 = 128 clips
voting time = 128 × 15 s = 1,920 s = 32 minutes
Thirty-two minutes of voting alone clears the thirty-minute ceiling before you add a single training clip. Now you must split the test into two balanced sessions with a break between, and re-anchor at the start of the second, because a fresh session needs fresh calibration. The method matters as much as the matrix: a double-stimulus method shows two clips per judgment, so each judgment takes roughly twenty-five seconds (two ten-second clips plus about five seconds to vote) instead of fifteen. At that rate even the original forty-eight-clip matrix would have run to about twenty minutes of voting (48 × 25 s = 1,200 s), and this 128-clip one to about fifty-three minutes (128 × 25 s = 3,200 s), nearly an hour. The method, the matrix, and the session limit are one coupled decision, not three separate ones.
This is why sizing the matrix is the design lever. Every source and every condition you add is real viewer-minutes, and the thirty-minute fatigue ceiling is hard. The downloadable design worksheet at the end runs this calculation for your own numbers and tells you how many sessions you need and where you cross the line.
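The arithmetic of this section fits in a few lines. The sketch below uses the figures from the worked example (ten-second clips, five seconds to vote, a thirty-minute ceiling); the three-minute per-session overhead is a placeholder assumption for training and stabilizing clips and excludes any break, and `size_test` is a hypothetical helper name.

```python
from math import ceil

def size_test(n_src, n_hrc, clip_s=10, vote_s=5,
              overhead_min=3.0, ceiling_min=30.0):
    """Return (clips, voting minutes, sessions needed) for an SRC x HRC grid."""
    clips = n_src * n_hrc
    voting_min = clips * (clip_s + vote_s) / 60
    budget_min = ceiling_min - overhead_min  # voting minutes available per session
    sessions = ceil(voting_min / budget_min)
    return clips, voting_min, sessions

for n_src, n_hrc in [(6, 8), (6, 16), (8, 16)]:
    clips, voting_min, sessions = size_test(n_src, n_hrc)
    print(f"{n_src} x {n_hrc}: {clips} clips, "
          f"{voting_min:.0f} min voting, {sessions} session(s)")
# 6 x 8: 48 clips, 12 min voting, 1 session(s)
# 6 x 16: 96 clips, 24 min voting, 1 session(s)
# 8 x 16: 128 clips, 32 min voting, 2 session(s)
```

Calling `size_test(6, 8, clip_s=20)` approximates a double-stimulus method in which each judgment plays two ten-second clips, which gives 20 minutes of voting for the same 48-cell matrix. Replace the timing defaults with pilot-study measurements before trusting the session count.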
Viewing conditions: control the room or measure the room
If viewers watch under different conditions, you have measured the conditions, not the video. ITU-T P.910 §9 and ITU-R BT.500-15 specify the environment precisely, and the precision is the point: a controlled environment is what lets you attribute a score difference to the codec rather than the lighting.
The first decision is controlled versus uncontrolled (P.910 §9.1). A controlled (lab) environment fixes the display, the lighting, and the viewing distance for every subject. An uncontrolled environment — the viewer's own device, wherever they are — trades that control for realism and scale, and is the domain of crowdsourced testing under ITU-T P.808 (covered in crowdsourced subjective testing). The two are not interchangeable: a lab test answers "which encode is better under ideal viewing," a crowd test answers "which encode is better in the wild," and you must say which question you asked.
For a controlled test, BT.500-15 fixes the room. The display's peak luminance sits between 70 and 500 cd/m². The ratio of the background behind the screen to the screen's peak luminance is about 0.15, and the screen illuminance in a home-like environment is around 200 lux (ITU-R BT.500-15). The display must be calibrated, not factory-default, because an uncalibrated screen bends every color and contrast judgment.
Viewing distance is its own decision (P.910 §9.3). BT.500 defines distance as a multiple of picture height, and lets you pick between the preferred viewing distance (PVD) — where viewers naturally sit — and the design viewing distance (DVD), the geometric distance at which the eye just resolves the pixel grid. The choice depends on the question: PVD for "how will real viewers experience it," DVD for "how good can this look at the limit of acuity." Whichever you pick, fix it for everyone and report it; distance changes which artifacts are visible, so a floating viewing distance silently adds noise to every score.
Figure 3. The controlled viewing environment from BT.500-15: fixed viewing distance in picture heights, calibrated display at 70–500 cd/m² peak, ~200 lux room light, and a background at ~0.15 of peak. Fix these or you measure them.
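A per-room record can be checked against the figures quoted above before any subject sits down. This is a sketch only: the field names and both tolerances are hypothetical, and because the text gives the ratio and the illuminance as approximate values, the acceptable band is something each lab has to set and report for itself.

```python
# Hypothetical tolerances around the approximate targets quoted in the text.
RATIO_TOLERANCE = 0.05
LUX_TOLERANCE = 50

def check_environment(env):
    """Return a list of problems; an empty list means the record passes this sketch."""
    problems = []
    peak = env["peak_luminance_cd_m2"]
    if not 70 <= peak <= 500:
        problems.append("peak luminance outside 70-500 cd/m2")
    ratio = env["background_luminance_cd_m2"] / peak
    if abs(ratio - 0.15) > RATIO_TOLERANCE:
        problems.append(f"background/peak ratio {ratio:.2f} is far from 0.15")
    if abs(env["screen_illuminance_lux"] - 200) > LUX_TOLERANCE:
        problems.append("screen illuminance is far from 200 lux")
    if not env["display_calibrated"]:
        problems.append("display is not calibrated")
    return problems

lab_room = {
    "peak_luminance_cd_m2": 200,
    "background_luminance_cd_m2": 30,
    "screen_illuminance_lux": 200,
    "display_calibrated": True,
}
phone_outdoors = {
    "peak_luminance_cd_m2": 900,
    "background_luminance_cd_m2": 450,
    "screen_illuminance_lux": 10000,
    "display_calibrated": False,
}

print(check_environment(lab_room))             # []
print(len(check_environment(phone_outdoors)))  # 4
```

Storing the measured values alongside each subject's votes also means the environment section of the report can be generated rather than reconstructed from memory.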
Randomization and order: defeat the anchoring effect
Show a mediocre clip right after a terrible one and it looks better than it is; show it right after a flawless one and it looks worse. This is the contextual, or anchoring, effect, and it is strong enough to swamp the difference between two encoders if you let presentation order line up with condition. Randomization is the defense.
ITU-T P.910 §12.7.4 requires that the order of presentation be randomized, and that the randomization be undone in analysis so the design stays balanced. The standard practice, used in the VQEG test plans, is a Latin-square or Graeco-Latin-square ordering: a structured randomization that guarantees each condition appears in each serial position roughly equally across subjects, so no condition is systematically advantaged by always following an easy or hard clip. At least two different randomized presentation orders are created, and subjects are split roughly evenly across them, so any residual order effect cancels.
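The simplest Latin square is the cyclic one, sketched below for four hypothetical conditions. It guarantees that each condition occupies each serial position exactly once across the rows. It is an illustration of the idea rather than the construction used in any particular VQEG plan.

```python
def cyclic_latin_square(items):
    """Row r is the item list rotated left by r positions."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

conditions = ["HRC1", "HRC2", "HRC3", "HRC4"]
orders = cyclic_latin_square(conditions)

def order_for_subject(subject_index):
    """Cycle subjects through the rows so the rows are used about equally."""
    return orders[subject_index % len(orders)]

for row in orders:
    print(row)
# ['HRC1', 'HRC2', 'HRC3', 'HRC4']
# ['HRC2', 'HRC3', 'HRC4', 'HRC1']
# ['HRC3', 'HRC4', 'HRC1', 'HRC2']
# ['HRC4', 'HRC1', 'HRC2', 'HRC3']
```

One limitation to keep in mind: in a cyclic square each condition is almost always preceded by the same neighbour, so it balances serial position but not carry-over from the previous clip. Shuffling the condition labels per test, or using a carry-over-balanced square, addresses that.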
Two practical rules fall out of this. First, never show the same source twice in a row, even under different conditions — the viewer remembers the content and rates the comparison instead of the clip. Second, randomize per subject or per small group, not once for the whole panel, so a single unlucky ordering cannot bias the whole result. The randomization is bookkeeping you must keep, because at analysis time you map each vote back to its true SRC and HRC; the viewer saw chaos, the spreadsheet sees the clean matrix.
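Both rules can be sketched together: a per-subject shuffle that rejects any order with the same source back to back, plus a log that keeps the true SRC and HRC beside each serial position. The function names are hypothetical, the small grid is for illustration, and rejection sampling is an assumption of convenience that slows down as each source repeats more often; a larger matrix may need a constructive method instead.

```python
import random
from itertools import product

def presentation_order(cells, subject_id, max_tries=10_000):
    """Shuffle (src, hrc) cells until no two neighbours share a source."""
    rng = random.Random(subject_id)  # seeded per subject, so the order is reproducible
    order = list(cells)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(a[0] != b[0] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; relax the rule or add sources")

def presentation_log(cells, subject_id):
    """Rows of (subject, serial position, src, hrc) for mapping votes back later."""
    order = presentation_order(cells, subject_id)
    return [(subject_id, pos, src, hrc) for pos, (src, hrc) in enumerate(order, 1)]

# Hypothetical small grid: four sources by three conditions.
cells = list(product(["SRC1", "SRC2", "SRC3", "SRC4"], ["HRC1", "HRC2", "HRC3"]))
log = presentation_log(cells, subject_id=7)

print(len(log))  # 12 presentations
print(log[0])    # first clip this subject sees, with its true SRC and HRC
```

Because the order is derived from the subject identifier, the log can be regenerated at analysis time even if the original file is lost, provided the cell list and the code are unchanged.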
Training, anchors, and stabilization: calibrate before you count
A viewer's first few votes are unreliable, because they are still learning what the scale means and what the range of quality is. Three design elements fix this, and skipping them is a quiet way to corrupt the early data.
Training comes first (P.910 §12.5). Before any scored clip, the viewer is shown the instructions, a demonstration of the voting mechanism, and a set of training clips that span the full quality range — from the worst the test will show to the best — so they calibrate their internal scale before it counts. The critical rule: the training clips must use different source content from the test, or the viewer arrives at the real clips already primed on that footage, leaking information into the scores. P.910's appendices even provide sample instructions and a sample consent form to standardize this phase.
Anchors are the explicit best-and-worst examples inside that training set. By showing a deliberately awful encode and a near-pristine one and naming them as the ends of the scale, you give every viewer the same reference points for "1" and "5," which tightens agreement between subjects and makes the resulting MOS comparable across the panel.
Stabilizing presentations handle the residual settling even after training. ITU-R BT.500-15 calls for about five "dummy" clips at the very start of the first session, spanning the quality range, whose votes are collected for realism but discarded from the results (ITU-R BT.500-15). They absorb the last of the scale-settling drift so the first counted vote is already stable. When a test is split across sessions, re-anchor at the start of each one — a viewer returning from a break has lost some of their calibration.
Figure 4. The shape of a session. Instructions and full-range training (on different content) come first, then ~5 discarded stabilizing clips, then the randomized scored block — all inside the ~30-minute fatigue limit, with re-anchoring after any break.
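The session shape can be written down as data, which makes two of the rules above checkable: training content must not overlap the test sources, and stabilizing votes must not reach the results. The sketch below is a minimal illustration with hypothetical clip labels and helper names.

```python
def build_session(training_clips, stabilizing_clips, scored_clips):
    """Tag every presentation with its phase, in playback order."""
    return ([("training", clip) for clip in training_clips]
            + [("stabilizing", clip) for clip in stabilizing_clips]
            + [("scored", clip) for clip in scored_clips])

def training_leaks(training_clips, scored_clips):
    """Sources that appear in both training and the scored block (should be empty)."""
    return {src for src, _ in training_clips} & {src for src, _ in scored_clips}

def counted_votes(session, votes):
    """Keep only votes cast on scored clips. `votes` has one entry per
    presentation, with None where no vote was collected."""
    return [(clip, vote)
            for (phase, clip), vote in zip(session, votes)
            if phase == "scored"]

# Hypothetical clips as (source, condition) pairs.
training = [("TRAIN_A", "worst"), ("TRAIN_B", "middle"), ("TRAIN_C", "best")]
stabilizing = [(f"STAB_{i}", "mixed") for i in range(1, 6)]  # five dummy clips
scored = [("SRC1", "HRC1"), ("SRC2", "HRC2")]

session = build_session(training, stabilizing, scored)
votes = [None, None, None, 2, 4, 1, 5, 3, 4, 2]

print(training_leaks(training, scored))  # set()
print(counted_votes(session, votes))
# [(('SRC1', 'HRC1'), 4), (('SRC2', 'HRC2'), 2)]
```

Tagging the phase at build time, rather than slicing off "the first five votes" afterward, avoids an off-by-one that would either discard real data or count a stabilizing vote.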
Session length and fatigue: the thirty-minute ceiling
Tired viewers give worse data — more variable, and biased downward as attention fades. Both standards treat fatigue as a hard design constraint, not a comfort issue. ITU-R BT.500-15 limits a single session to about thirty minutes, and ITU-T P.910 §11.1 frames the size of the experiment explicitly in terms of subject fatigue, with §12.6 covering session and break structure.
The practical consequences are three. A matrix whose voting time pushes past the session limit must be split into multiple balanced sessions, with each condition spread evenly across sessions so no HRC is concentrated in the fatigued tail of a session. Each session beyond the first needs its own re-anchoring, because calibration decays across a break. And the per-clip timing assumptions — ten seconds of content, five seconds to vote — should be confirmed with a pilot study (P.910 §11.8) before committing the panel, because real per-clip time varies with content, instructions, and platform, and a wrong estimate is how a test that looked like one session becomes two halfway through.
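Spreading each condition evenly across sessions can be sketched as a round-robin deal. The version below is one possible scheme under simple assumptions (a full-factorial grid and a hypothetical `split_into_sessions` helper); it decides only which session each clip belongs to, and the order inside each session still needs its own randomization.

```python
from collections import Counter, defaultdict
from itertools import product

def split_into_sessions(cells, n_sessions):
    """Deal each HRC's clips round-robin across sessions, shifting the start
    per HRC so the sources are spread across sessions too."""
    by_hrc = defaultdict(list)
    for src, hrc in cells:
        by_hrc[hrc].append((src, hrc))
    sessions = [[] for _ in range(n_sessions)]
    for offset, (hrc, clips) in enumerate(sorted(by_hrc.items())):
        for i, cell in enumerate(clips):
            sessions[(i + offset) % n_sessions].append(cell)
    return sessions

# The 8 x 16 grid from the worked example, with hypothetical labels.
sources = [f"SRC{i:02d}" for i in range(1, 9)]
conditions = [f"HRC{j:02d}" for j in range(1, 17)]
cells = list(product(sources, conditions))

first, second = split_into_sessions(cells, n_sessions=2)

print(len(first), len(second))                         # 64 64
print(set(Counter(h for _, h in first).values()))      # {4}: every HRC, 4 clips
print(set(Counter(s for s, _ in first).values()))      # {8}: every SRC, 8 clips
```

With this grid each session carries four clips of every condition and eight of every source, so neither a codec nor a piece of content is concentrated in one sitting.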
Subjects: how many, and how to screen them
The panel size and screening are the last design decisions, and the temptation to fudge them after the fact is strong — resist it.
ITU-T P.910 §10.1 sets the floor: at least fifteen valid subjects per condition in a controlled test. More subjects narrow the confidence interval; the methods article noted that ACR's smallest reliably resolvable MOS difference tightens from about 1.5 points at six subjects to about 0.5 at twenty-four (P.910 §8.1.1), which is why twenty-four is a common target when the differences are small. In an uncontrolled or public environment, P.910 notes you need roughly thirty-five subjects to reach the same statistical sensitivity as the controlled fifteen, because the extra environmental noise has to be averaged out.
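The way panel size narrows the interval can be sketched with the textbook normal approximation, where the half-width of a 95% interval shrinks with the square root of the number of subjects. The 0.8-point standard deviation is a hypothetical value, a small panel would normally use a Student t quantile instead, and the output is not meant to reproduce the P.910 figures quoted above.

```python
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile, about 1.96

def ci_half_width(score_sd, n_subjects):
    """Half-width of a 95% confidence interval on a mean of n ratings."""
    return Z_95 * score_sd / sqrt(n_subjects)

# Hypothetical per-clip rating spread of 0.8 MOS points.
for n in (6, 15, 24, 35):
    print(n, round(ci_half_width(0.8, n), 2))
```

The square-root relationship is the practical lesson: quadrupling the panel only halves the interval, which is why the subject count has to be budgeted at design time rather than topped up later.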
Screening comes in two stages. Pre-screening checks each subject's vision before they start — visual acuity (a Snellen-style chart) and color vision (an Ishihara-style test) — so a viewer who cannot resolve the detail under test is not silently adding noise (P.910 §12.2–12.3). Post-screening runs after the data is in, rejecting subjects whose ratings are too inconsistent with the panel to be trusted; P.910 Annex A gives a method based on the Pearson correlation between each subject and the panel mean, and §13.6 describes a bias-subtracted, consistency-weighted refinement. The discipline that matters: decide the screening rules and the target subject count before you see the scores. Choosing the panel size or the rejection threshold after looking at the data, to push a result over the significance line, is the difference between a measurement and a rationalization. The statistics of all this — confidence intervals, outlier rejection, significance testing — are the subject of the statistics of subjective data.
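A correlation-based post-screen can be sketched as below. It is a simplified illustration of the idea, not the exact Annex A procedure: the panel mean here includes the subject being checked, and the 0.75 threshold is a hypothetical value that, per the discipline above, would be fixed before any scores are seen. The ratings are invented.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant rater cannot correlate with anything
    return cov / (sx * sy)

def screen_subjects(ratings, threshold):
    """ratings maps subject -> scores, one per PVS in the same order.
    Returns subject -> True if the subject is kept."""
    n_clips = len(next(iter(ratings.values())))
    panel_mean = [mean(r[i] for r in ratings.values()) for i in range(n_clips)]
    return {s: pearson(r, panel_mean) >= threshold for s, r in ratings.items()}

THRESHOLD = 0.75  # hypothetical, decided before the data is collected

ratings = {
    "s1": [1, 2, 3, 4, 5, 2],
    "s2": [1, 3, 3, 5, 5, 2],
    "s3": [2, 2, 4, 4, 5, 1],
    "s4": [5, 4, 3, 2, 1, 4],  # rates against the panel
}

print(screen_subjects(ratings, THRESHOLD))
# {'s1': True, 's2': True, 's3': True, 's4': False}
```

Whatever rule is used, the rejected subjects and the reason for each rejection belong in the report, so a reader can see that the rule was applied rather than tuned.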
The design checklist: P.910 §14 in reverse
Pull it together and the "survives scrutiny" test is mechanical: can you write the report ITU-T P.910 §14 requires? Each line is a design decision made on purpose.
Figure 5. The design checklist. Each row is something P.910 §14 requires you to report — which means each is a design decision to make deliberately, not a box to fill in afterward.
The seven decisions, with the question each answers and the clause behind it:
| Design decision | The scrutiny question it answers | Where it lives |
|---|---|---|
| Source set (SRC) | "Is this content representative, or just what was on hand?" | P.910 §7; SI/TI §7.8 |
| Impairment matrix (HRC) | "Do the conditions span the range, and is the matrix balanced?" | P.910 §11.2 |
| Viewing conditions | "Did everyone watch under the same controlled conditions?" | P.910 §9; BT.500-15 |
| Randomization | "Could presentation order have biased a condition?" | P.910 §12.7.4 |
| Training & anchors | "Were viewers calibrated before their votes counted?" | P.910 §12.5; BT.500-15 |
| Subjects & screening | "Enough valid viewers, screened by a pre-set rule?" | P.910 §10.1, §12.2–12.4 |
| Session & fatigue | "Were sessions short enough to keep attention?" | P.910 §11.1; BT.500-15 |
If every row has a deliberate answer you can defend, the test survives. If any row was left to chance or convenience, that is where it will be taken apart.
The common mistakes
These six recur often enough to name, and each maps to a row above.
Mistake 1 — a convenience source set. Testing only the footage that was easy to get (usually low-motion, low-detail) reports a quality that production never sees. Spread the sources across the SI/TI plane (P.910 §7.8), and include the hard content on purpose.
Mistake 2 — clustering all conditions at the top. When every clip is "good to excellent," viewers compress their votes and the top-end differences disappear. Span the full range and anchor both ends, even if you only care about the top.
Mistake 3 — uncontrolled or mixed viewing conditions. Half the panel on a calibrated monitor and half on a laptop in sunlight measures the room, not the codec. Fix the display, lighting, and distance (BT.500-15), or run an explicit P.808 crowdsourced test and call it one.
Mistake 4 — a fixed presentation order. If condition and order line up, the anchoring effect masquerades as a quality difference. Randomize per subject with a Latin square and balance across at least two orders (P.910 §12.7.4).
Mistake 5 — choosing the subject count after seeing the data. Picking N or the rejection threshold to push a result over the significance line is rationalization, not measurement. Fix the count (≥15 valid, P.910 §10.1) and the screening rule before you look.
Mistake 6 — training on the test content. Reusing source clips in the training phase leaks information into the scores. Train on different content that spans the same quality range (P.910 §12.5).
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the discipline in this article is what we run before we let a subjective number into a client decision. When we evaluate an encoder for a conferencing product, we build the source set deliberately across the SI/TI plane, because a talking-head-only panel would hide exactly the screen-share and motion cases that break conferencing video. When we test a surveillance pipeline, we fix the viewing distance and lighting to the control-room reality rather than an idealized lab, because that is where the footage is actually judged. And when we publish a quality result, we report the full §14 design — sources, conditions, environment, panel, screening, and confidence intervals — the same way we document our benchmark methodology, so the number holds up in the meeting where it matters.
What to read next
- Test methodologies: ACR, DCR, PC, and the ITU recommendations
- Running the test: participants, screening, and environment
- The statistics of subjective data
References
- Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, approved 29 October 2023. Tier 1 (official standard). The controlling recommendation: source stimuli and content selection (§7), coding complexity (§7.2.2), subject matter (§7.2.3), convenience-sampling caution (§7.2.7), stimulus duration (§7.6), number of sources (§7.7), SI/TI scene-selection metrics and the Sobel filter (§7.8, Annex B); environment, viewing distance, and viewing conditions (§9); number of subjects (§10.1); experiment size and fatigue (§11.1), unrepeated-scene designs (§11.2), pilot study (§11.8); instructions and training (§12.5), sessions and breaks (§12.6), stimuli randomization (§12.7.4), screening (§12.2–12.4, Annex A); mandatory reporting (§14). https://www.itu.int/rec/T-REC-P.910-202310-I/en
- Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector, approved 28 May 2023. Tier 1 (official standard). Source of the controlled-environment specification (peak luminance 70–500 cd/m², background-to-peak ratio ~0.15, ~200 lux home illuminance), the preferred-vs-design viewing-distance choice in picture heights, the ~30-minute session limit, and the ~5 discarded stabilizing presentations at the start of a session. https://www.itu.int/rec/R-REC-BT.500
- Recommendation ITU-T P.808 (06/2021), "Subjective evaluation of speech quality with a crowdsourcing approach," and its video extension work, International Telecommunication Union. Tier 1 (official standard). The framework for uncontrolled, crowdsourced subjective testing referenced as the alternative to a controlled-environment design; the boundary between this article's lab design and a crowd design. https://www.itu.int/rec/T-REC-P.808
- Recommendation ITU-T P.913 (deletion notice), International Telecommunication Union; recommendation deleted 2 February 2024, content incorporated into ITU-T P.910 (10/2023). Tier 1 (official standard status). Recency anchor: the "any environment" subjective methods and the former P.911 are now consolidated into P.910 (10/2023) — cite the current edition for all design guidance. https://www.itu.int/rec/T-REC-P.913
- M. H. Pinson, S. Wolf, "Comparing subjective video quality testing methodologies," Visual Communications and Image Processing (VCIP) 2003, SPIE vol. 5150, pp. 573–582. Tier 5 (peer-reviewed). Foundational comparison of test methodologies and the effect of design choices (including source and order) on the reliability of the resulting scores. https://its.ntia.gov/publications/download/spie03obj.pdf
- M. H. Pinson et al., "ITS4S: A Video Quality Dataset with Four-Second Unrepeated Scenes," NTIA Technical Memo TM-18-532, 2018. Tier 5 (institutional). The reference for unrepeated-scene experiment design (P.910 §11.2) and the trade between within-source comparison and broad content coverage. https://its.ntia.gov/publications/download/TM-18-532.pdf
- T. Hoßfeld, R. Schatz, S. Egger, "SOS: The MOS is not enough!" (the Standard deviation of Opinion Scores hypothesis), 3rd International Workshop on Quality of Multimedia Experience (QoMEX), 2011. Tier 5 (peer-reviewed). The relationship between MOS and rating spread that motivates spanning the full quality range and anchoring the scale rather than clustering conditions at the top. https://ieeexplore.ieee.org/document/6065690
- Video Quality Experts Group (VQEG), "HDTV Test Plan" (v3.1) and the FR-TV subjective test plans, VQEG. Tier 5 (institutional). The practical source of the Latin-square / Graeco-Latin-square randomization, the multiple-presentation-order balancing, and the source-selection and screening procedures used across the field. https://vqeg.org/media/5871/vqeg_hdtv_testplan_v3_1.doc
- S. Winkler, "Analysis of public image and video databases for quality assessment," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, 2012, pp. 616–625. Tier 5 (peer-reviewed). Survey of how source content and SI/TI coverage vary across the major public quality databases — evidence for designing source coverage deliberately rather than by convenience. https://ieeexplore.ieee.org/document/6285949


