Why this matters

Designing a subjective test and running one are two different skills, and a beautiful design dies fast in a badly run room. This article is for the engineer, QA lead, or researcher who has the design from the previous article and now has to put real people in front of real screens and come away with numbers that survive review. It is the execution half of the subjective-testing block. Design decisions live in "designing a subjective test that survives scrutiny"; the inferential statistics that follow (confidence intervals, significance, outlier rejection in depth) live in "the statistics of subjective data"; and the encoder-operator's one-screen version sits in the Video Encoding section's subjective-testing overview. Here we cover the run itself.

The four jobs of test day

Once the design is fixed (the source set, the impairment matrix, the method, the session plan), running it well reduces to four jobs done in order. You recruit a panel large enough to stay above the floor after some viewers are dropped. You screen each viewer's eyesight before they cast a single vote. You set the display and the room to a measured specification, not a vibe. And after the session you run a defined procedure to reject the observers whose votes were too erratic to trust. Skip any one and the Mean Opinion Score (the average rating the panel gave each clip, defined in full in "MOS, DMOS, and the rating scales") is contaminated in a way no later arithmetic can clean.
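As a minimal sketch of what "the average rating the panel gave each clip" means once invalid viewers are removed, the Python below averages votes per clip over the valid panel only. The data layout (viewer id mapped to clip id mapped to vote) and the names `votes`, `valid_viewers`, and `mos_per_clip` are assumptions made for illustration, not part of any recommendation.

```python
from statistics import mean


def mos_per_clip(votes, valid_viewers):
    """Average each clip's votes over the valid panel only."""
    clips = sorted({clip for v in valid_viewers for clip in votes[v]})
    return {
        clip: mean(votes[v][clip] for v in valid_viewers if clip in votes[v])
        for clip in clips
    }


# Hypothetical votes on a five-point scale.
votes = {
    "v1": {"clipA": 4, "clipB": 2},
    "v2": {"clipA": 5, "clipB": 3},
    "v3": {"clipA": 1, "clipB": 5},  # failed screening, so left out below
}
result = mos_per_clip(votes, valid_viewers=["v1", "v2"])
print(result)
```

The point of passing `valid_viewers` explicitly is that the panel behind the MOS is decided by the screening steps, not by who happened to vote.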

Figure 1. The run-day pipeline: recruit a buffered panel, pre-screen eyesight, calibrate the room, run the session, drop unreliable observers. Each stage protects the final number; the panel that produces the MOS is the valid panel left after pre-screening and post-screening, not everyone who showed up.

Recruiting the panel: how many people, and who

The headline number is small and often misread: at least fifteen valid observers. ITU-R BT.500-15 §2.5.1 states that, unless the method says otherwise, at least fifteen observers should be used, and that a test run with fewer must be labelled informal. ITU-T P.910 §10.1 sets the same floor of fifteen valid subjects for a controlled test.

The word that does the work is valid. Fifteen is the floor for viewers who pass both the eyesight screen at the start and the reliability screen at the end. Because post-screening can drop a viewer or two, you recruit a buffer above fifteen — a panel of eighteen to twenty-four is common — so a couple of rejections do not push you under the floor. The buffer is not padding; it is the whole reason the floor is stated in terms of valid observers rather than warm bodies.
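To see why the buffer matters, the sketch below estimates the chance that at least fifteen viewers remain valid for several panel sizes. It assumes each viewer is dropped independently with the same probability; the 10% drop rate and the function name `prob_panel_stays_valid` are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import comb


def prob_panel_stays_valid(recruited, p_drop, floor=15):
    """P(at least `floor` viewers survive) if each is dropped independently."""
    max_drops = recruited - floor
    if max_drops < 0:
        return 0.0
    # Binomial sum over every tolerable number of dropped viewers.
    return sum(
        comb(recruited, k) * p_drop**k * (1 - p_drop) ** (recruited - k)
        for k in range(max_drops + 1)
    )


for n in (15, 18, 20, 24):
    # p_drop=0.10 is an assumed rate, not a figure from BT.500 or P.910.
    print(n, round(prob_panel_stays_valid(n, p_drop=0.10), 3))
```

With exactly fifteen recruits, every single viewer has to survive both screens; each extra recruit adds one more tolerable rejection.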

More observers buy tighter results. The methods article showed that the smallest reliably resolvable difference in MOS shrinks as the panel grows, which is why twenty-four is a frequent target when the quality differences are small. And when the room is not controlled — a public or crowdsourced setting — P.910 notes you need roughly thirty-five subjects to reach the statistical sensitivity that fifteen give you in a lab, because the extra environmental noise has to be averaged out.

Who the viewers are matters as much as how many. BT.500-15 distinguishes expert observers, who know the artefacts the system under test can produce, from non-expert ("naive") observers, who do not; most quality tests want non-experts, because they stand in for real viewers (ITU-R BT.500-15 §2.5). Critically, observers must not have been involved in developing the system under test, or they bring knowledge that biases their scores. BT.500 also asks you to report the panel's makeup — an occupation category, gender, and age range — because a panel of twenty broadcast engineers and a panel of twenty office workers can rate the same clips differently, and the reader of your result needs to know which one you used.

Pre-screening: check the eyes before the test

A viewer who cannot resolve the detail under test, or who cannot see the colour error you are measuring, does not register as a bad data point — they register as noise, scattered through every score they cast. Screening removes that noise before it enters the data, and it takes five minutes per person.

Two checks, both standard. Visual acuity — how sharply a viewer sees fine detail — is measured on a Snellen or Landolt eye chart, the rows of shrinking letters or gapped rings you read at a fixed distance (ITU-R BT.500-15 §2.5.2). P.910's threshold is concrete: the viewer must make no error on the 20/30 line of a standard chart, with corrective glasses or contacts allowed (ITU-T P.910). Twenty-thirty means they can read at twenty feet what a normal eye reads at thirty — a mild bar that screens out viewers who would miss the fine artefacts the test is built to expose.

Colour vision is checked with an Ishihara test, the plates of coloured dots with a number hidden inside that a colour-blind viewer cannot read (ITU-R BT.500-15 §2.5.2). Roughly one man in twelve has some colour-vision deficiency (the rate among women is far lower), so an unscreened panel of twenty men would contain one or two such viewers on average (20 / 12 ≈ 1.7), and they will rate a colour-bleeding artefact as fine because they cannot see it. They are not bad viewers; they are simply measuring something different from the rest of the panel, and that difference becomes noise in the average.

Figure 2. Pre-screening. Acuity on a Snellen/Landolt chart (no error on the 20/30 line, glasses allowed) and colour vision on an Ishihara plate. A viewer who passes enters the panel; a viewer who fails is excluded before voting, not dropped afterward.

Record the result for every viewer, pass or fail, and keep the record. The screening method and its criteria are part of what BT.500 expects you to publish with the results, because a reader cannot judge a panel they cannot see. "We screened for normal vision" is not a method; "no error on the 20/30 Snellen line and a 24-plate Ishihara test, corrective lenses permitted" is.
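One way to keep that record is a small structure per viewer, stored whether they pass or fail. The sketch below is an assumption about how such a log might look; the class name `ScreeningRecord` and its field names are hypothetical, and the pass rule simply encodes the criteria quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningRecord:
    viewer_id: str
    errors_on_20_30_line: int  # Snellen/Landolt errors on the 20/30 line
    ishihara_passed: bool      # outcome of the colour-vision plates
    corrective_lenses: bool    # permitted; recorded for the report

    @property
    def passed(self) -> bool:
        # Criteria from the text: no error on the 20/30 line, normal colour vision.
        return self.errors_on_20_30_line == 0 and self.ishihara_passed


# Hypothetical viewers; failures are kept in the log, not deleted.
records = [
    ScreeningRecord("v01", 0, True, True),
    ScreeningRecord("v02", 2, True, False),
    ScreeningRecord("v03", 0, False, False),
]
admitted = [r.viewer_id for r in records if r.passed]
print(admitted)
```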

Setting up the environment: calibrate the room, not just the screen

If two viewers watch under different conditions, the difference in their scores partly measures the conditions, not the codec. The fix is to set the display and the room to a written specification and measure that you hit it. BT.500-15 gives two specifications, and choosing the right one is the first decision.

A laboratory environment is the critical-viewing setup: low room illumination, a background of D65 (the standard daylight white point) behind the screen, a display peak luminance between 70 and 250 cd/m², and the luminance of that background held at about 0.15 of the screen's peak (ITU-R BT.500-15 §2.1.1). A home environment trades some control for realism: about 200 lux of room light falling on the screen, measured perpendicular to it, and a display peak between 70 and 500 cd/m² (ITU-R BT.500-15 §2.1.2). The lab answers "how good can this look under ideal viewing"; the home setup answers "how good is it where people actually watch". Pick one deliberately and report which.
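The two specifications can be written down as data and checked against meter readings before the first viewer sits down. The sketch below does that with the figures quoted above; the 10% tolerance on the "about" targets, the class `EnvSpec`, and the function `check_environment` are assumptions for illustration, since the text does not state a tolerance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnvSpec:
    peak_min: float                     # display peak luminance, cd/m^2
    peak_max: float
    bg_ratio: Optional[float] = None    # background / peak target, if specified
    screen_lux: Optional[float] = None  # illuminance on the screen, if specified


LAB = EnvSpec(peak_min=70, peak_max=250, bg_ratio=0.15)
HOME = EnvSpec(peak_min=70, peak_max=500, screen_lux=200)

REL_TOL = 0.10  # assumed tolerance for the "about" targets; not from BT.500


def check_environment(spec, peak, background=None, screen_lux=None):
    """Return a list of deviations; an empty list means within spec."""
    problems = []
    if not (spec.peak_min <= peak <= spec.peak_max):
        problems.append(f"peak {peak} cd/m2 outside {spec.peak_min}-{spec.peak_max}")
    if spec.bg_ratio is not None:
        if background is None:
            problems.append("background luminance not measured")
        elif abs(background / peak - spec.bg_ratio) > REL_TOL * spec.bg_ratio:
            problems.append(f"background/peak ratio {background / peak:.3f} off target")
    if spec.screen_lux is not None:
        if screen_lux is None:
            problems.append("screen illuminance not measured")
        elif abs(screen_lux - spec.screen_lux) > REL_TOL * spec.screen_lux:
            problems.append(f"screen illuminance {screen_lux} lux off target")
    return problems


print(check_environment(LAB, peak=200, background=30))
print(check_environment(HOME, peak=600, screen_lux=205))
```

Treating an unmeasured value as a failure is deliberate: the specification is only met if you measured that you hit it.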

Either way, the display is calibrated, not factory-default. Brightness and contrast are set with a PLUGE signal — the Picture Line-Up Generation Equipment pattern, a set of reference black and near-black bars defined in ITU-R BT.814/BT.815 — adjusted under the room's actual illumination (ITU-R BT.500-15 §2.1.6). An uncalibrated screen bends every contrast and colour judgement the panel makes, so two labs running the same clips on the same model of monitor, one calibrated and one not, will disagree — and BT.500 explicitly warns that systematic differences between testing sites are real.

Figure 3. The measured room, with a lux meter at the screen. Lab: low light, peak 70–250 cd/m², background ≈ 0.15 × peak, D65. Home: ≈ 200 lux on the screen, peak 70–500 cd/m². Display set with a PLUGE pattern; distance fixed in picture heights.

The last setting is viewing distance, fixed for everyone and expressed as a multiple of picture height, because that is what determines which artefacts are visible. BT.500 offers two choices: the preferred viewing distance (PVD), where viewers naturally choose to sit, and the design viewing distance (DVD), the geometric distance at which two adjacent pixels just merge — where the eye is at the limit of the display's resolution (ITU-R BT.500-15 §2.1.3). The choice follows the question: PVD for realism, DVD for the hardest test of resolution. For a 4K (3840 × 2160) display, BT.500 places the resolution-critical distance in the range of about 1.6 to 3.2 picture heights, with the lower end used when the test is specifically about resolution. Mark the distance on the floor and seat every viewer there; a floating distance silently adds noise to every score.
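Turning picture heights into a mark on the floor is simple geometry. The sketch below assumes a 16:9 screen and takes the "pixels just merge" criterion to be one arcminute per pixel, which is a common reading of the design viewing distance but is stated here as an assumption; the function names and the 55-inch diagonal are illustrative.

```python
import math


def picture_height_m(diagonal_inches, aspect_w=16, aspect_h=9):
    """Picture height in metres from the screen diagonal and aspect ratio."""
    diag_m = diagonal_inches * 0.0254
    return diag_m * aspect_h / math.hypot(aspect_w, aspect_h)


def design_distance_in_heights(active_lines, pixel_arcmin=1.0):
    """Distance, in picture heights, at which one pixel subtends `pixel_arcmin`."""
    pixel_angle = math.radians(pixel_arcmin / 60.0)
    return 1.0 / (active_lines * math.tan(pixel_angle))


height = picture_height_m(55)             # hypothetical 55-inch display
dvd_h = design_distance_in_heights(2160)  # 3840 x 2160 panel
print(f"picture height: {height:.3f} m")
print(f"design distance: {dvd_h:.2f} H = {dvd_h * height:.2f} m")
```

Under that assumption a 2160-line display lands at roughly 1.6 picture heights, consistent with the lower end of the range quoted above.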

Running the session: instructions, dummies, and order

With viewers screened and the room set, the session itself follows a fixed shape, and the shape is not just the scored clips.

It opens with instructions and training. The viewer is told the method, shown the rating scale and the timing, and walked through training clips that span the full quality range — on different content from the test, so they do not arrive at the real clips pre-exposed (ITU-R BT.500-15 §2.5.3). Then comes a quiet trick that catches many first-time testers off guard: about five "dummy" presentations at the very start of the first session, spanning the range, whose votes are collected for realism and then thrown away. They absorb the last of the viewer's scale-settling so the first counted vote is already stable. If the test runs across several sessions, about three dummy presentations are enough at the start of each later session (ITU-R BT.500-15 §2.6).

The scored block runs in randomized order. The standard practice is a Graeco-Latin square — a structured randomization that spreads each condition evenly across serial positions, so no condition is systematically helped by always following an easy or hard clip — with the condition order balanced so that fatigue and adaptation effects cancel from session to session (ITU-R BT.500-15 §2.6). A useful extra: repeat a few presentations across sessions to check that a viewer rates the same clip consistently, a cheap internal coherence test.
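The idea of spreading each condition evenly across serial positions can be illustrated with a plain cyclic Latin square, which is simpler than the Graeco-Latin square named above and is used here only as a stand-in. The function name `cyclic_latin_square` and the condition labels are hypothetical.

```python
import random


def cyclic_latin_square(conditions, seed=None):
    """Rows are presentation orders; each condition appears once per serial position."""
    items = list(conditions)
    random.Random(seed).shuffle(items)  # randomize which condition starts the cycle
    n = len(items)
    return [[items[(row + pos) % n] for pos in range(n)] for row in range(n)]


conditions = ["ref", "2 Mbps", "4 Mbps", "8 Mbps"]
orders = cyclic_latin_square(conditions, seed=1)
for viewer_group, order in enumerate(orders):
    print(viewer_group, order)
```

A cyclic square balances serial position but not which clip precedes which, so it does not by itself remove the "always following an easy clip" effect; that is the extra balance the more structured designs are used for.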

And the whole thing stays under about thirty minutes of session time. BT.500-15 §2.6 caps a single session at roughly half an hour to control fatigue, because a tired viewer gives more variable, downward-biased votes. When the voting time pushes past the limit (the design article works the arithmetic), you split into balanced sessions with breaks and re-anchor at the start of each, because calibration decays across a break.
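A rough planning sketch for the split, under stated assumptions: every presentation (clip plus voting) takes the same number of seconds, the first session carries five dummies and later ones three, and instructions and training time are ignored. The function `plan_sessions` and the 25-second figure are hypothetical.

```python
import math


def plan_sessions(n_scored, seconds_per_presentation, cap_minutes=30,
                  dummies_first=5, dummies_later=3):
    """Smallest number of equal sessions that keeps each one under the cap."""
    cap_s = cap_minutes * 60
    for sessions in range(1, n_scored + 1):
        per_session = math.ceil(n_scored / sessions)
        first = (per_session + dummies_first) * seconds_per_presentation
        later = (per_session + dummies_later) * seconds_per_presentation
        longest = first if sessions == 1 else max(first, later)
        if longest <= cap_s:
            return sessions, per_session, longest / 60
    raise ValueError("even one presentation per session exceeds the cap")


# Hypothetical test: 120 scored presentations at 25 s each.
sessions, per_session, longest_min = plan_sessions(120, 25)
print(sessions, per_session, round(longest_min, 1))
```

In practice the cap should leave headroom for instructions and training, so a plan that lands close to thirty minutes is a signal to split further.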

Post-screening: rejecting the unreliable observers

Even with good pre-screening, some viewers vote erratically — they misunderstand the scale, lose attention, or simply rate inconsistently. BT.500-15 gives a defined procedure to find and remove them after the session, and running it is the last execution job. This is screening for reliability, distinct from the eyesight screen at the start, and it is why you recruited a buffer.

The procedure (BT.500-15 Annex 1, §A1-2.3.1) works one presentation at a time. First, decide whether the spread of scores on that presentation is roughly bell-shaped by computing its kurtosis, β2 — a single number, the fourth statistical moment of the scores divided by the square of the second, that measures how heavy the distribution's tails are. If β2 lands between 2 and 4, the scores are treated as normal, and the "expected" band is the mean plus or minus two standard deviations. If not, the band widens to the mean plus or minus the square root of twenty (≈ 4.47) standard deviations. Every time a viewer's score falls above the band you tick a counter Pᵢ for that viewer; every time it falls below, you tick a counter Qᵢ.
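The per-presentation stage can be sketched in a few lines of Python. This is an illustration of the steps as described above, not a reference implementation: the use of the sample standard deviation (dividing by N − 1) for the band and the strict "above" and "below" comparisons are assumptions that should be checked against the recommendation, and the function names are hypothetical.

```python
from math import sqrt


def beta2(scores):
    """Kurtosis coefficient: fourth central moment over the squared second."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    return m4 / m2**2 if m2 > 0 else float("nan")


def band(scores):
    """(low, high) expected band for one presentation."""
    n = len(scores)
    mean = sum(scores) / n
    std = sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))  # assumed: sample std
    k = 2.0 if 2.0 <= beta2(scores) <= 4.0 else sqrt(20.0)
    return mean - k * std, mean + k * std


def count_out_of_band(matrix):
    """matrix[i][j] is viewer i's score on presentation j. Returns (P, Q)."""
    n_viewers, n_pres = len(matrix), len(matrix[0])
    P, Q = [0] * n_viewers, [0] * n_viewers
    for j in range(n_pres):
        column = [matrix[i][j] for i in range(n_viewers)]
        low, high = band(column)
        for i, score in enumerate(column):
            if score > high:
                P[i] += 1   # above the band
            elif score < low:
                Q[i] += 1   # below the band
    return P, Q


# Hypothetical single presentation rated by twenty viewers.
column = [1] + [2] * 4 + [3] * 10 + [4] * 4 + [5]
P, Q = count_out_of_band([[s] for s in column])
print(beta2(column), band(column), P, Q)
```

In this made-up column the kurtosis falls inside the 2-to-4 window, so the narrow two-sigma band applies and only the two extreme votes are counted.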

After every presentation has been processed, two ratios decide each viewer's fate. The first is how often they landed outside the band at all: (Pᵢ + Qᵢ) divided by their total number of scores. The second is how one-sided those misses were: the absolute value of (Pᵢ − Qᵢ) divided by (Pᵢ + Qᵢ). A viewer is rejected when the first ratio exceeds 0.05 and the second is below 0.30 (ITU-R BT.500-15 §A1-2.3.1). In plain terms: reject a viewer who is often outside the band and whose misses are balanced high and low — that pattern is random noise, not a consistent opinion.

Worked example: who gets dropped

Take a session of sixty scored presentations and look at two viewers.

Viewer A:  Pᵢ = 4 (scores above band), Qᵢ = 3 (below), total = 60
  ratio 1 = (Pᵢ + Qᵢ) / total      = (4 + 3) / 60 = 7/60  = 0.117   → > 0.05 ✓
  ratio 2 = |Pᵢ − Qᵢ| / (Pᵢ + Qᵢ)  = |4 − 3| / 7  = 1/7   = 0.143   → < 0.30 ✓
  Both conditions met → REJECT viewer A

Viewer A is outside the band on about 12% of clips, and those misses are split almost evenly high and low, the signature of someone voting randomly. They are dropped. Now a viewer who is out of band almost as often, but differently:

Viewer B:  Pᵢ = 6 (scores above band), Qᵢ = 0 (below), total = 60
  ratio 1 = (6 + 0) / 60 = 6/60 = 0.100   → > 0.05 ✓
  ratio 2 = |6 − 0| / 6   = 6/6  = 1.000   → NOT < 0.30 ✗
  Second condition fails → KEEP viewer B

Viewer B is out of band almost as often (10% of clips), but always on the high side. That is a consistent scale offset, a viewer who simply rates generously rather than randomly, and a consistent offset is real signal the analysis can handle, so BT.500 keeps them. The test is built to catch the random voter, not the systematically lenient one. That distinction is the whole point of the second ratio.
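The two-ratio decision for viewers A and B can be checked with a short function. The thresholds are the ones quoted above; the function name `should_reject` and the handling of a viewer with no out-of-band scores (kept, since the second ratio is undefined) are choices made for this sketch.

```python
def should_reject(p, q, total, freq_limit=0.05, balance_limit=0.30):
    """Reject only if often out of band AND the misses are balanced high/low."""
    misses = p + q
    if misses == 0:
        return False  # never out of band; second ratio is undefined
    often = misses / total > freq_limit
    balanced = abs(p - q) / misses < balance_limit
    return often and balanced


viewer_a = should_reject(p=4, q=3, total=60)  # 7/60 out of band, split 4 vs 3
viewer_b = should_reject(p=6, q=0, total=60)  # 6/60 out of band, all high
print(viewer_a, viewer_b)  # True False
```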

Figure 4. The BT.500-15 observer-rejection test: compute kurtosis, set the band, count out-of-band scores, then apply the two-ratio reject test. Kurtosis sets the band width (2σ if normal, √20·σ if not); a viewer is rejected only when they are frequently out of band AND their misses are balanced high and low, the signature of random voting.

Two guardrails come with the procedure. BT.500 says apply it only once to a given experiment — re-running it until the result looks clean is exactly the rationalization it is meant to prevent — and restrict it to tests with relatively few observers (fewer than twenty), all non-experts (ITU-R BT.500-15 §A1-2.3.1, Note). The deeper questions — confidence intervals on the surviving MOS, significance between conditions, how many subjects you truly need — are the subject of the statistics of subjective data; here the job is just to remove the viewers a defensible test must remove, by a rule fixed before you saw the data.

Lab versus crowd: the same test, run two ways

The execution above describes a controlled lab. The same design can be run on a crowdsourcing platform, where viewers rate clips on their own devices at home, under the framework of ITU-T P.808 (the speech-quality crowdsourcing recommendation whose approach the video field has adopted, covered in crowdsourced subjective testing). The trade is control for scale and speed, and it changes every execution job.

Execution job | Controlled lab | Crowdsourced (P.808-style)
--- | --- | ---
Environment | You set and measure light, display, distance | You control none of it; you can only ask and infer
Pre-screening | Snellen + Ishihara in person | Remote proxy tests; weaker, easier to game
Attention | An operator is in the room | Gold-standard clips, hidden traps, time-on-task gates
Reliability | BT.500 kurtosis rejection on a small panel | Heavy ex-post filtering; rejection rates can be huge
Where it leaks | Small panel, one room, costs more | Uncontrolled rooms; you measure the wild, not the ideal

The numbers make the trade concrete. Because the crowd environment is unmanaged, crowdsourced studies lean on layered defences — qualification screens, hidden "gold" clips with a known answer, attention traps, and behavioural time gates — and then reject hard afterward; one 2025 study rejected about 92% of crowdsourced participants after rigorous consistency screening. Yet with that strict screening and cleansing in place, crowdsourced MOS can correlate with lab MOS at around 0.95, which is why the method is trusted for the right questions. The rule of thumb: a lab answers "which encode is better under ideal viewing," a crowd answers "which is better in the wild," and you must say which question you ran.
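A correlation figure like the one quoted is computed per clip, pairing each clip's lab MOS with its crowd MOS. The sketch below uses the Pearson coefficient, which is an assumption, since the text does not say which correlation measure the study reported; the MOS values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of per-clip MOS."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


lab_mos = [1.8, 2.6, 3.4, 4.1, 4.6]    # hypothetical lab results per clip
crowd_mos = [2.1, 2.7, 3.2, 3.9, 4.3]  # hypothetical cleansed crowd results
r = pearson(lab_mos, crowd_mos)
print(round(r, 3))
```

A high correlation says the crowd ranks and spaces the clips like the lab does; it does not say the absolute MOS values match, which is one more reason to report which setting produced the number.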

Common mistake — recruiting exactly fifteen. The floor of fifteen is fifteen valid observers, counted after the post-session rejection. Recruit exactly fifteen and a single rejected viewer drops you to fourteen and an informal result (BT.500-15 §2.5.1). Recruit a buffer — eighteen to twenty-four — so the kurtosis screen can do its job without sinking the test.

Common mistake — calibrating the screen but not the room. A PLUGE-calibrated display in a sunlit room is still measuring the room. Set and measure the light too — low for a lab, ≈ 200 lux on the screen for a home setup (BT.500-15 §2.1.1–2.1.2) — and fix the viewing distance for everyone.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the execution discipline in this article is what turns a quality opinion into a number we will put in front of a client. When we screen a panel for a conferencing or OTT evaluation, we run the eyesight checks and the BT.500 reliability rejection before any score reaches a decision, and we recruit a buffer so the floor of fifteen valid observers holds. When the question is "how does this look in a real control room or a real living room," we run the home-environment setup at the measured 200 lux rather than an idealized lab, because that is where the footage is actually judged. And when we publish a result, we report the panel, the screening, and the environment alongside it — the same provenance we attach to our benchmark methodology — so the number holds up after the panel goes home.

References

  1. Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector, approved 28 May 2023. Tier 1 (official standard). The controlling source for execution: number of observers and the "informal" threshold (§2.5.1); pre-session screening for visual acuity on the Snellen/Landolt chart and colour vision on Ishihara plates, and panel reporting (§2.5.2); instructions and training on different content (§2.5.3); the ≈ 30-minute session limit, the ≈ 5 (then ≈ 3) discarded dummy presentations, and Graeco-Latin-square ordering (§2.6); the laboratory environment — low light, D65, peak 70–250 cd/m², background ≈ 0.15 × peak (§2.1.1); the home environment — ≈ 200 lux, peak 70–500 cd/m² (§2.1.2); PVD vs DVD viewing distance in picture heights and the 4K 1.6–3.2 H range (§2.1.3); display set-up with a PLUGE pattern (§2.1.6); and the kurtosis-based observer-rejection procedure with the β2 test and the two-ratio (5% / 30%) reject rule (Annex 1, §A1-2.3.1). https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.500-15-202305-I!!PDF-E.pdf
  2. Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union (ITU-T Study Group 12), approved 29 October 2023. Tier 1 (official standard). The companion multimedia recommendation: at least 15 valid subjects in a controlled test and ≈ 35 for an uncontrolled/public environment (§10.1); the visual-acuity threshold of no error on the 20/30 line of a standard chart with correction allowed, and colour-vision screening; the controlled-vs-uncontrolled environment decision (§9); training, screening, and session structure (§12). https://www.itu.int/rec/T-REC-P.910-202310-I/en
  3. Recommendation ITU-R BT.814-4 (07/2018), "Specifications of PLUGE test signals and alignment of the parameters of displays for analogue and digital television systems," International Telecommunication Union, Radiocommunication Sector. Tier 1 (official standard). The PLUGE (Picture Line-Up Generation Equipment) signal used to set display brightness and contrast under the room illumination, referenced by BT.500-15 §2.1.6 for display calibration. https://www.itu.int/rec/R-REC-BT.814
  4. Recommendation ITU-T P.808 (06/2021), "Subjective evaluation of speech quality with a crowdsourcing approach," International Telecommunication Union (ITU-T Study Group 12). Tier 1 (official standard). The crowdsourcing framework whose layered screening and gold-standard/attention-check approach the video field has adopted; the boundary between this article's controlled-lab execution and a crowd run. https://www.itu.int/rec/T-REC-P.808
  5. Recommendation ITU-R BT.2022 (08/2012), "General viewing conditions for subjective assessment of quality of SDTV and HDTV television pictures on flat panel displays," International Telecommunication Union, Radiocommunication Sector. Tier 1 (official standard). Companion guidance on viewing conditions and viewing distance on flat-panel displays, consistent with the BT.500-15 environment specification. https://www.itu.int/rec/R-REC-BT.2022
  6. M. H. Pinson, S. Wolf, "Comparing subjective video quality testing methodologies," Visual Communications and Image Processing (VCIP) 2003, SPIE vol. 5150, pp. 573–582. Tier 5 (peer-reviewed). Evidence on how execution choices — panel, environment, and order — drive the reliability of the resulting scores, and on systematic differences between testing laboratories. https://its.ntia.gov/publications/download/spie03obj.pdf
  7. Video Quality Experts Group (VQEG), "HDTV Test Plan" (v3.1) and the FR-TV subjective test plans. Tier 5 (institutional). The practical source of the Graeco-Latin-square presentation ordering, the buffered-panel and observer-screening procedures, and the multi-lab balancing used across the field. https://vqeg.org/media/5871/vqeg_hdtv_testplan_v3_1.doc
  8. "Ensuring Reliable Participation in Subjective Video Quality Tests Across Platforms," arXiv:2509.20001 (2025). Tier 5 (peer-reviewed preprint). Recency anchor on crowdsourced execution: the layered-defence screening stack (qualification, gold clips, attention traps, time gates), the large ex-post rejection rates observed (≈ 92% in one consistency screen), and the ≈ 0.95 correlation of cleansed crowdsourced MOS with laboratory MOS. https://arxiv.org/abs/2509.20001
  9. T. Hoßfeld, M. Hirth, et al., "Best Practices for QoE Crowdtesting: QoE Assessment with Crowdsourcing," IEEE Transactions on Multimedia, 2014. Tier 5 (peer-reviewed). Foundational best-practice guidance on screening, reliability checks, and rejection in crowdsourced subjective testing, supporting the lab-vs-crowd execution comparison. https://ieeexplore.ieee.org/document/6705599