Why this matters

The point of a subjective test is to produce a number an engineer can act on and a skeptic cannot dismiss, and that only happens if the statistics are right. This article is for the engineer, QA lead, or researcher who has already run the test and now has a spreadsheet of votes to turn into a decision: is encoder B actually better than encoder A, or is the gap just noise? It is the analysis half of the subjective-testing block — the design and the run come first, the inference comes here. Get it wrong and you will ship a codec change that no viewer can see, or kill one that every viewer could. The encoder-operator's one-screen version of subjective testing lives in the Video Encoding section's subjective-testing overview; this is the deep treatment of the numbers.

From a column of votes to a number you can defend

A subjective test ends with a matrix: one row per subject, one column per clip, each cell a rating. The instinct is to average each column and call those averages the result. The average is the right starting point: the Mean Opinion Score, written MOS, is exactly that per-clip average of the subjects' ratings, defined in full in the article on MOS, DMOS, and the rating scales. But an average of 24 opinions is a sample estimate of what the whole population would have said, and a sample estimate is only as trustworthy as its spread and its size. Treating the bare average as the truth is the single most common statistical error in video quality work.

Four jobs stand between the raw matrix and a defensible result, and the rest of this article is one per section. First, put a confidence interval on every MOS, so a reader sees the uncertainty, not just the point. Second, find and handle the subjects whose votes are unreliable, by a rule fixed before you looked at the data. Third, test whether the difference between two conditions is real or within the noise. Fourth — logically first, but easiest to understand last — work out how many subjects the test needed to resolve the difference you cared about. Skip any one and the number looks finished but is not.

Figure 1. From votes to a defensible number. The MOS is the column average; the spread of the column (SOS) and the panel size (N) set the standard error, which sets the confidence interval. The point without the interval is half a result.

MOS is an estimate, not a measurement: the confidence interval

Start with the spread. The standard deviation of the opinion scores for one clip — how far the individual votes sit from their own average, written SOS in the standards — is the raw measure of how much the panel disagreed about that clip. A clip everyone rated 4 has a tiny SOS; a clip split between 2s and 5s has a large one. The SOS is not a nuisance to be averaged away; it is the input that tells you how much to trust the MOS.

From the spread and the count you get the standard error of the mean — the typical distance between your sample's MOS and the true population MOS. The standard error is the standard deviation divided by the square root of the number of subjects:

standard error = SOS / sqrt(N)

That square root is the whole story of sample size, and we return to it. The 95% confidence interval — the band that would contain the true MOS in 95 of 100 repeats of the test — is the MOS plus or minus a multiple of that standard error. ITU-R BT.500-15 (05/2023), Annex 1, uses the normal-distribution multiplier 1.96:

95% CI  =  MOS  ±  1.96 × (SOS / sqrt(N))

ITU-T P.910 (10/2023) is more careful for the small panels subjective tests actually use. With few subjects you have estimated the spread from the same small sample, so you use the Student's t-distribution — a slightly fatter-tailed cousin of the normal curve — and the multiplier is no longer a fixed 1.96 but a t-value that depends on the panel size. For 24 subjects the multiplier is about 2.07; for 15 subjects about 2.14; for 6 subjects about 2.57. The t-version is always a little wider than the 1.96 version, and that extra width is honesty about having measured the spread from a handful of people.

Work a real number. Suppose a clip was rated by 24 subjects, the MOS came out at 3.80 on the 5-point scale, and the SOS (the spread of those 24 votes) was 0.90.

standard error = SOS / sqrt(N)   = 0.90 / sqrt(24)  = 0.90 / 4.899 = 0.184
BT.500 (1.96):  CI half-width    = 1.96 × 0.184      = 0.360
  → MOS = 3.80 ± 0.36  →  [3.44, 4.16]
P.910 (t = 2.069, 23 df): half-width = 2.069 × 0.184 = 0.380
  → MOS = 3.80 ± 0.38  →  [3.42, 4.18]

The clip's "score" is not 3.80. It is "somewhere between about 3.4 and 4.2, most likely near 3.8." Report the interval, not just the point — a MOS without a confidence interval is an opinion about an opinion. And notice the practical message hiding in the standard error: because it divides by the square root of N, halving the interval takes four times the subjects, not twice. That single fact governs every sample-size decision later in this article.
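The arithmetic above is easy to script. The following is a minimal Python sketch using only the standard library. The function names are illustrative, the SOS is assumed to be the sample standard deviation (N - 1 in the denominator), and the t multiplier is typed in by hand (2.069 for 23 degrees of freedom, the value used above) because the standard library has no t-distribution quantile.

```python
import math
from statistics import NormalDist, mean, stdev


def ci_half_width(sos, n, multiplier):
    """Half-width of the confidence interval: multiplier * SOS / sqrt(N)."""
    return multiplier * sos / math.sqrt(n)


def mos_with_ci(votes, multiplier):
    """MOS, SOS and CI half-width for one clip's column of votes.

    Assumption: SOS is the sample standard deviation (N - 1 denominator).
    """
    mos = mean(votes)
    sos = stdev(votes)
    return mos, sos, ci_half_width(sos, len(votes), multiplier)


Z_95 = NormalDist().inv_cdf(0.975)  # normal multiplier, about 1.96
T_95_23DF = 2.069                   # Student-t multiplier for 24 subjects, typed in

# The worked clip: MOS 3.80, SOS 0.90, N 24.
half_normal = ci_half_width(0.90, 24, Z_95)
half_t = ci_half_width(0.90, 24, T_95_23DF)
print(f"normal multiplier: 3.80 ± {half_normal:.2f}")
print(f"t multiplier:      3.80 ± {half_t:.2f}")
```

Run on the worked clip, this reproduces the ±0.36 and ±0.38 half-widths. For other panel sizes the t multiplier has to come from a table or a statistics package.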

How wide should the spread be? The SOS hypothesis

Before trusting a panel's spread, it helps to know what a normal spread even looks like. Here the most useful result in modern subjective-quality statistics is the SOS hypothesis, introduced by Hoßfeld, Schatz, and Egger in 2011. It says that across a well-run test, the disagreement between subjects is not random clip to clip — it follows a predictable square law tied to the MOS. Disagreement is smallest at the ends of the scale, where almost everyone agrees a clip is excellent or unwatchable, and largest in the middle, where "fair versus good" is a genuine judgement call.

For the 5-point scale (lowest rating 1, highest 5), the hypothesis writes the variance (the SOS squared) as a single parabola:

SOS²(x) = a × ( -x² + 6x - 5 )          x = MOS, scale 1..5

The one free number, a, is the SOS parameter. It runs from 0 (every subject agreed perfectly on every clip — impossibly clean) to 1 (subjects split to the extremes — maximum disorder). It summarizes the inter-subject disagreement of the whole test in one value, which makes it a fast sanity check and a fair way to compare the noisiness of two experiments.

Use it on the worked clip. At MOS = 3.80 the parabola's bracket is -(3.80²) + 6(3.80) - 5 = -14.44 + 22.80 - 5 = 3.36. The observed SOS was 0.90, so the observed variance is 0.90² = 0.81, and the implied parameter is a = 0.81 / 3.36 = 0.24. An a near 0.2–0.25 is unremarkable for a careful lab ACR test, so this clip's spread is normal, not alarming. Had the same MOS come with an SOS of 1.6, the implied a would be 2.56 / 3.36 = 0.76 — a red flag that the scale was confusing, the content was inconsistent, or the panel was contaminated, and a cue to look at the votes before trusting the MOS.
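The same check can be scripted for one clip or for a whole test. The sketch below is illustrative Python, not code from the SOS paper: the function names are made up, and fitting one a across all clips by least squares through the origin is an assumption about how you might pool the clips.

```python
def parabola(mos, low=1.0, high=5.0):
    """Bracket of the SOS hypothesis, (mos - low) * (high - mos).

    For the 1..5 scale this equals -mos**2 + 6*mos - 5.
    """
    return (mos - low) * (high - mos)


def sos_parameter(mos, sos):
    """Implied SOS parameter a for a single clip."""
    bracket = parabola(mos)
    if bracket <= 0:
        raise ValueError("MOS must lie strictly between the scale ends")
    return sos ** 2 / bracket


def fit_sos_parameter(points):
    """One a for a whole test from (MOS, SOS) pairs, one pair per clip.

    Assumption: least squares through the origin on variance vs. bracket.
    """
    numerator = sum(parabola(m) * s ** 2 for m, s in points)
    denominator = sum(parabola(m) ** 2 for m, _ in points)
    return numerator / denominator


print(round(sos_parameter(3.80, 0.90), 2))  # the worked clip
print(round(sos_parameter(3.80, 1.60), 2))  # the red-flag case
```

The two printed values match the 0.24 and 0.76 worked out above. A single-clip a is noisy; the fitted value over all clips is the one to compare between experiments.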

Figure 2. The SOS hypothesis. Plotting subject disagreement (SOS) against the MOS gives an inverted parabola: agreement is high at the scale extremes and lowest at the middle. The single parameter a sets the height of the whole family of curves and summarizes a test's noisiness.

Rejecting unreliable subjects: three families of method

Some subjects vote in a way that does not reflect quality: they misread the scale, lose attention, or rate at random. Their votes widen every interval and can shift a MOS. The previous article, on running the test, introduced the one rejection rule every lab should run; the statistics block is where the alternatives belong, because the choice of method changes the result. There are three families, and they trade simplicity for accuracy.

The first family is the kurtosis rule of ITU-R BT.500-15, Annex 1. It is a hard outlier test. For each clip it asks whether the votes are bell-shaped (via the kurtosis, the fourth standardized moment) and sets an expected band around the clip mean: two standard deviations on each side if they are, √20 ≈ 4.47 standard deviations if they are not. It then counts how often each subject lands above and below the band, and rejects a subject who is frequently outside it and whose misses are balanced high and low, the signature of random voting. The full procedure and a worked example are in the previous article; BT.500 says to apply it only once, and only to small panels of non-experts. Its virtue is that it is fixed and auditable; its limit is that it is a blunt yes/no with hand-set thresholds.
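A compact sketch of a kurtosis-style screen may help make the counting concrete. It is written in Python under stated assumptions, not copied from the standard: the "bell-shaped" test is taken as a kurtosis between 2 and 4, "frequently outside" as more than 5% of a subject's votes, and "balanced" as a high/low imbalance under 30%. None of those three numbers appears in this article, so check them against Annex 1 before relying on them. The panel data is hypothetical.

```python
import math


def bt500_style_screen(ratings, outside_share=0.05, balance=0.30):
    """Return the indices of subjects a kurtosis-style screen would reject.

    ratings[s][c] is the vote of subject s on clip c.
    outside_share and balance are assumed thresholds (see the text above).
    """
    n_subjects = len(ratings)
    n_clips = len(ratings[0])
    above = [0] * n_subjects  # votes above the band, per subject
    below = [0] * n_subjects  # votes below the band, per subject

    for c in range(n_clips):
        column = [ratings[s][c] for s in range(n_subjects)]
        mean = sum(column) / n_subjects
        m2 = sum((v - mean) ** 2 for v in column) / n_subjects
        m4 = sum((v - mean) ** 4 for v in column) / n_subjects
        if m2 == 0:
            continue  # unanimous clip: nobody can be outside the band
        kurtosis = m4 / m2 ** 2
        # Band multiplier: 2 if the votes look bell-shaped, sqrt(20) otherwise.
        k = 2.0 if 2.0 <= kurtosis <= 4.0 else math.sqrt(20.0)
        sd = math.sqrt(m2 * n_subjects / (n_subjects - 1))  # sample std dev
        for s, vote in enumerate(column):
            if vote >= mean + k * sd:
                above[s] += 1
            elif vote <= mean - k * sd:
                below[s] += 1

    rejected = []
    for s in range(n_subjects):
        outside = above[s] + below[s]
        if outside == 0:
            continue
        often = outside / n_clips > outside_share
        balanced = abs(above[s] - below[s]) / outside < balance
        if often and balanced:
            rejected.append(s)
    return rejected


# Hypothetical panel: ten subjects, four clips; subject 9 votes against the panel.
clip_a = [3, 3, 5, 5, 4, 4, 4, 3, 5, 1]
clip_b = [3, 3, 1, 1, 2, 2, 2, 3, 1, 5]
panel = [[a, b, a, b] for a, b in zip(clip_a, clip_b)]
print(bt500_style_screen(panel))
```

On this toy panel only subject 9 is flagged, because their misses fall both above and below the band. A subject who is always low would be counted as outside but not as balanced, and so would be kept.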

The second family is correlation-based screening, the method in ITU-T P.910 Annex A. It computes, for each subject, how well their ratings track the panel's MOS — the linear Pearson correlation coefficient (and often the rank-based Spearman correlation too) between that subject's column and the per-clip means. A subject who truly perceives quality correlates strongly with the group; a random voter correlates near zero. Subjects below a correlation threshold are removed. This catches a subject who is consistently wrong (negatively or weakly correlated) even when their individual votes never look like extreme outliers — a failure mode the kurtosis test can miss.
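The correlation screen is short enough to sketch directly. The Python below is an illustration, not the Annex A procedure verbatim: the function names are invented, the 0.75 cut-off is a placeholder rather than a value taken from this article, and the ratings are hypothetical.

```python
import math


def pearson(x, y):
    """Linear Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_by_correlation(ratings, threshold):
    """Correlate each subject's row with the per-clip MOS; flag low scorers.

    ratings[s][c] is the vote of subject s on clip c.
    Returns (per-subject correlations, indices of subjects below threshold).
    """
    n_subjects = len(ratings)
    n_clips = len(ratings[0])
    clip_mos = [sum(row[c] for row in ratings) / n_subjects for c in range(n_clips)]
    correlations = [pearson(row, clip_mos) for row in ratings]
    rejected = [s for s, r in enumerate(correlations) if r < threshold]
    return correlations, rejected


# Hypothetical panel: subject 4 rates the clips in reverse order of quality.
ratings = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 5],
    [2, 2, 3, 4, 5],
    [1, 3, 3, 4, 4],
    [5, 4, 3, 2, 1],
]
correlations, rejected = screen_by_correlation(ratings, threshold=0.75)
print([round(r, 2) for r in correlations], rejected)
```

Note that each subject's own votes are part of the MOS they are compared against, which flatters every correlation slightly on a small panel. A stricter variant would compare each subject with the mean of the others.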

The third family is the subject-behaviour model, the most recent and now the standard-bearer. In 2020 Zhi Li and Christos Bampis of Netflix proposed a model that treats each vote as the true quality plus two per-subject effects: a bias (a subject who rates everything half a point high) and an inconsistency (a subject whose votes scatter widely regardless of quality). Maximum-likelihood estimation recovers the true quality, each subject's bias, and each subject's inconsistency at once, then forms a bias-subtracted, consistency-weighted MOS: a weighted average that quietly trusts steady subjects more and erratic ones less, instead of bluntly keeping or dropping them. The model was standardized in ITU-T P.913 (06/2021), clause 12.6, and in the companion procedure in ITU-T P.910 (10/2023), and Netflix's open-source sureal library implements it. Its reported advantages over the hard tests are tighter confidence intervals, better resistance to outliers, and no hand-set thresholds. This is the difference between hard outlier handling (keep or reject) and soft handling (weight by reliability); a 2025 study comparing the two found that soft, model-based handling generally recovers cleaner scores.
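To show what "bias-subtracted, consistency-weighted" means in practice, here is a simplified alternating-estimation sketch in Python. It is written in the spirit of the model and is neither the sureal implementation nor the exact procedure in the standard: the function name, the fixed iteration count, and the small floor on the inconsistency (added to avoid dividing by zero) are all assumptions, and the data is hypothetical.

```python
import math


def recover_scores(ratings, iterations=50, floor=1e-3):
    """Estimate per-clip quality, per-subject bias and per-subject inconsistency.

    ratings[s][c] is the vote of subject s on clip c (no missing votes).
    Alternates between (1) bias and inconsistency given the current quality
    and (2) quality as a weighted mean of bias-corrected votes.
    """
    n_subjects = len(ratings)
    n_clips = len(ratings[0])
    # Start from the plain MOS.
    quality = [sum(row[c] for row in ratings) / n_subjects for c in range(n_clips)]
    bias = [0.0] * n_subjects
    spread = [1.0] * n_subjects

    for _ in range(iterations):
        for s, row in enumerate(ratings):
            residual = [row[c] - quality[c] for c in range(n_clips)]
            bias[s] = sum(residual) / n_clips
            variance = sum((e - bias[s]) ** 2 for e in residual) / n_clips
            spread[s] = max(math.sqrt(variance), floor)  # floor is an assumption
        weights = [1.0 / sp ** 2 for sp in spread]  # steady subjects weigh more
        total = sum(weights)
        quality = [
            sum(w * (row[c] - b) for w, row, b in zip(weights, ratings, bias)) / total
            for c in range(n_clips)
        ]
    return quality, bias, spread


# Hypothetical panel: two steady subjects and one erratic one.
votes = [[2, 3, 4, 5], [2, 3, 4, 5], [4, 2, 5, 3]]
quality, bias, spread = recover_scores(votes)
plain_mos = [sum(row[c] for row in votes) / len(votes) for c in range(4)]
print([round(q, 2) for q in quality], [round(m, 2) for m in plain_mos])
```

On this toy panel the recovered scores sit close to the two steady subjects, while the plain MOS is pulled toward the erratic one. With so few subjects the weighting is extreme; for real data, use a maintained implementation and its confidence intervals.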

Screening method | What it does | What it catches | Limits
BT.500-15 kurtosis (hard) | Rejects subjects who are often outside a 2σ (or √20·σ) band with misses balanced high and low | The random voter | Blunt yes/no; hand-set thresholds; small non-expert panels only; applied once
P.910 Annex A correlation (hard) | Drops subjects whose ratings correlate weakly with the MOS | The consistently wrong voter the kurtosis test misses | Still a threshold cut; a real minority opinion can look like low correlation
Subject-behaviour model (soft) | MLE recovers true quality plus per-subject bias and inconsistency; weights votes by consistency | Bias and inconsistency separately, without discarding data | Heavier to compute; needs a library (e.g. Netflix sureal); newer, less familiar to reviewers

Whichever you choose, fix it before you see the data and report it with the result. The single forbidden move is trying methods until the answer you wanted appears.

Significance testing: did these two conditions actually differ?

This is the question the whole test exists to answer. You have MOS_A for encoder A and MOS_B for encoder B, each with a confidence interval. Is B really better, or is the gap inside the noise?

The fast visual check is the error bars. If the two 95% confidence intervals do not overlap, the difference is significant at roughly the 95% level, which makes it a safe call. But the reverse is a trap: intervals that do overlap can still be significantly different. The test that matters is on the difference, and its standard error, sqrt(SE_A² + SE_B²), is smaller than the sum SE_A + SE_B that the overlap check implicitly uses. Overlapping bars are a reason to compute, not a verdict of "no difference."

For two conditions, the proper tool is a t-test on the difference. When the same subjects rated both conditions — a within-subject test, the usual case — use the paired t-test, which cancels each subject's personal bias and is markedly more powerful. When different subjects rated each condition, use the two-sample t-test. Work the two-sample version, since it shows the arithmetic plainly. Encoder A: MOS 3.80, SOS 0.90, N 24. Encoder B: MOS 4.10, SOS 0.80, N 24. The difference is 0.30.

SE_diff = sqrt( SOS_A²/N_A + SOS_B²/N_B )
        = sqrt( 0.90²/24 + 0.80²/24 )
        = sqrt( 0.0338 + 0.0267 ) = sqrt(0.0605) = 0.246
t = (MOS_B - MOS_A) / SE_diff = 0.30 / 0.246 = 1.22
critical t (95%, ~46 df) ≈ 2.01
1.22 < 2.01  →  NOT statistically significant

A 0.30-point gap, with these spreads and 24 subjects each, is not resolvable — it sits inside the noise. That is not a quirk of the example; it matches the standard's own precision figures. ITU-T P.910 reports, from Pinson's 2020 measurements, that a 5-point ACR test "rarely yields" a resolvable difference (its ΔSCI) below about 0.5 for 24 subjects, 0.7 for 15, 1.1 for 9, and 1.5 for 6. Our 0.30 difference is below the ~0.5 floor for 24 subjects, so failing to reach significance is exactly what the standard predicts. If B really is better, you need a bigger panel or a more sensitive method (paired design, or the subject-behaviour model) to prove it.
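Both tests reduce to a few lines. The Python sketch below uses only the standard library; the function names and the paired votes are illustrative, and the critical values are typed in from the figures quoted in this article (about 2.01 for roughly 46 degrees of freedom, about 2.57 for 6 subjects) rather than computed.

```python
import math
from statistics import mean, stdev


def two_sample_t(mos_a, sos_a, n_a, mos_b, sos_b, n_b):
    """t statistic for different subjects per condition (unpooled variances)."""
    se_diff = math.sqrt(sos_a ** 2 / n_a + sos_b ** 2 / n_b)
    return (mos_b - mos_a) / se_diff


def paired_t(votes_a, votes_b):
    """t statistic when the same subjects rated both conditions.

    votes_a[i] and votes_b[i] must come from the same subject.
    Raises ZeroDivisionError if every subject has the same difference.
    """
    diffs = [b - a for a, b in zip(votes_a, votes_b)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se


# The worked example: encoder A vs encoder B, 24 subjects each.
t_two_sample = two_sample_t(3.80, 0.90, 24, 4.10, 0.80, 24)
print(f"two-sample t = {t_two_sample:.2f}, significant: {abs(t_two_sample) > 2.01}")

# Hypothetical paired data from six subjects who saw both encoders.
votes_a = [3, 4, 3, 4, 4, 3]
votes_b = [4, 4, 4, 5, 4, 4]
t_paired = paired_t(votes_a, votes_b)
print(f"paired t = {t_paired:.2f}, significant: {abs(t_paired) > 2.571}")
```

The first line reproduces t = 1.22, not significant. The paired toy data clears its threshold with only six subjects because each subject's own offset cancels in the differences, which is why the paired design is the more sensitive one.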

When you compare three or more conditions, do not run all the pairwise t-tests uncorrected. Twenty conditions make 190 pairs, and at a 5% error rate per test you would expect nine or ten "significant" results by chance alone (190 × 0.05 = 9.5). The correct tool is analysis of variance (ANOVA), which asks one global question (does any condition differ?), followed by a post-hoc procedure that corrects for the many comparisons: Tukey's HSD test, or a Bonferroni correction that divides the threshold by the number of comparisons. The correction is not pedantry; it is the difference between a finding and a coincidence.
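The size of the multiple-comparisons problem is simple arithmetic, sketched here in Python with illustrative function names. The family-wise error figure assumes the tests are independent, which pairwise comparisons sharing conditions are not, so read it as a rough upper-end indication.

```python
import math


def pairwise_comparisons(n_conditions):
    """Number of distinct pairs among n conditions."""
    return math.comb(n_conditions, 2)


def expected_false_positives(n_conditions, alpha=0.05):
    """Expected count of chance 'significant' pairs when no real difference exists."""
    return pairwise_comparisons(n_conditions) * alpha


def bonferroni_threshold(n_conditions, alpha=0.05):
    """Per-comparison threshold after a Bonferroni correction."""
    return alpha / pairwise_comparisons(n_conditions)


def familywise_error(n_conditions, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** pairwise_comparisons(n_conditions)


n = 20
print(pairwise_comparisons(n))             # pairs to compare
print(expected_false_positives(n))         # chance findings at 5% per test
print(f"{bonferroni_threshold(n):.6f}")    # corrected per-test threshold
print(f"{familywise_error(n):.4f}")        # uncorrected family-wise error
```

With twenty conditions the corrected per-test threshold drops to roughly 0.00026, which shows how much evidence each individual pair must carry once the whole family is accounted for.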

Figure 3. Significance is a test on the difference. Non-overlapping 95% intervals imply a significant difference; overlapping intervals do not imply the opposite. The paired or two-sample test on the difference decides, and a 0.30 gap at N = 24 falls inside the noise (ΔSCI ≈ 0.5).

How many subjects do you actually need?

Everything above points back to one decision made before the test: the panel size. The standard error divides by √N, so the smallest difference you can resolve shrinks with the square root of the panel, and that relationship, run backwards, sizes the test. P.910's precision figures are the practical table: to resolve differences of about 0.5 on a 5-point ACR scale you want 24 subjects; 15 only gets you to about 0.7; six subjects get you to about 1.5, which is more than a third of the scale's 4-point range and useless for anything subtle.

This is why the headline number changed. For years the quoted floor was 15, from ITU-R BT.500-15 clause 2.5. But in 2018 Brunnström and Barkowsky ran the power analysis — balancing the risk of missing a real effect against the risk of claiming a false one — and found that 15 is often not enough to declare the differences teams care about, while the 24 long used by the Video Quality Experts Group usually is. ITU-T P.910 (10/2023) adopted that conclusion: clause 10.1 now states that at least 24 subjects must be used for a controlled environment (every stimulus rated by ≥24 after screening), and at least 35 for an uncontrolled environment. The old 15 is now explicitly a pilot-study size, fit for trend-finding, not for a result you publish. If a vendor shows you a "significant" quality win from a 12-person test, that is the first thing to question.

Figure 4. Sensitivity scales as 1/√N. The smallest resolvable MOS difference (ΔSCI) by panel size, from the P.910 figures: 1.5 at 6 subjects, 0.7 at 15 (pilot only), 0.5 at 24 (the controlled floor). Halving the resolvable difference costs four times the subjects, not twice.

You can size a test properly instead of reaching for a rule of thumb. A power analysis takes the difference you need to detect, the spread you expect (estimate it from a pilot or from the SOS hypothesis), and the error rates you will tolerate, and returns the minimum N; VQEG publishes a "Number of Subjects" tool for exactly this, referenced by P.910. As a back-of-envelope guide, because the resolvable difference scales as 1/√N, going from the 0.5 that 24 subjects buy down to a 0.3 target needs roughly 24 × (0.5/0.3)² ≈ 67 subjects. Small quality differences are expensive to prove, and pretending otherwise with a tiny panel is how teams ship invisible changes.
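The back-of-envelope scaling fits in one function. This Python sketch is not a power analysis: it only rescales the reference point quoted above (a resolvable difference of about 0.5 at 24 subjects) by the 1/√N law, and the function name and defaults are illustrative.

```python
import math


def subjects_for_target(target_diff, ref_diff=0.5, ref_subjects=24):
    """Rough panel size needed to resolve target_diff on a 5-point ACR scale.

    Assumes the resolvable difference scales as 1/sqrt(N) from the reference
    point (ref_diff at ref_subjects). Rounds up to whole subjects.
    """
    if target_diff <= 0:
        raise ValueError("target_diff must be positive")
    return math.ceil(ref_subjects * (ref_diff / target_diff) ** 2)


for target in (0.5, 0.3, 0.25):
    print(target, subjects_for_target(target))
```

The 0.3 target returns 67, matching the estimate above, and halving the target from 0.5 to 0.25 quadruples the panel from 24 to 96. For a result you intend to publish, run a real power analysis with your own expected spread and error rates.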

Common mistake — reporting a MOS with no confidence interval. A bare MOS of 4.1 hides whether the panel was unanimous or split down the middle. Always publish the interval (MOS ± 1.96·SOS/√N, or the Student-t version for small N). A table of point values with no error bars is not a result a reviewer can check.

Common mistake — "the error bars overlap, so there's no difference." Overlapping 95% intervals do not mean the conditions are equal; the test is on the difference, whose standard error is smaller than the sum of the two individual standard errors that the overlap check implicitly compares against. Non-overlap proves significance; overlap proves nothing, so you must run the paired or two-sample test.

Common mistake — trusting a 12-person "significant" result. Under ITU-T P.910 (10/2023) a controlled test needs at least 24 valid subjects; 15 is a pilot size and anything below resolves only large, obvious gaps. A small panel can produce a significant-looking number that a proper-sized test would erase.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the statistics in this article are how we keep a quality claim from outrunning its evidence. When we compare two encoder settings for a streaming or OTT client, we report each MOS with its confidence interval, run a paired test on the difference rather than eyeballing the bars, and size the panel to the smallest difference the client actually cares about — so a "win" we report is one a larger test would also find. When a panel's spread looks wrong, the SOS parameter tells us before the MOS misleads anyone. We treat 24 valid subjects as the floor for a controlled result, in line with the current P.910, and we attach the same provenance to a subjective number that we attach to our benchmark methodology: the panel, the screening rule, and the test, stated so the result holds up after the room is empty.

References

  1. Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union (ITU-T Study Group 12), approved 29 October 2023. Tier 1 (official standard). The controlling source for this article: at least 24 valid subjects in a controlled environment and at least 35 in an uncontrolled one, with 15 reduced to a pilot-study size (§10.1); the ΔSCI precision figures (0.5 for 24, 0.7 for 15, 1.1 for 9, 1.5 for 6 subjects; §8.1.1); the use of the Student's t-distribution for 95% confidence intervals; correlation-based post-screening (Annex A) and the bias-subtracted, consistency-weighted subject-model screening (§13.4, §13.6); standard deviation of scores (SOS) notation. Read directly from the ITU PDF on 2026-06-24. https://www.itu.int/rec/T-REC-P.910-202310-I/en
  2. Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector, approved 28 May 2023. Tier 1 (official standard). The 95% confidence interval as MOS ± 1.96·S/√N (Annex 1); the kurtosis-based hard observer-rejection procedure (the β2 test, 2σ/√20·σ band, and the two-ratio reject rule, Annex 1 §A1-2.3.1); the 15-observer figure now treated by P.910 as a pilot size. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.500-15-202305-I!!PDF-E.pdf
  3. Recommendation ITU-T P.913 (06/2021), "Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment," International Telecommunication Union (ITU-T Study Group 12). Tier 1 (official standard). Clause 12.6 standardizes the maximum-likelihood subject-behaviour model (per-subject bias and inconsistency, bias-subtracted consistency-weighted MOS) as the recommended subject-screening and score-recovery method. https://www.itu.int/rec/T-REC-P.913
  4. Recommendation ITU-T P.1401 (07/2020), "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union (ITU-T Study Group 12). Tier 1 (official standard). The companion statistical-evaluation recommendation: confidence intervals, the role of the Student's t-test, RMSE, outlier ratio, and the statistical-significance procedure for comparing models against subjective data. Cited here for the significance-testing framework; applied in full in the metric-validation article. https://www.itu.int/rec/T-REC-P.1401
  5. T. Hoßfeld, R. Schatz, S. Egger, "SOS: The MOS is not enough!", 3rd International Workshop on Quality of Multimedia Experience (QoMEX), 2011. Tier 5 (peer-reviewed). The SOS hypothesis: the quadratic relationship between MOS and the standard deviation of opinion scores, the single parameter a bounded in [0,1], and its use as a per-test sanity check on subject disagreement. https://www.researchgate.net/publication/220773357_SOS_The_MOS_is_not_enough
  6. Z. Li, C. G. Bampis, "A Simple Model for Subject Behavior in Subjective Experiments," Electronic Imaging, 2020 (arXiv:2004.02067). Tier 5 (peer-reviewed). The maximum-likelihood subject-behaviour model (subject bias and inconsistency) underlying ITU-T P.913 §12.6 and the Netflix sureal implementation; the source of the "tighter confidence intervals, better outlier robustness, no hard-coded thresholds" claims for soft, model-based screening. https://arxiv.org/abs/2004.02067
  7. K. Brunnström, M. Barkowsky, "Statistical quality of experience analysis: on planning the sample size and statistical significance testing," Journal of Electronic Imaging, vol. 27, no. 5, 053013, 2018. Tier 5 (peer-reviewed). The power analysis showing the ITU 15-subject figure is often insufficient and the VQEG 24 is more likely sufficient — the evidence behind the P.910 (2023) move to ≥24; the Type I / Type II error trade-off and the sample-size calculation. https://www.spiedigitallibrary.org/journals/journal-of-electronic-imaging/volume-27/issue-5/053013/Statistical-quality-of-experience-analysis--on-planning-the-sample/10.1117/1.JEI.27.5.053013.short
  8. M. H. Pinson, "Confidence intervals for subjective tests and objective metrics that assess image, video, speech, or audiovisual quality," NTIA Technical Report TR-21-550, 2020. Tier 5 (institutional). The measurements behind P.910's ΔSCI precision figures (the resolvable-difference values by subject count for the ACR method). https://its.ntia.gov/publications/details/?pub=2766
  9. Netflix, "sureal — Subjective quality scores recovery from noisy measurements," open-source library. Tier 3 (first-party tooling). The reference implementation of the subject-behaviour MLE model (P.913 §12.6); used to compute bias-subtracted, consistency-weighted scores and their confidence intervals in practice. https://github.com/Netflix/sureal
  10. "Robustness and accuracy of mean opinion scores with hard and soft outlier detection," arXiv:2509.06554, 2025. Tier 5 (peer-reviewed preprint). Recency anchor comparing hard outlier rejection (the BT.500 kurtosis and correlation cuts) with soft, model-based handling, and the conditions under which each recovers cleaner scores. https://arxiv.org/abs/2509.06554