Why this matters

If your product carries sound — a video conferencing app, a streaming service, a telemedicine platform, an e-learning tool — you will eventually face a question no automated number can settle: does this actually sound better to a person? A new codec, a louder dialogue mix, an aggressive noise suppressor, a spatial-audio feature: each promises an improvement that only a human can confirm or deny. Objective metrics flag regressions cheaply, but when the change is novel, or when you are choosing between two options a customer will judge with their ears, you have to ask people. The problem is that listening tests are easy to run badly: pick the wrong method, too few listeners, or no statistics, and you will ship a decision built on a coin flip you mistook for evidence. This article is for the product manager or founder who has to commission or interpret a listening test and wants to know what the result is worth, and for the engineer who has to run one. By the end you will know which of the four methods fits your question, how many listeners that method needs, and the three mistakes that turn a clear-looking chart into an illusion.


Why we still ask humans

Every objective audio quality metric in our companion piece — PESQ, POLQA, and ViSQOL — exists to predict one thing: what a panel of human listeners would have said. The metric is the understudy; the listening test is the lead. So when the metric's prediction is in doubt — a new neural codec it was never trained on, a denoiser that does something the model has never seen — you go back to the source and ask people directly.

Asking people sounds simple. It is not. The result of a listening test is a number, and like any measurement it carries noise: the same listener rates the same clip differently on Monday and Friday; one listener is harsh, another generous; a clip that sounds great on studio headphones sounds muddy on a laptop. A subjective test method is the set of rules — what you play, what you ask, who listens, how you score — designed to average that noise out until a real difference shows through. Get the rules right and a listening test is the most trustworthy evidence you can buy. Get them wrong and it is theatre.

The four methods below are the standardised rule sets. They split along two axes. The first axis is rating versus comparison: do you ask listeners to put a number on each clip (MOS, MUSHRA), or to choose or match between clips (A/B, ABX)? The second is how good the audio is: telephone-grade speech, near-transparent music codecs, or something in between. Match the method to both axes and the rest of the test design follows.

Decision panel splitting the four subjective testing methods along two axes. The vertical axis is rating versus comparison; the horizontal axis is the quality range of the audio under test. MOS sits in the rating column for telephone-grade speech; MUSHRA sits in the rating column for intermediate-quality codecs; A/B sits in the comparison column as a preference test; ABX sits in the comparison column as a discrimination test for near-transparent differences. A band across the bottom states the rule: rate to grade, compare to choose or to detect.

Figure 1. The four methods answer four different questions. Pick by what you need to know, not by what is easiest to run.

MOS — the 1-to-5 rating everyone quotes

MOS stands for Mean Opinion Score, and it is the oldest and most quoted number in audio. The method is the one your phone company has used for decades: play a short speech clip to a listener, ask them to rate it on a five-point scale — 5 is excellent, 4 good, 3 fair, 2 poor, 1 bad — collect ratings from many listeners across many clips, and average them. That average is the Mean Opinion Score. The procedure is defined by the International Telecommunication Union — the United Nations body that standardises telecommunications — in its recommendation ITU-T P.800, and the specific rating task above is called Absolute Category Rating, or ACR, because each listener rates each clip against the absolute five-point scale on its own, not against another clip.
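The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not part of P.800: the function name `mos_with_ci`, the system names, and the ratings are all invented for the example. It treats every rating as independent and uses a normal approximation for the 95% interval, which is a simplification when the same listeners rate many clips.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% CI) for ACR ratings.

    Assumption: ratings are independent and the normal approximation holds.
    """
    for r in ratings:
        if r not in (1, 2, 3, 4, 5):
            raise ValueError(f"ACR ratings must be integers 1..5, got {r!r}")
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate the spread")
    mean = statistics.fmean(ratings)
    # Sample standard deviation (n - 1 denominator), then the standard error.
    se = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * se

# Hypothetical ratings, invented for illustration only.
ratings_by_system = {
    "codec_old": [4, 3, 4, 4, 3, 5, 4, 3, 4, 4],
    "codec_new": [4, 4, 5, 4, 4, 5, 4, 3, 5, 4],
}

for name, ratings in ratings_by_system.items():
    mos, half_width = mos_with_ci(ratings)
    print(f"{name}: MOS {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean makes it harder to over-read a small gap between two systems.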

P.800 actually defines three related tasks, and the difference matters because picking the wrong one wastes a panel. ACR is the plain "rate this clip 1 to 5" task; its result is the familiar MOS. DCR — Degradation Category Rating — plays the clean reference first, then the degraded clip, and asks how annoying the degradation is on a five-point scale; its result is a DMOS (Degradation MOS) and it is more sensitive when the impairments are small. CCR — Comparison Category Rating — plays two clips and asks whether the second is better or worse than the first on a seven-point scale from −3 to +3; its result is a CMOS (Comparison MOS) and it is the right tool when you are comparing two systems that are both already good. The headline rule: ACR when you want an absolute quality score, DCR or CCR when the differences are subtle and you need the extra sensitivity of a side-by-side.

MOS has one famous weakness you must respect: it is not an absolute, portable ruler. A MOS of 4.0 from one lab is not the same as a 4.0 from another, because the number drifts with the listener pool, their language, the playback hardware, and even the range of clips in the test — listeners unconsciously stretch their ratings to fill the scale they are given. So a MOS is only ever meaningful within one test, comparing the systems that were in that test against each other. Quoting a bare "our codec scores 4.2 MOS" with no test context is, strictly, meaningless. The number lives or dies by the conditions that produced it.

A newer recommendation, ITU-T P.808, extends MOS testing to crowdsourcing — paying many remote workers small amounts to rate clips on their own devices through a web browser — and it covers the same ACR, DCR, and CCR tasks. Crowdsourcing trades the controlled room for scale and speed: you lose the calibrated headphones and quiet booth, but you gain hundreds of listeners in a day, and ITU-T's own validation work found that a P.808 test, run by its rules, correlates strongly with a lab test. P.808 is how most modern speech-quality panels actually run in 2026, and Microsoft ships an open-source P.808 Toolkit that wires the whole thing to Amazon Mechanical Turk.

MUSHRA — the method for codecs and "good but not perfect" audio

MUSHRA is an acronym that unpacks into its whole method: MUltiple Stimuli with Hidden Reference and Anchor. It is defined by ITU-R BS.1534 — note the ITU-R, the radiocommunication sector, the same sector that owns broadcast audio standards, rather than the ITU-T telecom sector that owns MOS — and the current edition is BS.1534-3, approved in October 2015. MUSHRA is built for intermediate quality: audio that is clearly not perfect but clearly not broken, which is exactly where audio codecs at sensible bitrates live. If you are comparing four Opus or AAC settings that all sound decent, MUSHRA is the method.

The genius of MUSHRA is in its name. The listener sees multiple systems at once — typically up to nine processed versions of the same source — on one screen, each with a play button and a slider scored from 0 to 100, and rates them all together. Two of those sliders are traps. One is the hidden reference: the untouched original, slipped in unlabelled among the test systems. A listener who is paying attention should rate it at or near 100; a listener who scores the hidden reference at 70 was not really listening, and you can detect and exclude them. The other trap is the anchor: a deliberately degraded version of the source — the standard anchor is the original passed through a 3.5 kHz low-pass filter, which sounds muffled like a phone call — placed near the bottom of the scale to give every listener a shared, fixed point for "noticeably bad." The hidden reference pins the top of the scale; the anchor pins the bottom; everything real sits between them.

Because every system is heard against the others and against those two fixed points, MUSHRA results are far more stable and comparable than raw MOS — listeners can switch instantly between versions and hear the difference directly, which is much more sensitive than rating each one in isolation from memory. MUSHRA also bakes in listener screening: BS.1534-3's post-screening rule excludes any listener who scores the hidden reference below 90 on more than 15% of the test items, on the logic that such a listener either cannot hear the differences or is not trying. The recommendation calls for expert listeners, keeps individual test sessions short — under about ten minutes — to avoid fatigue, and recommends roughly 15 or more valid listeners after screening for a reliable result.
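The post-screening rule described above is simple enough to express directly. The sketch below assumes you already have, for each listener, the score they gave the hidden reference on every test item; the function name, listener IDs, and scores are hypothetical, and the thresholds are the ones quoted in this article rather than values read from a toolkit.

```python
def screen_listeners(hidden_ref_scores, min_score=90, max_fail_share=0.15):
    """Split listeners into (kept, excluded) by their hidden-reference scores.

    hidden_ref_scores maps a listener id to the list of 0-100 scores that
    listener gave the hidden reference, one score per test item.
    A listener is excluded when more than max_fail_share of those scores
    fall below min_score.
    """
    kept, excluded = [], []
    for listener, scores in hidden_ref_scores.items():
        if not scores:
            raise ValueError(f"no hidden-reference scores for {listener!r}")
        fails = sum(1 for s in scores if s < min_score)
        if fails / len(scores) > max_fail_share:
            excluded.append(listener)
        else:
            kept.append(listener)
    return kept, excluded

# Hypothetical panel: ten test items per listener, invented for illustration.
panel = {
    "L01": [100, 98, 100, 95, 100, 100, 97, 100, 100, 99],   # never below 90
    "L02": [100, 85, 100, 100, 92, 100, 100, 100, 96, 100],  # 1 of 10 below 90
    "L03": [70, 100, 88, 100, 100, 65, 100, 100, 100, 100],  # 3 of 10 below 90
}

kept, excluded = screen_listeners(panel)
print("kept:", kept)
print("excluded:", excluded)
```

Run the screening before you look at system scores, so the decision to drop a listener cannot be influenced by how they rated the systems you care about.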

Diagram of a single MUSHRA test screen. A reference play button sits at the top labelled "known reference, play freely." Below it a row of seven sliders runs from 0 to 100, each with its own play button and a quality-band scale on the side reading Excellent 80 to 100, Good 60 to 80, Fair 40 to 60, Poor 20 to 40, Bad 0 to 20. One slider is highlighted and labelled "hidden reference, should score near 100"; another is labelled "anchor, 3.5 kHz low-pass, sits near the bottom"; the remaining sliders are labelled "systems under test." A caption explains the hidden reference pins the top of the scale and the anchor pins the bottom.

Figure 2. A MUSHRA screen rates many systems at once on a 0-to-100 scale, with a hidden reference and a low-pass anchor that screen out inattentive listeners and fix both ends of the scale.

A word on MUSHRA's sibling, ITU-R BS.1116-3 (February 2015), because you will meet it whenever the differences are tiny. Where MUSHRA grades intermediate quality, BS.1116 is the method for small impairments — near-transparent codecs where you are asking "can anyone tell this apart from the master at all?" Its method is the double-blind triple-stimulus with hidden reference, sometimes written ABC/HR: the listener hears three stimuli per trial — a known reference plus two unknowns, one of which is a hidden copy of the reference and one the system under test — and grades the impairment on a continuous scale from 5.0 (imperceptible) down to 1.0 (very annoying). The result is a Subjective Difference Grade. The rule of thumb between the two: BS.1116 for near-transparent, audiophile-grade differences; MUSHRA for the broader intermediate range where degradations are audible but acceptable.
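As a small illustration of how a Subjective Difference Grade is usually formed, the sketch below assumes the common definition: the grade given to the system under test minus the grade given to the hidden reference in the same trial. The function name and the example grades are hypothetical.

```python
def subjective_difference_grade(test_grade, hidden_ref_grade):
    """Return the SDG for one trial: test grade minus hidden-reference grade.

    Assumption: both grades are on the continuous 5.0 (imperceptible) to
    1.0 (very annoying) scale. A result of 0 means the listener heard no
    impairment; negative values mean the system under test was graded worse
    than the hidden reference.
    """
    for grade in (test_grade, hidden_ref_grade):
        if not 1.0 <= grade <= 5.0:
            raise ValueError(f"grade must be between 1.0 and 5.0, got {grade!r}")
    return test_grade - hidden_ref_grade

# Hypothetical trial: listener graded the codec 4.2 and the hidden reference 5.0.
print(f"SDG = {subjective_difference_grade(4.2, 5.0):+.1f}")
```

A positive SDG means the listener graded the hidden reference below the system under test, which is a hint that they could not tell the two apart on that trial.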

A/B — the simple preference test

Sometimes you do not need a graded score at all. You have two options — the old dialogue mix and a louder one, codec A and codec B — and the only question is which do people prefer? That is an A/B test: play clip A, play clip B, ask the listener to choose the one they like better. The output is not a quality number; it is a preference share — "68% of listeners preferred the louder mix" — and that is often exactly the business answer you want.

A/B is cheap, fast, and intuitive, which is its strength and its trap. The strength is that anyone can do it and the result is easy to explain to a stakeholder. The trap is that "preference" is not "quality," and a preference test is wide open to bias if you are not careful. Two biases dominate. The first is order bias: listeners tend to prefer whichever clip they heard second (or sometimes first), so you must randomise the order and balance it across listeners. The second is the louder-sounds-better effect, which is so strong it deserves its own warning: of two otherwise identical clips, the louder one wins a preference test almost every time, regardless of quality. This is why any honest A/B test of codecs or mixes must loudness-match the clips first — normalise them to the same integrated loudness so the comparison is about quality, not volume. We cover the measurement that makes this possible in our piece on loudness, peak, RMS, and LUFS.
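One simple way to handle order bias is to counterbalance: shuffle the listeners, then give half of them the clips in A-then-B order and the other half B-then-A. The sketch below is one possible approach, with a hypothetical function name and listener IDs; a fixed seed is used only so the assignment can be reproduced.

```python
import random

def balanced_ab_orders(listener_ids, seed=0):
    """Assign each listener a presentation order, alternating 'AB' and 'BA'.

    Listeners are shuffled first so the order is not tied to sign-up order.
    With an even panel the two orders are used equally often; with an odd
    panel they differ by one.
    """
    rng = random.Random(seed)  # seeded so the plan can be reproduced
    ids = list(listener_ids)
    rng.shuffle(ids)
    return {
        listener: ("AB" if index % 2 == 0 else "BA")
        for index, listener in enumerate(ids)
    }

# Hypothetical panel of 20 listeners.
orders = balanced_ab_orders([f"L{i:02d}" for i in range(1, 21)])
counts = {order: list(orders.values()).count(order) for order in ("AB", "BA")}
print(counts)
```

Record which order each listener actually heard, so you can check afterwards whether the second-played clip won more often than it should have.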

A/B answers "which is preferred." It does not tell you whether the difference is even audible — and that is a different, stricter question with its own method.

Pitfall — forgetting to loudness-match. The single most common way to ruin a listening test is to compare clips at different volumes. A 1 dB difference in loudness is enough to bias a preference test, and people reliably mistake "louder" for "better." Before any A/B, ABX, or MUSHRA test, normalise every clip to the same integrated loudness (LUFS). If you skip this, you are measuring volume, not quality, and your "winner" is just the loudest file. This one mistake invalidates more amateur listening tests than any other.
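The gain arithmetic behind loudness matching is short. The sketch below assumes you have already measured each clip's integrated loudness with a suitable meter; it only computes the correction. The file names, measured values, and the target of -23 LUFS are assumptions chosen for the example, not recommendations from this article.

```python
def gain_to_target(measured_lufs, target_lufs=-23.0):
    """Return (gain in dB, linear multiplier) that moves a clip to the target.

    Assumption: measured_lufs is the clip's integrated loudness from a
    loudness meter. Scaling every sample by the linear multiplier shifts
    the integrated loudness by gain_db.
    """
    gain_db = target_lufs - measured_lufs
    linear = 10 ** (gain_db / 20)
    return gain_db, linear

# Hypothetical measurements, invented for illustration only.
measured = {"mix_old.wav": -20.0, "mix_new.wav": -18.5}

for name, lufs in measured.items():
    gain_db, linear = gain_to_target(lufs)
    print(f"{name}: apply {gain_db:+.1f} dB (multiply samples by {linear:.3f})")
```

After applying a positive gain, check the peaks again: a clip that clips after normalisation has picked up a new audible difference of its own.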

ABX — the strict test of "can anyone even hear it?"

ABX is the most rigorous of the four, and the only one built to show that a difference is audibly real rather than imagined. In each trial the listener is given three clips: A (one known sample), B (the other known sample), and X (which is secretly either A or B, chosen at random). The listener may replay all three as often as they like, and must answer one question: is X the same as A or the same as B? They are forced to identify the unknown by ear alone.

Why this is powerful comes down to chance. If a listener genuinely cannot hear any difference between A and B, then X is unidentifiable and they are reduced to guessing, and a guess is right 50% of the time. So if a listener (or a panel) identifies X correctly far more than half the time across many trials, the difference must be audible; if they hover around 50%, the test has found no evidence that it is. ABX converts a vague question, "is this difference real?", into a coin-flip problem you can settle with statistics. It is the standard way to debunk audiophile claims ("can you really hear the difference between this 320 kbps MP3 and the lossless master?") and the right tool whenever you need to confirm that a change is perceptible at all before you spend effort grading how much better it is.

Here is the arithmetic that makes an ABX result mean something. Suppose one listener runs 16 ABX trials and gets 12 correct. Is that proof they can hear the difference, or could a pure guesser have done that well by luck? You test it against the null hypothesis "they were guessing," where each trial is a 50/50 coin flip. The probability of getting at least 12 of 16 correct by pure chance is the sum of the binomial probabilities for 12, 13, 14, 15, and 16 hits:

P(≥12 of 16, p = 0.5) = (1820 + 560 + 120 + 16 + 1) ÷ 65,536 = 2,517 ÷ 65,536 ≈ 0.038

That is below the conventional 0.05 threshold, so you reject "they were guessing" — 12 of 16 is statistically significant evidence the difference is audible. Now contrast it: 10 of 16 correct gives P(≥10) = 0.227, well above 0.05, which proves nothing — a guesser clears that bar more than one time in five. The lesson is exact: a majority is not proof. You need enough trials and a high enough hit rate that pure luck becomes implausible, and only the binomial test tells you where that line is.
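The same arithmetic in Python, using only the standard library. The function names are ours; the test is the one-sided binomial tail against a 50% guessing rate described above.

```python
from math import comb

def abx_p_value(correct, trials):
    """One-sided P(at least `correct` hits in `trials` fair coin flips)."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Count the outcomes with at least `correct` hits, out of 2**trials.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

def min_correct_for_significance(trials, alpha=0.05):
    """Smallest hit count whose one-sided p-value is below alpha, or None."""
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) < alpha:
            return correct
    return None

print(f"12 of 16: p = {abx_p_value(12, 16):.3f}")  # 0.038
print(f"10 of 16: p = {abx_p_value(10, 16):.3f}")  # 0.227
print("hits needed out of 16:", min_correct_for_significance(16))  # 12
```

Decide the number of trials before the session starts. Stopping as soon as the running score happens to look significant inflates the false-positive rate.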

Diagram of an ABX trial and its statistical reasoning. The left half shows the trial mechanics: three labelled play buttons, A (known), B (known), and X (secretly A or B), with the listener question "is X the same as A or B?" The right half shows a bar chart of the binomial result for 16 trials, with the 50 percent guessing line marked, a bar at 10 of 16 labelled "p = 0.23, not significant" in the warning colour, and a bar at 12 of 16 labelled "p = 0.038, significant" in the positive colour. A caption states that a majority is not proof; only the binomial p-value decides.

Figure 3. ABX turns "can you hear it?" into a coin-flip problem. Twelve of sixteen correct is significant; ten of sixteen is not. The binomial test, not the majority, decides.

How many listeners do you actually need?

The most common question — and the one most often answered with a guess — is how many people to recruit. There is no single number, because it depends on the method, the size of the effect you are chasing, and the noise in your panel. But there are defensible defaults, and the logic behind them is worth understanding so you can defend your choice.

The driving idea is the standard error of the mean: the noise on your average score shrinks with the square root of the number of listeners. So the average rating from 24 listeners is twice as precise as the average from 6, not four times as precise. Doubling precision costs you four times the people, which is why listening panels hit diminishing returns: the jump from 5 to 15 listeners buys you a lot; the jump from 30 to 90 buys you much less in absolute terms, even though both triple the panel. The four methods land in different places:
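A quick way to see the square-root rule is to tabulate the standard error for a few panel sizes. The spread of 10 points used here is an assumed value on a 0-to-100 scale, picked only to make the numbers easy to read.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean for n listeners whose ratings have spread sd."""
    if n < 1:
        raise ValueError("need at least one listener")
    return sd / math.sqrt(n)

ASSUMED_SD = 10.0  # hypothetical listener-to-listener spread, for illustration

for n in (6, 24, 96):
    se = standard_error(ASSUMED_SD, n)
    print(f"{n:3d} listeners -> standard error {se:.2f}")
```

Each step in that loop quadruples the panel and only halves the standard error.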

Method | Question it answers | Typical valid listeners | Statistical test | Output
MOS (P.800 ACR) | Absolute quality, 1–5 | 24+ (lab) / 100s (P.808 crowd) | Mean + confidence interval; ANOVA across systems | Mean Opinion Score
MUSHRA (BS.1534-3) | Intermediate codec quality, 0–100 | ~15+ after screening | Mean + CI; repeated-measures ANOVA | 0–100 score with hidden ref / anchor
A/B preference | Which is preferred | 20–40+ | Two-sided binomial / sign test on the split | Preference share (%)
ABX discrimination | Is the difference audible at all | 1 trained listener can suffice with enough trials; pool for population claims | Binomial test vs 50% | p-value: audible or not

Read the table as a set of rules of thumb, not laws. For MOS, a controlled lab test typically wants 24 or more listeners; a P.808 crowdsourced test happily uses hundreds, which is part of why crowdsourcing won. For MUSHRA, BS.1534-3 points at roughly 15 or more valid listeners after the hidden-reference screening has thrown out the inattentive ones — so recruit more than 15 to land above 15. For A/B, you need enough people that the preference split clears the binomial noise: 20–40 is a sensible floor, more if the split is close to 50/50. For ABX, the structure is different — a single trained listener running many trials can establish that they can hear a difference, but to claim something about the general population you pool listeners and trials. The unifying discipline: decide the test and the listener count before you collect data, not after you peek at it.
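To see why panel size matters for A/B, it helps to run the two-sided binomial test on the same preference share at two panel sizes. The sketch below is a plain exact test against a 50/50 null; the function name and the listener counts are illustrative, not results from a real panel.

```python
from math import comb

def preference_p_value(prefer_a, listeners):
    """Two-sided exact binomial p-value against a 50/50 preference split."""
    if not 0 <= prefer_a <= listeners:
        raise ValueError("prefer_a must be between 0 and listeners")
    # Under a 50/50 null the distribution is symmetric, so the two-sided
    # p-value is twice the tail beyond the larger of the two shares.
    extreme = max(prefer_a, listeners - prefer_a)
    tail = sum(comb(listeners, k) for k in range(extreme, listeners + 1))
    return min(1.0, 2 * tail / 2 ** listeners)

# Hypothetical outcomes with roughly the same 70% share at two panel sizes.
for prefer_a, listeners in ((14, 20), (27, 40)):
    p = preference_p_value(prefer_a, listeners)
    share = 100 * prefer_a / listeners
    print(f"{prefer_a} of {listeners} ({share:.0f}%): p = {p:.3f}")
```

With these invented counts, 14 of 20 does not clear the 0.05 threshold while 27 of 40 does, even though the preference share is nearly identical. That is the binomial noise the listener counts above are meant to overcome.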

A worked example: reading a MUSHRA result

Numbers on a quality scale only mean something once you have run them through the statistics. Suppose you test four codec settings for a streaming product against the same set of music clips in a MUSHRA test, recruit 20 listeners, and after the hidden-reference screening 17 remain valid. You average each system's 0-to-100 scores across the 17 listeners and the clips and, for the two settings you are choosing between plus the two control conditions, get:

  • Hidden reference: 98 (sanity check — near 100, as it must be)
  • System A — high bitrate: mean 82, standard deviation 9 across listeners
  • System B — medium bitrate: mean 74, standard deviation 11
  • Anchor (3.5 kHz low-pass): 22 (sanity check — near the bottom, as it must be)

Do A and B really differ, or is the 8-point gap inside the noise? Compute the standard error of each mean — the standard deviation divided by the square root of the listener count:

standard error (A) = 9 ÷ √17 = 9 ÷ 4.12 = 2.18
standard error (B) = 11 ÷ √17 = 11 ÷ 4.12 = 2.67

A 95% confidence interval is roughly two standard errors either side of the mean, so System A is 82 ± 4.4 (about 77.6 to 86.4) and System B is 74 ± 5.3 (about 68.7 to 79.3). Those intervals barely overlap, which signals a likely-real difference — but "barely overlapping" is exactly the case where you stop eyeballing and run the proper test (a repeated-measures ANOVA, which accounts for the fact that the same listeners rated every system). The hidden reference at 98 and the anchor at 22 confirm the panel was attentive and used the full scale, so you trust the data. The means give you the picture; the confidence intervals and the ANOVA tell you whether the picture is real. A result reported without them — "A scored 82, B scored 74, A wins" — is a guess wearing a number's clothes.
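The interval arithmetic from this worked example can be reproduced in a few lines. The sketch uses the same "two standard errors" approximation as the text and the means, standard deviations, and listener count given above; the function name is ours. It is not a substitute for the repeated-measures ANOVA, which needs the per-listener scores.

```python
import math

def approx_ci(mean, sd, n, multiplier=2.0):
    """Return (standard error, lower bound, upper bound) of an approximate 95% CI.

    Assumption: roughly two standard errors either side of the mean.
    """
    se = sd / math.sqrt(n)
    half_width = multiplier * se
    return se, mean - half_width, mean + half_width

VALID_LISTENERS = 17
systems = {"A": (82, 9), "B": (74, 11)}  # (mean, standard deviation) from the text

intervals = {}
for name, (mean, sd) in systems.items():
    se, low, high = approx_ci(mean, sd, VALID_LISTENERS)
    intervals[name] = (low, high)
    print(f"System {name}: SE {se:.2f}, interval {low:.1f} to {high:.1f}")

# Overlap between the top of B's interval and the bottom of A's interval.
overlap = intervals["B"][1] - intervals["A"][0]
print(f"Overlap: {overlap:.1f} points")
```

A small positive overlap like this one is the signal to move on to the paired test rather than to declare a winner from the chart.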

Subjective and objective testing work together

Subjective testing is the ground truth, but it is slow and costs real money and listener time, so you cannot run it on every code change. Objective metrics — covered fully in our piece on PESQ, POLQA, and ViSQOL — fill the gap by predicting a MOS automatically, fast enough for a build pipeline. The two are not rivals; they are a relay. You run a subjective test a few times a year to establish the truth and to calibrate your objective metric — to confirm it still tracks human opinion for your specific content. Then the objective metric watches every build in between, and you trust its green checkmark precisely because a human panel confirmed it agrees with people.
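One common way to check that a metric still tracks human opinion is to correlate its predictions with the panel's scores, condition by condition. The sketch below computes a Pearson correlation by hand; the function name and both sets of scores are hypothetical, and Pearson correlation is only one of several agreement measures you might choose.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-condition scores, invented for illustration only.
panel_mos = [1.8, 2.6, 3.1, 3.9, 4.4]      # from a subjective test
predicted_mos = [2.0, 2.5, 3.3, 3.7, 4.5]  # from an objective metric

print(f"Pearson r = {pearson_r(panel_mos, predicted_mos):.3f}")
```

A high correlation on last year's content says nothing about a new codec or denoiser, so rerun the check whenever the processing chain changes in kind.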

This relationship has a failure mode worth naming. Objective metrics drift away from human opinion exactly when your pipeline does something the metric was never trained on: a novel neural codec, an aggressive denoiser, a generative-audio feature. That drift is invisible until you run the human test that catches it. So the rule is simple: when the metric and the humans disagree, the humans win, and the metric needs recalibrating. Modern no-reference metrics like DNSMOS P.835 and NISQA were themselves trained on large sets of human listening-test ratings, much of it gathered by crowdsourcing in the P.808 style, which is the same principle written into software: every automated quality number traces back, eventually, to a room (or a crowd) of people who listened.

Tools you can actually use

You do not need a broadcast laboratory to run a defensible listening test in 2026. Several open frameworks implement these methods correctly out of the box. webMUSHRA, from Fraunhofer IIS and the International Audio Laboratories Erlangen, is a browser-based framework that implements BS.1534-compliant MUSHRA tests (plus paired comparison and other methods) and is the de facto standard for academic codec evaluation. BeaqleJS is an open HTML5/JavaScript framework that runs both ABX and MUSHRA tests in any modern browser. For speech-quality MOS at crowdsourcing scale, Microsoft's P.808 Toolkit implements the full ACR/DCR/CCR set against Amazon Mechanical Turk. The discipline still matters more than the tool: loudness-match your clips, randomise order, screen your listeners, fix your method and sample size before you start, and run the right statistical test at the end. A tool gives you the screen; the rigour is on you.

Where Fora Soft fits in

We build video conferencing, telemedicine, e-learning, OTT streaming, and surveillance products, and in every one of them the question "does this actually sound better?" eventually lands on a human ear. We use subjective testing the way this article describes: A/B and ABX tests when we need to know whether a change is even perceptible, MUSHRA when we are choosing between codec settings for a streaming product, and crowdsourced MOS to keep our automated quality gates honest. A clinician needs to hear a patient clearly and a student needs every word of a lecture, so for those products the listening test is not an academic exercise — it is the last check before a quality decision ships. These tests sit alongside the rest of the audio toolkit; see the WebRTC audio pipeline end-to-end and the production audio-problem runbook. The method is only as trustworthy as the screening and the statistics behind it.

References

  1. ITU-T Recommendation P.800, Methods for subjective determination of transmission quality (1996) — defines MOS and the Absolute Category Rating (ACR), Degradation Category Rating (DCR), and Comparison Category Rating (CCR) listening tasks. ITU-T. https://www.itu.int/rec/T-REC-P.800 (Standards-body primary source for MOS, DMOS, CMOS.)
  2. ITU-T Recommendation P.808, Subjective evaluation of speech quality with a crowdsourcing approach (2021) — extends P.800 ACR/DCR/CCR to crowdsourced testing; validated to correlate with lab tests. ITU-T. https://www.itu.int/rec/T-REC-P.808 (Standards-body primary source for crowdsourced MOS.)
  3. ITU-R Recommendation BS.1534-3, Method for the subjective assessment of intermediate quality level of audio systems (MUSHRA) (October 2015) — multiple stimuli with hidden reference and anchor; 0–100 scale; the 15%/score-below-90 post-screening rule; expert listeners; short sessions. ITU-R. https://www.itu.int/rec/R-REC-BS.1534 (Standards-body primary source for MUSHRA; current edition verified against the ITU-R catalogue as of 2026-06-07.)
  4. ITU-R Recommendation BS.1116-3, Methods for the subjective assessment of small impairments in audio systems (February 2015) — double-blind triple-stimulus with hidden reference (ABC/HR); continuous 5.0–1.0 grading scale; Subjective Difference Grade. ITU-R. https://www.itu.int/rec/R-REC-BS.1116 (Standards-body primary source for small-impairment / near-transparent testing.)
  5. ITU-T Recommendation P.835, Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm (2003) — the SIG/BAK/OVRL three-scale method underlying DNSMOS. ITU-T. https://www.itu.int/rec/T-REC-P.835 (Standards-body primary source; cited for the no-reference-metric calibration claim.)
  6. M. Schoeffler, S. Bartoschek, F.-R. Stöter, M. Roess, S. Westphal, B. Edler, J. Herre, webMUSHRA — A Comprehensive Framework for Web-based Listening Tests, Journal of Open Research Software, 6(1):8, 2018. https://openresearchsoftware.metajnl.com/articles/10.5334/jors.187 (Primary source from the tool's authors; BS.1534 compliance and supported methods.)
  7. S. Kraft, U. Zölzer, BeaqleJS: HTML5 and JavaScript based framework for the subjective evaluation of audio quality, Linux Audio Conference 2014. https://github.com/HSU-ANT/beaqlejs (Source for the open ABX + MUSHRA browser framework.)
  8. B. Naderi, R. Cutler, An Open Source Implementation of ITU-T Recommendation P.808 with Validation, INTERSPEECH 2020. arXiv:2005.08138. https://arxiv.org/abs/2005.08138 (Source for the P.808 crowdsourcing toolkit, ACR/DCR/CCR implementation, and lab-correlation validation.)
  9. C. K. A. Reddy, V. Gopal, R. Cutler, DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors, ICASSP 2022. arXiv:2110.01763. https://arxiv.org/abs/2110.01763 (Source for the claim that no-reference metrics are trained on crowdsourced P.808/P.835 listening data.)
  10. ITU-T Recommendation P.810, Modulated noise reference unit (MNRU) (1996) — the reference degradation underlying anchor design in speech testing. ITU-T. https://www.itu.int/rec/T-REC-P.810 (Standards-body source for anchor / reference-degradation methodology.)

Living standards note: ITU-R and ITU-T recommendations are revised on their own schedules. The current editions cited above — BS.1534-3 (2015), BS.1116-3 (2015), P.800 (1996), P.808 (2021) — were verified against the ITU catalogue as of 2026-06-07. Where a future revision changes a screening threshold or scale, update §MUSHRA and the listener-count table accordingly.