Crowdsourced Video Quality Testing: Speed vs Control

Why this matters

A subjective test is the ground truth every quality metric is validated against, but the classic lab version is slow, expensive, and capped at a small, homogeneous panel — which is why most teams quietly skip it and trust a metric instead. Crowdsourcing removes those barriers: you can have a few hundred ratings back in a day, from viewers on the phones and laptops your audience actually uses, for the price of a small ad campaign. This article is for the streaming, encoding, or QA engineer who needs a real human verdict on a quality change but cannot wait three weeks for a lab — and who has heard, correctly, that crowdsourced data is noisy and easy to game. The job here is to show you how to get a result you can defend: when the crowd is good enough, how to make it reliable, and when the honest answer is still "use a lab." It is the speed-versus-control half of the subjective-testing block — the design, the run, and the statistics all still apply, but the room is now the whole internet. The encoder-operator's one-screen version of subjective testing lives in the Video Encoding section's subjective-testing overview; this article is the crowdsourcing deep dive it points to.

What "crowdsourced" actually changes

Start with the thing a subjective test is supposed to produce. A subjective test collects opinion scores from human viewers and averages them into a Mean Opinion Score — written MOS, the per-clip average of the panel's ratings — which stands in for "how good does this actually look to people." The classic way to collect those scores is a controlled lab: a calibrated display, fixed lighting, a measured viewing distance, and a screened panel of viewers, all specified by ITU-R BT.500-15 (05/2023) for television images and ITU-T P.910 (10/2023) for multimedia. The lab's whole value is that it removes every variable except the video, so a difference in scores can only come from the video.
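
As a small illustration of the averaging step, the sketch below turns raw ratings into a per-clip MOS. It assumes a 1-to-5 rating scale; the clip names, the ratings, and the helper `mean_opinion_scores` are made up for the example and are not part of any standard.

```python
from collections import defaultdict
from statistics import mean

def mean_opinion_scores(ratings):
    """Average (clip_id, score) pairs into a per-clip MOS."""
    by_clip = defaultdict(list)
    for clip_id, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
        by_clip[clip_id].append(score)
    return {clip_id: mean(scores) for clip_id, scores in by_clip.items()}

# Made-up ratings from four viewers on two clips.
ratings = [
    ("clip_a", 4), ("clip_a", 5), ("clip_a", 4), ("clip_a", 3),
    ("clip_b", 2), ("clip_b", 1), ("clip_b", 2), ("clip_b", 3),
]
mos = mean_opinion_scores(ratings)
print(mos)
```

A real pipeline would apply the screening described later before averaging, so that only valid subjects contribute to each clip's MOS.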

Crowdsourced subjective testing — running the same rating task on an online crowdsourcing platform, where anonymous paid workers view clips on their own hardware in their own homes and submit ratings remotely — keeps the rating task and throws away the room. Nobody calibrates the display. Nobody measures the distance to the screen or dims the lights. The panel is whoever signed up, on whatever device they own, with whatever else is happening in the room. In exchange you get four things the lab cannot give you: speed (results in hours, not weeks), scale (hundreds of viewers instead of two dozen), low cost (cents to a few dollars per worker), and diversity (real devices, real living rooms, real distraction — the conditions your product ships into).

That is the entire trade, and it is worth naming plainly: you swap control of the viewing conditions for speed, scale, cost, and realism. Neither side is free. The lab buys you a clean signal at the price of time and a small, artificial panel. The crowd buys you a big, realistic panel fast and cheap, at the price of a noisy signal you must clean up yourself. The rest of this article is about that cleanup, because a crowd test without it is not a faster subjective test — it is a pile of unreliable clicks.

[Figure 1 image: a trade-off axis from controlled lab to crowdsourcing, showing what each end buys and what it costs.]
Figure 1. The trade in one line. Moving from a controlled lab (BT.500 / P.910 controlled) toward crowdsourcing (P.808 framework / P.910 uncontrolled) buys speed, scale, cost, and device realism, and spends the control of viewing conditions that makes a lab score clean.

Where crowdsourcing lives in the standards

Crowdsourced testing is not a workaround that lives outside the standards — it has been written into them, and knowing which document owns what keeps your method defensible.

The canonical crowdsourcing recommendation is ITU-T P.808 (06/2021), titled "Subjective evaluation of speech quality with a crowdsourcing approach." Note the word speech: P.808 standardizes crowdsourced testing for telephony and audio, using the Absolute Category Rating method — ACR, where each clip is shown once and rated on a 1-to-5 scale with no reference alongside it. P.808 is the source of the crowdsourcing machinery the whole field uses — the worker qualification job, the gold-standard and trapping questions, and the multi-stage data screening — but its subject matter is audio, not video. Treat P.808 as the framework, not as a video standard.

For video, the controlling document is now ITU-T P.910 (10/2023). Until recently, the "rate video in any environment, including at home" methods lived in a separate recommendation, ITU-T P.913, whose title — "Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment" — was built around exactly the uncontrolled setting crowdsourcing needs. That recommendation was deleted on 2 February 2024 after its content was incorporated into ITU-T P.910. So as of 2024, P.910 is the single video document that covers both the controlled lab and the uncontrolled, any-environment case the crowd represents. If you see an older method citing P.913, it has not disappeared — it has moved into P.910.

P.910 also carries the number that governs a crowd test's size. For a controlled environment it requires at least 24 valid subjects per stimulus; for an uncontrolled environment it requires at least 35 (clause 10.1, read directly from the P.910 10/2023 text). The extra subjects are the standard's own admission that an uncontrolled panel is noisier, and that you buy back some of the lost precision with more people. We work the consequences of that 35 below.

Finally, the bridge from speech framework to video practice already exists as running code. Babak Naderi and Ross Cutler at Microsoft published an open-source crowdsourced implementation of ITU-T P.910 in 2022 (the microsoft/P.910 project), implementing ACR, ACR with a hidden reference (ACR-HR), Degradation Category Rating (DCR), and Comparison Category Rating (CCR), complete with rater, environment, hardware, and network qualifications plus gold and trapping questions — the P.808 machinery ported to video. They validated it as accurate and highly reproducible against existing P.910 lab studies, which is the strongest single piece of evidence that a properly run crowd test is not a downgrade for many tasks.

Rebuilding the control you gave up

Because the crowd removes the room, every guardrail the room provided has to be rebuilt as a step in the task. There are four, and they run in order: qualify the worker, plant questions with known answers, screen the data, and design the whole thing as two stages instead of one.

Qualification. Before a worker rates a single real clip, the task checks that they can. The P.808 framework calls this the qualification job, and the video implementation extends it to four checks: the rater (can they follow instructions and use the scale, often probed with a short training and a comprehension check), the environment (is the viewing setting sane), the hardware (screen size, resolution, and color rendering — a phone in portrait orientation cannot judge a 4K clip), and the network (enough bandwidth to stream the test clips without the player itself stalling and contaminating the rating). A worker who fails qualification never enters the real test, which is cheaper than removing their data afterward.
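
A minimal sketch of such a qualification gate follows. The four checks mirror the ones named above, but the `WorkerProbe` fields and every threshold are hypothetical placeholders for illustration; the real limits depend on your clips and are not taken from P.808 or the open implementation.

```python
from dataclasses import dataclass

@dataclass
class WorkerProbe:
    passed_comprehension: bool  # rater check: training plus comprehension quiz
    environment_ok: bool        # environment check: viewing setting looks sane
    screen_width_px: int        # hardware check
    screen_height_px: int
    bandwidth_mbps: float       # network check

# Hypothetical thresholds, for illustration only.
MIN_WIDTH_PX = 1280
MIN_HEIGHT_PX = 720
MIN_BANDWIDTH_MBPS = 10.0

def qualification_failures(probe):
    """Return the failed checks; an empty list means the worker qualifies."""
    failures = []
    if not probe.passed_comprehension:
        failures.append("rater")
    if not probe.environment_ok:
        failures.append("environment")
    if probe.screen_width_px < MIN_WIDTH_PX or probe.screen_height_px < MIN_HEIGHT_PX:
        failures.append("hardware")
    if probe.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        failures.append("network")
    return failures

laptop = WorkerProbe(True, True, 1920, 1080, 25.0)
portrait_phone = WorkerProbe(True, True, 720, 1280, 25.0)
print(qualification_failures(laptop))
print(qualification_failures(portrait_phone))
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easier to see which gate is rejecting the most workers when you tune the qualification job.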

Gold-standard questions. A gold-standard trial is a clip whose correct rating is known in advance — a pristine, untouched reference that any honest viewer rates near the top of the scale, or a deliberately wrecked clip that any honest viewer rates near the bottom. Scatter a few through the session. A worker who rates the pristine clip a 2, or the wrecked clip a 5, is either not watching or not understanding, and their whole submission becomes suspect. Gold questions catch the viewer whose judgment is wrong.

Trapping questions. A trapping question — sometimes called an attention or "honeypot" trial — is different: it carries an explicit instruction inside the content, such as a clip that says on screen "to show you are paying attention, select 2 for this video." There is no quality judgment involved; the only way to answer correctly is to be watching and reading. Trapping questions catch the bot and the click-through speeder who is paging through trials without looking. Gold catches wrong judgment; trapping catches absent attention. You want both. The open implementations interleave roughly one gold and one trapping item per ten or so real trials.
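
The difference between the two checks can be sketched in a few lines. This is an illustrative helper, not code from the open implementations: the trial IDs are invented, and the one-point `gold_tolerance` is an assumed reading of "near the top" and "near the bottom" of the scale.

```python
def check_known_answer_trials(answers, gold, traps, gold_tolerance=1):
    """Return (gold_ok, trap_ok) for one worker's session.

    answers: dict of trial_id -> submitted score (1-5)
    gold:    dict of trial_id -> expected score for a known-quality clip
    traps:   dict of trial_id -> score the on-screen instruction asked for
    gold_tolerance: assumed allowance, because an honest viewer rates
                    near the expected end of the scale, not exactly on it
    """
    gold_ok = all(
        trial in answers and abs(answers[trial] - expected) <= gold_tolerance
        for trial, expected in gold.items()
    )
    # A trapping answer must match exactly: the instruction names one score.
    trap_ok = all(answers.get(trial) == expected for trial, expected in traps.items())
    return gold_ok, trap_ok

gold = {"g_pristine": 5, "g_wrecked": 1}
traps = {"t_select_2": 2}

careful = {"c1": 4, "g_pristine": 5, "c2": 3, "t_select_2": 2, "g_wrecked": 2}
misjudging = {"c1": 4, "g_pristine": 2, "c2": 3, "t_select_2": 2, "g_wrecked": 1}
speeder = {"c1": 3, "g_pristine": 3, "c2": 3, "t_select_2": 3, "g_wrecked": 3}

for name, session in [("careful", careful), ("misjudging", misjudging), ("speeder", speeder)]:
    print(name, check_known_answer_trials(session, gold, traps))
```

The `misjudging` session passes the trap but fails gold, which is the case a trapping question alone would miss.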

Data screening. After the session, the same statistical screening a lab uses still applies, and it does more work here. The correlation method of P.910 keeps each subject only if their ratings track the panel's MOS closely enough; outlier rules flag votes that sit far outside the panel's spread; and crowd-specific checks reject sessions that finished impossibly fast, that answered every trial with the same score, or — a newer threat — that came over a remote-desktop connection. A 2025 study on reliable participation across platforms documented workers exploiting video metadata and routing through remote-desktop links that silently re-encode the video before it reaches their eyes, and proposed detectors for it; the takeaway is that the screening list is not static, and a crowd vendor should be able to tell you what they check.
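
The sketch below shows what three of these checks might look like for a single session: a correlation check against the panel MOS, a speed check, and a same-score check. The 0.75 correlation threshold and the "finished faster than the clips could play" rule are placeholder assumptions, not values quoted from P.910; take the real limits from the standard text and your own pilot data. Remote-desktop detection is out of scope here.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson linear correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def screen_session(scores, panel_mos, duration_s, total_clip_s, min_corr=0.75):
    """Return the reasons to reject a session; an empty list means keep it.

    scores and panel_mos are aligned per clip. min_corr is an assumed threshold.
    """
    reasons = []
    if duration_s < total_clip_s:
        reasons.append("too_fast")  # finished before the clips could have played
    if len(set(scores)) == 1:
        reasons.append("same_score_every_trial")
    elif pearson(scores, panel_mos) < min_corr:
        reasons.append("low_correlation")
    return reasons

panel_mos = [4.5, 1.5, 3.0, 4.0, 2.0]  # made-up panel MOS for five 10-second clips
print(screen_session([5, 2, 3, 4, 2], panel_mos, duration_s=120, total_clip_s=50))
print(screen_session([2, 4, 5, 1, 3], panel_mos, duration_s=20, total_clip_s=50))
print(screen_session([3, 3, 3, 3, 3], panel_mos, duration_s=120, total_clip_s=50))
```

In a fuller version the panel MOS for each subject would be computed without that subject's own votes, so a worker is not compared against themselves.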

Two stages, not one. The best-practices paper that shaped the field — Hoßfeld and colleagues' "Best Practices for QoE Crowdtesting" (IEEE Transactions on Multimedia, 2014) — showed that splitting the work into a reliability/qualification stage followed by a separate rating stage yields markedly more reliable results than a single combined job. Qualify first, rate second; do not let an unscreened crowd anywhere near your real stimuli.

[Figure 2 image: a funnel from recruited workers through qualification, gold and trapping checks, and data screening to a set of valid ratings.]
Figure 2. The reliability pipeline. Recruited workers pass through qualification, then a rating job seeded with gold and trapping trials, then data screening; the orange band is the discarded fraction, which is why you recruit many more than the 35 valid subjects you need to keep.

Dimension	Controlled lab (BT.500-15 / P.910 controlled)	Crowdsourced (P.808 framework / P.910 uncontrolled)	Where the crowd falls short
Viewing conditions	Calibrated display, fixed light, measured distance	Whatever the worker owns and wherever they sit	No control; conditions are a hidden, uneven variable
Panel	Small, screened, often homogeneous (≥24 valid)	Large, diverse, self-selected (≥35 valid)	Diversity is realistic but adds noise to every score
Speed	Days to weeks (scheduling, sessions)	Hours to a day	Fast — but rushing screening to keep it fast reintroduces noise
Cost	High (lab time, supervised sessions)	Low (cents to a few dollars per worker)	Cheap per rating, but rejected submissions are paid-for waste
Reliability control	Physical (the room enforces it)	Software (qualification, gold, trapping, screening)	Only as good as the guardrails you built; gaming evolves
Best for	Small near-threshold differences, reference-display content	Large differences, rankings, device-realistic QoE, scale	Not for absolute lab-grade MOS or color/HDR-critical judgments

The two known-answer trials in that pipeline do different jobs, and a crowd test needs both. A gold-standard question is a clip whose correct rating is known in advance; a trapping question is a clip carrying an in-content instruction. One polices judgment, the other polices attention.

[Figure 3 image: two cards contrasting gold-standard questions and trapping questions, and the failure each one catches.]
Figure 3. Two known-answer trials. A gold-standard question (a clip whose correct rating is known: pristine near 5, wrecked near 1) catches wrong judgment; a trapping question (an in-content instruction such as "select 2") catches absent attention and bots. Embed roughly one of each per ten real trials and screen on both.

How many workers — and why you recruit more than you need

The standard says at least 35 valid subjects must rate each stimulus in an uncontrolled test. The word that costs money is valid. Qualification, gold, trapping, and screening all discard people, so the number you recruit is the number you need after the wash, grossed up by the rejection rate.

Write the rejection rate as a fraction — call it r. If you expect to throw out 40% of submissions (a realistic figure for an open crowd before tightening), then only 60% survive, and the number to recruit to net 35 valid is:

recruit = valid_needed / (1 − r)
        = 35 / (1 − 0.40)
        = 35 / 0.60
        = 58.3  →  recruit 59

The relationship is unforgiving at the top end. At a 30% reject rate you recruit 50; at 40%, 59; at 50%, you recruit 70 to keep 35. Halving your reject rate — by qualifying harder up front, so fewer bad sessions reach the rating stage — is worth more than almost any other tuning, because you pay for every recruited worker, valid or not.
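
The gross-up rule is small enough to write as a function. This is a sketch under the article's own assumptions (a single overall reject rate, rounded up to whole workers); `recruits_needed` is an illustrative name. It uses exact fractions so that a result such as 35 / 0.70 is not nudged above 50 by floating-point rounding before the ceiling is taken.

```python
import math
from fractions import Fraction

def recruits_needed(valid_needed, reject_rate):
    """Workers to recruit so that valid_needed survive a given reject rate."""
    rate = Fraction(str(reject_rate))  # exact decimal, e.g. "0.4" -> 2/5
    if not 0 <= rate < 1:
        raise ValueError("reject_rate must be in [0, 1)")
    return math.ceil(valid_needed / (1 - rate))

for rate in (0.20, 0.30, 0.40, 0.50):
    print(f"reject {rate:.0%}: recruit {recruits_needed(35, rate)}")
```

Because the result is rounded up, the surviving count is at least the target whenever the actual reject rate is no worse than the one you planned for.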

Now scale it to a real study. Suppose you have 200 clips, each of which needs at least 35 valid ratings, and each worker rates 50 clips in a sitting. The total valid ratings you need is 35 × 200 = 7,000. At 50 valid ratings per session that is 7,000 / 50 = 140 valid sessions. Grossed up for a 40% reject rate, you recruit 140 / 0.60 = 233.3 → 234 sessions. And each session is not 50 trials but about 60, once you interleave roughly five gold and five trapping items to police it. None of this is exotic; it is just the bookkeeping that separates a crowd test that returns 35 solid votes per clip from one that returns 20 and a wide confidence interval.
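
The same bookkeeping, written out as a sketch. `plan_study` and its parameter names are illustrative, and it makes one simplifying assumption worth stating: clips are spread evenly across sessions, so that hitting the total number of valid ratings also gives every clip its 35.

```python
import math
from fractions import Fraction

def plan_study(clips, valid_per_clip, clips_per_session, reject_rate,
               gold_per_session, traps_per_session):
    """Size a crowd study; assumes clips are balanced evenly across sessions."""
    valid_ratings = clips * valid_per_clip
    valid_sessions = math.ceil(Fraction(valid_ratings, clips_per_session))
    # Gross up for rejected sessions, using exact fractions to avoid float drift.
    recruited = math.ceil(valid_sessions / (1 - Fraction(str(reject_rate))))
    return {
        "valid_ratings": valid_ratings,
        "valid_sessions": valid_sessions,
        "recruited_sessions": recruited,
        "trials_per_session": clips_per_session + gold_per_session + traps_per_session,
    }

plan = plan_study(clips=200, valid_per_clip=35, clips_per_session=50,
                  reject_rate=0.40, gold_per_session=5, traps_per_session=5)
print(plan)
```

After the run, the number to verify is still the surviving count per clip, since an uneven assignment or an uneven pattern of rejections can leave some clips short even when the total looks right.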

That last phrase is the link back to the statistics: the width of a MOS confidence interval scales with one over the square root of the valid panel size, so under-recruiting does not just shrink your sample, it widens every error bar and can push a real difference back inside the noise. The standard's 35 is a floor for a reason. The companion planner below does this arithmetic (recruit count, gold and trapping budget, and a rough time-and-cost estimate) from your own clip count and reject rate.
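
To see the square-root effect in numbers, the sketch below computes the half-width of a 95% confidence interval for a MOS at a few panel sizes. Two things here are assumptions rather than facts from the standard: the per-rating standard deviation of 1.0 on the 5-point scale, and the normal-approximation multiplier 1.96 (a Student-t quantile would be slightly wider for small panels).

```python
from math import sqrt

Z_95 = 1.96       # normal approximation for a 95% interval
ASSUMED_SD = 1.0  # hypothetical per-rating standard deviation on a 5-point scale

def ci_half_width(std_dev, n_valid):
    """Half-width of the confidence interval around a MOS from n_valid ratings."""
    if n_valid < 2:
        raise ValueError("need at least two valid ratings")
    return Z_95 * std_dev / sqrt(n_valid)

for n in (21, 35, 140):
    print(f"{n:>3} valid ratings: MOS +/- {ci_half_width(ASSUMED_SD, n):.2f}")
```

The 21 and 35 rows correspond to keeping 21 versus the full 35 valid subjects; the 140 row shows that cutting the error bar in half takes four times the panel, not twice.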

When crowdsourced MOS is good enough — and when you need a lab

The honest decision rule is about the size of the difference you are trying to see, and about whether the screen itself is part of what you are testing.

Crowdsourcing is good enough — often better than a lab — when the difference is comfortably visible, when a relative ranking is what you need, and when the home is the real viewing context anyway. Comparing two encoder ladders that are clearly apart, ranking five candidate presets, screening a large set of clips to find the few worth a careful look, or measuring the quality of user-generated and live content (which has no pristine reference and is watched on phones in the first place) all play to the crowd's strengths: scale, speed, and device realism. The validated reproducibility of the crowdsourced P.910 implementation against lab studies is the evidence that, run properly, the crowd lands in the same place as the lab for these jobs.

You need a lab when the difference is small and near the threshold of what anyone can see, or when the display is a variable you must hold still. The smallest MOS difference a 5-point test reliably resolves is roughly 0.5 even with a clean 24-subject lab panel; an uncontrolled crowd, noisier per rating, needs its larger panel just to match that, and still struggles below it. So a codec decision that hinges on a 0.2–0.3 MOS gap is a lab job, not a crowd job: the error bar shrinks only with the square root of the valid panel size, so halving it takes roughly four times as many valid ratings, and recruiting alone is an expensive way to chase a gap that small. The same goes for anything where the screen is the experiment: HDR and 4K content judged on a reference monitor, color-critical grading, or studies where calibrated luminance and contrast are the whole point. A crowd of uncalibrated phones cannot answer a question about a calibrated display.

[Figure 4 image: a decision tree choosing between crowdsourcing and a lab based on difference size, display dependence, and ground-truth need.]
Figure 4. Crowd or lab? If the difference is near-threshold, the display is the variable, or you need publishable ground truth, book a lab; otherwise the crowd is faster, cheaper, and usually just as accurate for the ranking you need, provided it is run with the full reliability pipeline.

There is also a middle path the standards endorse: use the crowd to screen broadly and fast, then confirm the one or two close calls that survive in a small controlled test. The crowd narrows the field; the lab settles the photo finish.
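
The decision rule can be written down as a short function. This is only a sketch of the reasoning in this section: `choose_method` and its parameter names are invented for the example, the 0.5 default comes from the "roughly 0.5" figure above, and passing `None` for an unknown gap is an assumed way to represent the screen-first middle path.

```python
def choose_method(expected_mos_gap, display_is_the_variable,
                  needs_publishable_ground_truth, resolvable_gap=0.5):
    """Suggest a test method from the three questions in this section.

    expected_mos_gap: rough size of the difference you expect, or None if unknown
    resolvable_gap:   smallest gap treated as comfortably visible (assumed 0.5)
    """
    if display_is_the_variable or needs_publishable_ground_truth:
        return "lab"
    if expected_mos_gap is None:
        # Unknown size: screen broadly first, then confirm the close calls.
        return "crowd first, then lab for close calls"
    if expected_mos_gap < resolvable_gap:
        return "lab"
    return "crowd"

print(choose_method(1.0, False, False))   # clearly separated encoder ladders
print(choose_method(0.25, False, False))  # near-threshold codec decision
print(choose_method(1.0, True, False))    # HDR judged on a reference monitor
print(choose_method(None, False, False))  # large clip set, gap not yet known
```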

Common mistake — no gold or trapping questions. Without trials whose answers you already know, you cannot tell a careful viewer from a bot, and your "MOS" is an average that includes noise you can't see. Embed gold (known-quality clips) and trapping (in-content instructions) trials, roughly one of each per ten real clips, and screen on both — this is the core of the ITU-T P.808 framework and the open P.910 implementation.

Common mistake — recruiting exactly 35. The standard's 35 is the valid count after screening, not the number to invite. If you recruit 35 and reject 40%, you keep 21, and every confidence interval widens. Recruit 35 / (1 − reject_rate) and verify the surviving count per clip before you trust a single MOS.

Common mistake — reading crowd MOS as an absolute lab number. An uncontrolled panel's scale can shift with device, context, and instructions, so a crowd MOS of 4.1 is not interchangeable with a lab MOS of 4.1. Compare within the same test — the difference or ranking between your conditions — not the absolute number against a lab result run elsewhere.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and we treat crowdsourcing as the fast, wide first pass, not a replacement for a controlled test. When a client needs to rank encoder settings or sanity-check a quality change across the real phones and laptops their audience uses, a crowd test run with the full reliability pipeline — qualification, gold and trapping trials, and P.910 screening — gives a defensible answer in a day instead of three weeks. When a decision turns on a near-threshold difference, on HDR or color, or on a number that has to survive outside scrutiny, we run a small controlled test instead, and we keep the same provenance discipline we apply to our benchmark methodology: the panel, the platform, the screening rules, and the reject rate, all reported so the result holds up after the test closes.

References

Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union (ITU-T Study Group 12), approved 29 October 2023. Tier 1 (official standard). The controlling video standard after it absorbed ITU-T P.913's any-environment content; requires at least 24 valid subjects for a controlled environment and at least 35 for an uncontrolled one (clause 10.1); defines ACR, ACR-HR, DCR, and PC methods. Read directly from the ITU text on 2026-06-24. https://www.itu.int/rec/T-REC-P.910-202310-I/en
Recommendation ITU-T P.808 (06/2021), "Subjective evaluation of speech quality with a crowdsourcing approach," International Telecommunication Union (ITU-T Study Group 12), in force (supersedes P.808 06/2018). Tier 1 (official standard). The canonical crowdsourcing framework: the worker qualification job, gold-standard and trapping questions, and multi-stage data screening, defined for ACR — for speech, but the methodology the video field reuses. Verified on the ITU recommendation page 2026-06-24. https://www.itu.int/rec/T-REC-P.808
Recommendation ITU-T P.913 (06/2021, deleted 2 February 2024), "Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment," International Telecommunication Union (ITU-T Study Group 12). Tier 1 (official standard). The original "any environment" video/audiovisual methods recommendation — the source of the uncontrolled and crowdsourced video methods; deleted on 2 February 2024 after its content was incorporated into ITU-T P.910. Verified on the ITU recommendation page 2026-06-24. https://www.itu.int/rec/T-REC-P.913
Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector. Tier 1 (official standard). The controlled-lab reference for viewing conditions (calibrated display, fixed lighting, measured distance) against which the crowdsourced, uncontrolled setting is contrasted; the source of the lab end of the speed-versus-control trade. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.500-15-202305-I!!PDF-E.pdf
T. Hoßfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, P. Tran-Gia, "Best Practices for QoE Crowdtesting: QoE Assessment With Crowdsourcing," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 541–558, 2014. Tier 5 (peer-reviewed). The foundational best-practices paper: the reliability problem, the influence of incentives and unknown environment, and the two-stage (reliability then rating) crowdtesting design that yields more reliable results. https://www.keimel.org/publication/hossfeld-tom-2014/Hossfeld-TOM2014.pdf
B. Naderi, R. Cutler, "A crowdsourced implementation of ITU-T P.910" (a crowdsourcing approach to video quality assessment), 2022 (arXiv:2204.06784; ICASSP 2024; open-source microsoft/P.910). Tier 5 (peer-reviewed) / first-party tooling. The open-source crowdsourced extension of P.910: ACR, ACR-HR, DCR, and CCR with rater, environment, hardware, and network qualifications plus gold and trapping questions, validated as accurate and highly reproducible against P.910 lab studies. https://arxiv.org/abs/2204.06784
B. Naderi, R. Cutler, "An Open source Implementation of ITU-T Recommendation P.808 with Validation," Interspeech 2020 (arXiv:2005.08138). Tier 5 (peer-reviewed) / first-party tooling. The reference open-source P.808 crowdsourcing pipeline — qualification (hearing, environment, hardware), gold and trapping questions, and the data-screening rules, with the ~11-trials-per-task structure (one gold, one trapping); the speech sibling of the video implementation. https://arxiv.org/abs/2005.08138
"Ensuring Reliable Participation in Subjective Video Quality Tests Across Platforms," 2025 (arXiv:2509.20001). Tier 5 (peer-reviewed preprint). Recency anchor on evolving crowd reliability threats: workers exploiting video metadata and routing through remote-desktop (RD) connections that re-encode the video, with proposed objective and subjective RD detectors and a comparison of two mainstream platforms. https://arxiv.org/abs/2509.20001
"Comparative Study of Subjective Video Quality Assessment Test Methods in Crowdsourcing for Varied Use Cases," 2025 (arXiv:2509.20118). Tier 5 (peer-reviewed preprint). Recency anchor comparing crowdsourced test methods (ACR and relatives) across use cases, informing which method fits which crowd task. https://arxiv.org/abs/2509.20118

Related glossary terms

ITU-T P.910

ITU-T P.808

ACR-HR

Confidence interval

ITU-R BT.500

Ground truth

Video quality measurement

Viewing distance