Why this matters
Choosing the wrong test method is the most expensive mistake in subjective testing, because you only discover it after the data is collected: a fast method that cannot see the difference you care about wastes the whole panel, and a sensitive method run on too many conditions never finishes. This article is for the video engineer, QA lead, or codec evaluator who has to design or commission a test — or who has to read someone else's quality paper and judge whether the method fits the claim. It is the methods article in our subjective-testing block: the previous piece, MOS, DMOS, and the rating scales, explained the numbers these tests produce, and the next one, designing a subjective test that survives scrutiny, covers building the chosen test. The encoder-operator's one-screen version of this material lives in the Video Encoding section's subjective-testing overview; everything here is the deep treatment behind it.
The one decision behind every method: one stimulus or two
Before the acronyms, there is a single fork that organizes all of them. A subjective test either shows the viewer one video and asks them to judge it alone, or shows them two videos and asks them to judge one against the other. Everything else — the scale, the analysis, the cost — follows from that choice.
A method that shows one clip at a time is a single-stimulus method. The viewer meets each clip the way a real viewer meets a video in the wild: on its own, with nothing to compare it against. Of the four category methods covered here, ACR is the single-stimulus one.
A method that shows two clips per judgment is a double-stimulus method. The viewer always has a second clip — usually the pristine original — in view or in recent memory, and rates the relationship between the two. DCR, CCR, and PC are all double-stimulus methods, and they differ only in what they ask the viewer to do with the pair.
The trade is the heart of this article. A single clip is fast to show and lets you rate many conditions in a session, but the viewer's judgment is contaminated by how much they liked the content and by their personal harshness. Two clips cancel those biases and reveal smaller differences, but each judgment takes roughly twice as long, so you can cover far fewer conditions. Hold that trade in mind; the rest is detail.
ACR: show one clip, rate it on its own
Start with the method you will see most often. Absolute Category Rating (ACR) is the single-stimulus method defined in ITU-T P.910 §8.1: each clip is presented once, on its own, and the viewer rates it on a five-point quality scale immediately afterward. "Absolute" means the viewer judges the clip on its own merits, with no reference to compare against.
The scale is five labelled categories, not bare numbers (ITU-T P.910 §8.1):
| Rating | Label |
|---|---|
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |
Average those ratings across the panel and you have the clip's Mean Opinion Score (MOS) — the single number explained in full in MOS, DMOS, and the rating scales. ACR is the workhorse of the field for one reason: it is the fastest method per rating, because each judgment consumes one clip and one vote. That throughput is why almost every large quality database and almost every crowdsourced test is built on ACR or its hidden-reference variant.
ACR has a close cousin worth naming here, because it is the most common method in practice. ACR with Hidden Reference (ACR-HR), defined in P.910 §8.6.2, runs an ordinary ACR test but quietly slips the pristine source clips into the set as unlabelled items. In analysis you subtract each viewer's rating of the hidden reference from their rating of the matching processed clip, which recovers a degradation score (a DMOS) while keeping ACR's single-stimulus speed. It is, in a sense, the bridge between the single- and double-stimulus worlds: the viewer never knows they are doing a comparison, but the analysis is a comparison anyway.
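The hidden-reference subtraction can be sketched in a few lines. This is a minimal illustration, not the recommendation's reference implementation: the data layout, the clip names, and the function name `acr_hr_dmos` are hypothetical, and the `+ 5` offset (which maps "rated the same as the reference" back to the top of the five-point scale) is an assumption about the usual convention that you should check against the edition you follow.

```python
from statistics import mean

def acr_hr_dmos(ratings, reference_of):
    """Differential score per processed clip.

    ratings:      {viewer: {clip_id: vote on the 1..5 ACR scale}}  (hypothetical layout)
    reference_of: {processed_clip_id: hidden_reference_clip_id}
    """
    dmos = {}
    for clip, ref in reference_of.items():
        # Difference per viewer first, so each viewer's own harshness cancels.
        # The +5 offset is an assumed convention: "same as reference" maps to 5.
        diffs = [votes[clip] - votes[ref] + 5
                 for votes in ratings.values()
                 if clip in votes and ref in votes]
        dmos[clip] = mean(diffs)
    return dmos

# Made-up votes from three viewers; "src" is the unlabelled pristine clip.
ratings = {
    "v1": {"src": 5, "enc_a": 4, "enc_b": 2},
    "v2": {"src": 4, "enc_a": 4, "enc_b": 2},
    "v3": {"src": 5, "enc_a": 3, "enc_b": 1},
}
result = acr_hr_dmos(ratings, {"enc_a": "src", "enc_b": "src"})
print(result)
```

Differencing inside each viewer, rather than subtracting two panel means, is the design choice that removes personal harshness: viewer v2 rates everything a little lower, and that offset drops out of their differences.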
ACR's weakness is the flip side of its speed. Because the viewer never sees the original, ACR "may be insensitive to some impairments that are easily detected" when a reference is present (ITU-T P.910 §8.2.1) — a slight loss of detail or a faint colour shift can pass unnoticed with nothing to compare against. P.910 also quantifies ACR's resolving power: it "rarely yields" a reliably resolvable MOS difference below 0.5 points for 24 subjects, 0.7 for 15, 1.1 for 9, and 1.5 for 6 (ITU-T P.910 §8.1.1). If the difference you care about is smaller than that floor, ACR is the wrong tool, and you reach for a double-stimulus method.
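As a planning aid, the floor quoted above can be turned into a quick pre-test check. The sketch below is illustrative only: the function name `acr_can_resolve` is hypothetical, and the rule of falling back to the nearest smaller tabulated panel (rather than interpolating) is a deliberately conservative assumption, not something P.910 prescribes.

```python
# Panel size -> smallest MOS gap ACR reliably resolves, as quoted in the text
# from ITU-T P.910 §8.1.1.
ACR_FLOOR = {24: 0.5, 15: 0.7, 9: 1.1, 6: 1.5}

def acr_can_resolve(expected_gap, n_subjects):
    """True if the expected MOS gap clears the quoted floor.

    Assumption: for a panel size between two tabulated values, use the floor
    of the nearest smaller tabulated panel (conservative, no interpolation).
    """
    eligible = [n for n in ACR_FLOOR if n <= n_subjects]
    if not eligible:
        raise ValueError("no floor is quoted for panels smaller than 6 subjects")
    return expected_gap >= ACR_FLOOR[max(eligible)]

print(acr_can_resolve(0.6, 24))  # gap of 0.6 MOS with 24 subjects
print(acr_can_resolve(0.6, 15))  # same gap with 15 subjects
```

If the check fails for the panel you can afford, that is the signal to move to a double-stimulus method before collecting any data.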
DCR / DSIS: show the original, then rate the damage
When the question changes from "how good is this clip?" to "how faithful is this clip to the original?", you switch to Degradation Category Rating (DCR), defined in ITU-T P.910 §8.2 and known in the broadcast standard ITU-R BT.500-15 as the Double Stimulus Impairment Scale (DSIS). The viewer is shown the pristine source first, knows it is the source, then sees the processed clip and rates how impaired the second is relative to the first.
The scale changes from quality words to impairment words (ITU-T P.910 §8.2):
| Rating | Label |
|---|---|
| 5 | Imperceptible |
| 4 | Perceptible, but not annoying |
| 3 | Slightly annoying |
| 2 | Annoying |
| 1 | Very annoying |
DCR's strength is exactly where ACR is blind. With the original in view, the viewer can spot small fidelity losses that vanish in a single-stimulus test, which is why P.910 recommends DCR "when evaluating the fidelity of transmission with respect to the source signal," noting this is "frequently an important factor in the evaluation of high-quality systems" (ITU-T P.910 §8.2.1). Empirically, the gain is real: in a controlled comparison on 1080p H.264 content, DCR/DSIS produced smaller confidence intervals than ACR, especially at lower quality levels, while ACR remained the fastest method (Kawano et al., 2014).
The cost is throughput. Showing two clips per rating roughly halves the number of conditions you can cover in a session of the same length. DCR also measures impairment, not improvement: because 5 means "imperceptible" (identical to the source), the scale cannot express a processed clip that somehow looks better than its reference — a real situation with sharpening or super-resolution. For that you need the next method.
CCR / DSCS: rate the second clip against the first, up or down
Comparison Category Rating (CCR), defined in ITU-T P.910 §8.3 and also called the Double Stimulus Comparison Scale (DSCS), shows the viewer two clips — typically the processed clip and the reference, in a randomized order — and asks them to rate the quality of the second relative to the first on a seven-point comparison scale that runs in both directions (ITU-T P.910 §8.3):
| Rating | Label |
|---|---|
| +3 | Much better |
| +2 | Better |
| +1 | Slightly better |
| 0 | The same |
| −1 | Slightly worse |
| −2 | Worse |
| −3 | Much worse |
Two things make CCR distinct. First, the scale is symmetric around zero, so it can record that the second clip looked better than the first — the case DCR cannot express. That matters whenever the "processing" might add apparent quality: spatial upscaling, denoising, or AI super-resolution can all push a clip above its own source on a viewer's eye, and only a comparison scale captures it. Second, because the order of the two clips is randomized and the viewer rates the relationship, CCR is well suited to "comparing impairments that are nearly equal in quality" (ITU-T P.910 §8.3.1) — the situation where two encoders are so close that an absolute scale cannot separate them.
Recent crowdsourced evidence shows CCR's sensitivity advantage concretely. In a 2024–2025 P.910-compliant study, CCR was more sensitive than ACR-HR and "reflected quality improvements beyond the reference" for super-resolution upscaling, while ACR-HR showed compressed scale use — it failed to spread out the scores — particularly for fair-quality source content (Naderi & Cutler, P.910-Crowd; the comparative study, 2025). The price, again, is cost: that same study found ACR-HR was "approximately twice as fast and cost-effective" as CCR.
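Because CCR randomizes which clip comes second, the raw votes have to be brought to a common direction before they are averaged. The sketch below shows one plausible way to do that, assuming each vote is logged together with a flag saying whether the processed clip was shown second; the record layout and the name `ccr_scores` are hypothetical.

```python
from statistics import mean

def ccr_scores(votes):
    """Re-express every vote as 'processed clip relative to reference'.

    votes: list of (vote, processed_shown_second), where vote is in -3..+3
           and rates the second clip against the first (hypothetical layout).
    """
    # If the reference was shown second, the vote describes the reference
    # relative to the processed clip, so its sign is flipped.
    return [vote if processed_second else -vote
            for vote, processed_second in votes]

# Made-up votes for one processed clip from four viewers.
votes = [(+1, True), (-2, False), (0, True), (-1, False)]
normalized = ccr_scores(votes)
print(normalized, mean(normalized))
```

After normalization a positive mean says the processed clip was judged better than its reference, which is exactly the outcome the DCR scale has no room for.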
PC: forget the scale, just pick the better one
The most sensitive method asks the viewer to do the simplest possible thing. Pair Comparison (PC), defined in ITU-T P.910 §8.4 and called stimulus-comparison in ITU-R BT.500-15, shows two clips and asks one question: which one is better? There is no rating scale at all — just a forced choice (sometimes with a "no preference" option).
Stripping away the scale removes the hardest part of any rating task. A viewer who struggles to decide whether a clip is a "3" or a "4" can almost always tell you which of two clips they prefer, so Pair Comparison is the most discriminating method when the test items are very close in quality. It is the method of choice when you need to separate encoders that an absolute scale reports as tied.
Turning a pile of "A beats B" votes into a quality scale takes a model. The two standard choices are Thurstone's Law of Comparative Judgment, which converts preference probabilities into z-scores on a perceptual scale, and the Bradley-Terry model, a logistic model that yields scores often expressed in just-objectionable-difference (JOD) units. Both turn the win/loss tally into a one-dimensional ranking with real spacing, so Pair Comparison gives you an ordered scale even though no viewer ever used one.
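To make the scaling step concrete, here is a minimal sketch of a Bradley-Terry fit using the classic iterative (minorization-maximization) update. It is an illustration under simplifying assumptions, not a production scaler: it assumes no ties, a zero diagonal, and that every item has won at least once and been compared at least once. The function name and the vote counts are hypothetical, and it returns raw strengths rather than JOD units.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from a win-count matrix.

    wins[i][j] = number of times item i was preferred over item j
                 (diagonal assumed to be zero).
    Returns strengths that sum to 1; higher means preferred more often.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (comparisons with j) / (p_i + p_j).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        scale = sum(updated)
        p = [x / scale for x in updated]  # normalize so strengths sum to 1
    return p

# Made-up tallies for three encoders A, B, C (10 votes per pair).
wins = [
    [0, 8, 9],   # A beat B 8 times, beat C 9 times
    [2, 0, 8],   # B beat A 2 times, beat C 8 times
    [1, 2, 0],   # C beat A once, beat B twice
]
strengths = bradley_terry(wins)
two_items = bradley_terry([[0, 3], [1, 0]])  # item 0 wins 3 of 4
print(strengths, two_items)
```

Under the model, the probability that item i is preferred over item j is p_i / (p_i + p_j), so in the two-item case a 3-to-1 tally yields strengths in a 3-to-1 ratio. Dedicated tools add what this sketch omits: tie handling, confidence intervals, and conversion to a perceptual unit.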
Pair Comparison has one brutal weakness, and it is arithmetic, not perception. The number of distinct pairs grows with the square of the number of conditions — and on a real test that number gets out of hand fast. That explosion is worth working through in full, because it is the single fact that decides whether PC is usable for your test.
Figure 1. The one decision behind every method. ACR is single-stimulus (one clip, judged alone); DCR, CCR, and PC are double-stimulus (two clips, judged in relation). The question each asks is what changes.
The four scales, side by side
The method and its scale travel together, so it is worth seeing all four scales at once. ACR rates absolute quality on five quality words; DCR rates impairment on five impairment words; CCR rates a relationship on a seven-point scale that runs from "much worse" through "the same" to "much better"; and Pair Comparison has no scale, only a choice. The direction and shape of the scale are not cosmetic — they decide what the test can and cannot detect.
Figure 2. Four methods, four scales. The bidirectional CCR scale is the only one that records improvement above the reference; PC discards the scale entirely for a forced choice.
The continuous-scale cousins: DSCQS, SAMVIQ, and SSCQE
The four category methods above cover most multimedia testing, but broadcast television added a family of continuous-scale methods that ITU-R BT.500-15 still defines, and you will meet them in codec papers. They matter when the differences are so small that even a five-point category scale is too coarse.
The most important is the Double Stimulus Continuous Quality Scale (DSCQS). The viewer sees the reference and the test clip (often each shown twice), and rates both on a continuous line marked only with quality adjectives, with the identity of the reference hidden by randomizing the order. The score is the difference between the two ratings. Because the viewer rates on a continuous line rather than picking a category, DSCQS avoids forcing fine judgments into coarse bins, which makes it the classic choice "where the qualities of the test material and the original are similar" — the near-transparent, high-quality case (ITU-R BT.500-15). It is the historical gold standard for codec evaluation at high quality.
Two more round out the family. SAMVIQ (Subjective Assessment of Multimedia Video Quality), defined in P.910 §8.5, lets the viewer play several clips in any order against an explicit reference and score each on a continuous 0–100 scale — useful when you want random access and direct multi-way comparison. SSCQE (Single Stimulus Continuous Quality Evaluation), in BT.500, has the viewer move a slider continuously while watching long content, capturing how quality varies over time rather than a single number per clip — the method when the thing you care about is a quality drop midway through, not the average.
These continuous methods buy fine discrimination at the cost of being harder for viewers and slower to run. P.910 is blunt that simply adding scale points to a category method does not help: for continuous and many-point scales "the accuracy of the resulting MOS does not improve" while "the method becomes more difficult for subjects" (ITU-T P.910 §8.7.1). The continuous methods earn their place through their structure — the hidden reference in DSCQS, the time axis in SSCQE — not through extra scale resolution.
Where the two recommendations divide the work
Two ITU recommendations govern this territory, and people search for them by their exact slug, so it helps to know which owns what.
ITU-T P.910 (10/2023) is the controlling recommendation for multimedia video quality — the streaming, conferencing, mobile, and user-generated-content world most engineers work in. The 2023 edition is the consolidated one: it absorbed the former P.911 and P.913 (P.913 was formally deleted on 2 February 2024, its "any environment" methods folded into P.910), so today P.910 is the single current reference for ACR, DCR, CCR, PC, ACR-HR, and SAMVIQ. Cite this edition, not the 1999 or 2008 versions still floating around the web.
ITU-R BT.500-15 (05/2023) is the broadcast-television companion, organized into three parts: overall requirements and when to use which method, the methodologies themselves, and format-specific guidance. It is the home of DSIS (the broadcast name for DCR), DSCQS, single-stimulus methods, stimulus-comparison (pair comparison), SSCQE, and SDSCE, along with the canonical viewing-condition and screening rules that the multimedia world borrows. When a paper cites "BT.500," it is almost always for the viewing setup or the DSCQS/DSIS method.
The practical rule: if you are testing streaming, conferencing, mobile, or UGC, start from P.910; if you are testing broadcast-grade television or need DSCQS at near-transparent quality, reach for BT.500-15. The two overlap heavily and agree on the fundamentals; the method names mostly map across (DCR ≈ DSIS, CCR ≈ DSCS).
The three things that actually differ: sensitivity, throughput, and load
Strip away the names and the four methods differ along three axes that decide every real design choice.
Sensitivity is the smallest quality difference the method can reliably detect. Pair Comparison is the most sensitive because a forced choice is the easiest judgment to make; CCR and DCR come next because a visible reference exposes small fidelity losses; ACR is the least sensitive because the viewer has nothing to compare against. If your conditions are far apart in quality, ACR's lower sensitivity does not matter and its speed wins; if they are nearly tied, you need a comparison method or the test will report a meaningless tie.
Throughput is how many conditions you can rate per unit of viewer time. ACR leads by a wide margin — one clip, one vote — which is why it dominates large and crowdsourced tests. DCR and CCR roughly halve throughput by showing two clips per judgment. Pair Comparison is in a category of its own, because its cost does not grow linearly with the number of conditions; it grows with the square.
Participant load is how hard the task is for a viewer and how fast they fatigue. A forced choice is the lightest cognitive task; rating on an absolute five-point scale is heavier (the viewer must hold a mental model of "what a 3 means"); a continuous scale is heaviest. Load interacts with session length: P.910 limits sessions to control fatigue (ITU-T P.910 §11.1), so a heavier task means fewer ratings before the viewer's attention degrades.
One honest caveat keeps the trade from being a clean ranking. A 2003 meta-experiment found that a single-stimulus continuous test could match double-stimulus accuracy after proper statistical post-processing — meaning double-stimulus methods carry no inherent accuracy advantage, only a sensitivity-and-throughput trade-off (Pinson & Wolf, 2003). The eye does not become more truthful when you show it two clips; you simply give it an easier judgment. Choose the method that makes the judgment you care about the easy one.
Figure 3. The core trade-off. Moving from ACR toward Pair Comparison buys sensitivity and pays in throughput. There is no method in the top-right corner — that is why method choice is a choice.
Worked example: why Pair Comparison explodes
Here is the arithmetic that decides whether Pair Comparison is even possible for your test. Suppose you are comparing 12 encoder settings on one source clip and want each setting ranked.
With ACR, the count is simple: one rating per condition per viewer. For 12 conditions and 15 viewers, that is 12 × 15 = 180 ratings, each consuming one clip view.
With Pair Comparison, you must show every distinct pair. The number of pairs among N conditions is the combination formula:
pairs = N × (N − 1) / 2
For 12 conditions that is 12 × 11 / 2 = 66 pairs. If all 15 viewers judge every pair, that is 66 × 15 = 990 comparisons — and because each comparison shows two clips, the viewer watches 990 × 2 = 1,980 clip views, against ACR's 180. Pair Comparison costs about eleven times the viewing time to rank the same 12 items.
Now watch the curve bend. The per-viewer counts for a few condition-set sizes:
| Conditions (N) | ACR ratings/viewer | PC pairs/viewer | Ratio |
|---|---|---|---|
| 5 | 5 | 10 | 2× |
| 10 | 10 | 45 | 4.5× |
| 20 | 20 | 190 | 9.5× |
| 40 | 40 | 780 | 19.5× |
ACR grows in a straight line; Pair Comparison grows as the square. Put session limits on top and the impracticality is concrete. P.910 keeps sessions short to limit fatigue, and at roughly 25 seconds per comparison (view two 10-second clips, then choose), the 66 pairs take each viewer about 27.5 minutes for this one source clip, which alone fills most of a single sitting; the panel's 990 comparisons add up to about 412 minutes of viewing. Every additional source clip repeats that cost, and going from N to N + 1 conditions adds N new pairs. This is why Pair Comparison is reserved for small condition sets, or run with active sampling algorithms (such as ASAP) that show only the most informative pairs instead of all N(N−1)/2 of them, cutting the count by an order of magnitude while preserving the ranking (Mikhailiuk et al., 2021).
Figure 4. Pair Comparison's cost grows as N(N−1)/2 — the square of the number of conditions — while ACR grows linearly. Active sampling is what keeps PC usable past a handful of conditions.
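The arithmetic in this section is easy to reproduce and adapt to your own test. The sketch below uses the worked example's figures (12 conditions, 15 viewers, about 25 seconds per comparison); the variable names are illustrative, and it reports viewing time both per viewer and summed over the panel, since the two differ by a factor of the panel size.

```python
def pc_pairs(n_conditions):
    """Distinct pairs among N conditions: N * (N - 1) / 2."""
    return n_conditions * (n_conditions - 1) // 2

CONDITIONS, VIEWERS = 12, 15
SECONDS_PER_COMPARISON = 25  # two 10-second clips plus time to choose

acr_views = CONDITIONS * VIEWERS                   # one clip view per rating
pairs_per_viewer = pc_pairs(CONDITIONS)
pc_comparisons = pairs_per_viewer * VIEWERS
pc_views = pc_comparisons * 2                      # two clips per comparison
minutes_per_viewer = pairs_per_viewer * SECONDS_PER_COMPARISON / 60
minutes_whole_panel = pc_comparisons * SECONDS_PER_COMPARISON / 60

print(f"ACR clip views: {acr_views}, PC clip views: {pc_views}")
print(f"PC minutes per viewer: {minutes_per_viewer}, whole panel: {minutes_whole_panel}")

# Pairs per viewer and the PC-to-ACR ratio for several condition counts.
for n in (5, 10, 20, 40):
    print(n, pc_pairs(n), pc_pairs(n) / n)
```

Changing `CONDITIONS` is the fastest way to see whether a full Pair Comparison fits your session budget before you commit a panel.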
Choosing the method: a decision rule
Put the three axes together and the choice reduces to a short series of questions.
Start with the reference. Do you have the pristine original? If not (live streams, user-generated content, anything with no clean source), you cannot run any method that compares against the source, which rules out DCR, ACR-HR, and reference-based CCR, and a no-reference approach is your only option (covered in no-reference quality for live and UGC). If you do have the original, continue.
Next, the spread. Are the conditions far apart or nearly tied in quality? If they are clearly different — comparing a good encode against a bad one, or sampling a wide bitrate ladder — ACR's speed wins and its lower sensitivity costs you nothing. If they are nearly tied — two strong encoders, or a fine bitrate step near saturation — you need a comparison method.
Then, the direction. Could the processing make the clip look better than the source? Sharpening, denoising, and super-resolution all can. If yes, use CCR (its bidirectional scale records improvement) rather than DCR (whose scale tops out at "identical to source"). If you only expect degradation, DCR is the cleaner fidelity measure.
Finally, the count. How many conditions, and how tied are they? A handful of nearly identical conditions is the home ground of Pair Comparison — the most sensitive method, affordable because N is small. Many conditions rule PC out unless you use active sampling. And when you need the finest discrimination at near-transparent quality, DSCQS from BT.500 is the historical answer.
Figure 5. A decision rule for method choice. Reference availability gates everything; after that, the quality spread and the number of conditions point to ACR, DCR, CCR, or PC.
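The four questions can be written down as a small function, which is a useful way to check that a test plan answers all of them. This is one reading of the rule, not a normative algorithm: the function name, the argument names, and especially the `small_set` threshold for what counts as "a handful" of conditions are assumptions made for illustration.

```python
def choose_method(has_reference, nearly_tied, may_improve, n_conditions, small_set=6):
    """Sketch of the decision rule: reference, spread, count, direction.

    small_set is a hypothetical cut-off for "a handful" of conditions;
    the article does not fix a number.
    """
    if not has_reference:
        return "no-reference approach"
    if not nearly_tied:
        return "ACR"            # clearly different conditions: speed wins
    if n_conditions <= small_set:
        return "PC"             # few, nearly identical conditions
    if may_improve:
        return "CCR"            # bidirectional scale records improvement
    return "DCR"                # degradation only: cleaner fidelity measure

print(choose_method(True, False, False, 30))
print(choose_method(True, True, True, 20))
print(choose_method(True, True, False, 4))
```

The sketch leaves out the two escape hatches the text mentions: active sampling, which can bring PC back into play for larger condition sets, and DSCQS for near-transparent quality.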
The methods compared
The table below is the reference to keep. Read the "where it falls short" column as carefully as the rest — every method buys its strength with a specific weakness.
| Method | Stimulus | Scale | Best for | Throughput | Where it falls short |
|---|---|---|---|---|---|
| ACR (P.910 §8.1) | Single | 5-pt quality (Excellent–Bad) | Many conditions, clearly different qualities; large/crowdsourced tests | Highest | Low sensitivity; misses small fidelity losses; content liking confounds the score |
| ACR-HR (P.910 §8.6.2) | Single + hidden ref | 5-pt, differenced to DMOS | ACR speed with bias removal | Highest | Compressed scale use on fair-quality sources; cannot show improvement above reference |
| DCR / DSIS (P.910 §8.2; BT.500) | Double (ref then test) | 5-pt impairment (Imperceptible–Very annoying) | Fidelity to a source; high-quality systems | ~Half of ACR | Only measures damage, not improvement; slower than ACR |
| CCR / DSCS (P.910 §8.3) | Double (randomized order) | 7-pt comparison (−3…+3) | Near-tied conditions; processing that may improve quality | ~Half of ACR | Slower and ~2× the cost of ACR-HR; needs careful order randomization |
| PC (P.910 §8.4; BT.500) | Double (forced choice) | None (which is better?) | A few nearly-identical conditions; separating tied encoders | Lowest (cost grows as N²) | Pair count explodes with conditions; needs Thurstone/Bradley-Terry scaling |
| DSCQS (BT.500-15) | Double, continuous | Continuous quality line, differenced | Near-transparent, high-quality codec comparison | Low | Heavy task; reference must be hidden by randomization |
The common mistakes
Method-choice errors are quiet: the test runs, the numbers look fine, and the conclusion is wrong. Five recur often enough to name.
Mistake 1 — using ACR to separate near-tied encoders. If two conditions differ by less than ACR's resolvable floor (about 0.5 MOS at 24 subjects, per P.910 §8.1.1), ACR will report a tie that is really a difference. Use CCR or Pair Comparison when the conditions are close.
Mistake 2 — using DCR when the processing can improve quality. The DCR impairment scale stops at "imperceptible" (identical to source), so it cannot record a super-resolution or sharpening result that looks better than its reference. Use CCR's bidirectional scale instead.
Mistake 3 — running full Pair Comparison on many conditions. N(N−1)/2 pairs become thousands of comparisons fast; a 40-condition full PC is 780 pairs per viewer. Subset the conditions or use active sampling before you commit a panel.
Mistake 4 — comparing scores across methods as if they were the same. An ACR MOS, a DCR impairment score, a CCR comparison mean, and a Bradley-Terry PC scale are different quantities on different scales. Never put them on one axis; compare within a method, not across.
Mistake 5 — citing the wrong recommendation edition. The 1999 and 2008 P.910 versions are still indexed by search engines; the current edition is P.910 (10/2023), which absorbed P.911 and P.913. Cite the edition you actually followed, and use the current one.
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the method choice described here is the first decision we make when a quality question is genuinely contested. When a client asks "is the new encoder visibly better," and the two encoders are close, we run a comparison method (CCR or a small Pair Comparison) rather than ACR, because an absolute scale would report a tie and end the investigation prematurely. When the question is "does this restoration step actually help," we use CCR specifically, because only its bidirectional scale can record an improvement above the source. And when we publish numbers, we name the method, the recommendation edition, the screened panel size, and the confidence interval behind every figure — the discipline documented in our benchmark methodology — so the result survives the meeting it is quoted in.
What to read next
- MOS, DMOS, and the rating scales
- Designing a subjective test that survives scrutiny
- The statistics of subjective data
Call to action
- Talk to a video engineer: book a 30-minute scoping call to talk through your ACR, DCR, or Pair Comparison test plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- Recommendation ITU-T P.910 (10/2023), "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, approved 29 October 2023. Tier 1 (official standard). The controlling recommendation: defines ACR (§8.1) and its sensitivity floor (§8.1.1), DCR/DSIS (§8.2), CCR/DSCS (§8.3), the Pair Comparison method (§8.4), SAMVIQ (§8.5), ACR-HR (§8.6.2), the number-of-levels guidance (§8.7.1), the environment and viewing conditions (§9), number of subjects (§10.1), and subject fatigue / session limits (§11.1). Consolidates the former P.911 and P.913. https://www.itu.int/rec/T-REC-P.910-202310-I/en
- Recommendation ITU-R BT.500-15 (05/2023), "Methodologies for the subjective assessment of the quality of television images," International Telecommunication Union, Radiocommunication Sector, approved 28 May 2023. Tier 1 (official standard). The broadcast companion, in three parts (requirements, methodologies, format-specific): the home of DSIS, the Double Stimulus Continuous Quality Scale (DSCQS), single-stimulus methods, stimulus-comparison (pair comparison), SSCQE, and SDSCE, plus the canonical viewing-condition and screening rules. https://www.itu.int/rec/R-REC-BT.500
- Recommendation ITU-T P.913 (deletion notice), International Telecommunication Union; recommendation deleted 2 February 2024, content incorporated into ITU-T P.910 (10/2023). Tier 1 (official standard status). Recency anchor: the "any environment" subjective methods and the former P.911 are now consolidated into P.910 (10/2023) — cite the current edition. https://www.itu.int/rec/T-REC-P.913
- Recommendation ITU-T P.800.1 (07/2016), "Mean opinion score (MOS) terminology," International Telecommunication Union. Tier 1 (official standard). Standardizes the MOS terminology these methods produce, including the qualifiers that distinguish absolute-rating MOS from comparison results. https://www.itu.int/rec/T-REC-P.800.1
- T. Tominaga, T. Hayashi, J. Okamoto, A. Takahashi, "Performance comparisons of subjective quality assessment methods for mobile video," 2nd International Workshop on Quality of Multimedia Experience (QoMEX), 2010, pp. 82–87. Tier 5 (peer-reviewed). Direct measurement that ACR and ACR-HR are the most time- and cost-efficient general-purpose methods, with the shortest per-stimulus durations, on mobile HD and QVGA H.264 content. https://ieeexplore.ieee.org/document/5246962
- M. H. Pinson, S. Wolf, "Comparing subjective video quality testing methodologies," Visual Communications and Image Processing (VCIP) 2003, SPIE vol. 5150, pp. 573–582. Tier 5 (peer-reviewed). The meta-experiment showing single-stimulus continuous testing can match double-stimulus accuracy after post-processing — i.e., no inherent accuracy advantage for double-stimulus, only a sensitivity/throughput trade-off. https://its.ntia.gov/publications/download/spie03obj.pdf
- T. Kawano, K. Yamagishi, T. Hayashi, "Performance comparison of subjective assessment methods for stereoscopic 3D video quality," IEICE Transactions on Communications, vol. E97-B, no. 4, 2014, pp. 738–745. Tier 5 (peer-reviewed). On 1080p H.264 (2D and 3D), DCR/DSIS achieved smaller confidence intervals than ACR, especially at lower quality, while ACR remained the fastest and DSCQS was best only at very high quality. https://www.jstage.jst.go.jp/article/transcom/E97.B/4/E97.B_738/_article
- B. Naderi, R. Cutler, "A crowdsourcing approach to video quality assessment" (P.910-Crowd), and the 2025 follow-up "Comparative Study of Subjective Video Quality Assessment Test Methods in Crowdsourcing for Varied Use Cases," ICASSP 2024 / arXiv:2509.20118, 2025. Tier 5 (peer-reviewed). The current head-to-head: ACR-HR is ~2× faster and cheaper than CCR, while CCR is more sensitive and captures improvement beyond the reference; ACR-HR shows compressed scale use on fair-quality sources. https://arxiv.org/abs/2509.20118
- A. Mikhailiuk, C. Wilmot, M. Perez-Ortiz, D. Yue, R. Mantiuk, "Active Sampling for Pairwise Comparisons via Approximate Message Passing and Information Gain Maximization" (ASAP), International Conference on Pattern Recognition (ICPR), 2021. Tier 5 (peer-reviewed). The active-sampling approach that makes Pair Comparison practical past a few conditions by selecting only the most informative pairs instead of the full N(N−1)/2 set. https://arxiv.org/abs/2004.05691


