Our Video Codec Benchmark Methodology, Explained

Why this matters

Most codec benchmarks you find online cannot be trusted, not because the authors are dishonest but because the method is hidden. A vendor quotes "30% better" with no metric named, no encoder version, no content set, and no way to check it. This article is for the engineer, codec evaluator, or technical decision-maker who is about to read our codec and encoder comparisons and wants to know exactly how the numbers were made before believing them. It is also a working template: if you run your own benchmark, the rules here will keep you from fooling yourself. We publish the method first so the data that follows has a foundation to stand on.

Why the methodology comes before the results

A benchmark result is a claim about the world, and a claim is only as good as the evidence a reader can check. When a codec vendor says one encoder beats another, the useful question is never "by how much?" — it is "measured how, on what, with which settings?" Without those answers, the headline number is marketing, not measurement.

This block — Fora Soft's benchmarks — exists to publish numbers that other engineers can cite, and a number is only citable if it is reproducible. So we wrote the method down first. Everything in the codec and encoder comparisons that follow (codec comparison on real content, encoder comparison, how results change by content type, and BD-rate explained with our numbers) obeys the rules defined here.

The standards world solved this problem long ago with the idea of common test conditions (CTC) — a fixed, published recipe that says exactly which sequences, which settings, and which measurements every comparison must use, so two labs running the same test get comparable answers. The Alliance for Open Media publishes a CTC for AV1; the Joint Video Experts Team publishes one for its codecs (AOM CTC; JVET-J1010, 2018). Our methodology adapts that discipline to a streaming-realistic setting. The principle is the same: write the recipe down, then follow it every time.

[Figure: end-to-end benchmark pipeline from source masters through encoding, decoding, measurement, pooling, and BD-rate to a provenance manifest]
Figure 1. The benchmark pipeline. Every stage is pinned and disclosed: the source masters, the encoder versions and settings, the decode-and-upscale step, the metric and model, the pooling with its confidence interval, the BD-rate computation over the overlapping quality range, and the provenance record that lets a reader reproduce the whole run.

The content set: real footage, not just the usual test clips

Quality results depend on the content, so the first decision in any benchmark is what to measure on — and it is the decision most often made badly. A codec that wins on a clean, slow-moving test clip can lose on grain, fast motion, or screen text. Measure on the wrong content and the headline number is true only for that content.

The common shortcut is to use the standard academic test sequences — short, pristine clips that everyone has, which makes results comparable across papers. We use a set of those reference clips for exactly that reason: they let anyone line our numbers up against published work. But standard clips alone do not represent what a real product encodes. So we also measure on real client content — the kind of footage our streaming, conferencing, e-learning, OTT, and surveillance projects actually handle — used with permission and reported only as aggregate numbers, never as identifiable frames.

That content set spans the categories that stress codecs in different ways: live action film and television, animation, high-motion sport, screen and presentation content, talking-head conferencing, and user-generated video. Each behaves differently under compression, which is the whole point of how results change by content type. A few rules keep the set honest:

The source is a pristine master. Every clip is a high-quality mezzanine or original, not a re-encode. Benchmarking against an already-compressed source measures the wrong thing — you would reward an encoder for faithfully copying someone else's artifacts.
The color format and resolution are fixed and stated. We report the chroma sampling (4:2:0 unless stated), the bit depth, the resolution, and the frame count for every run, because a metric comparison is only valid when these match.
The clips are long enough to be representative. Very short clips exaggerate start-up and scene-cut effects; we use durations that average over enough shots to be stable.
The set is documented. Each result names the content category and the number of clips behind it, so a reader knows whether a number rests on three clips or thirty.

The encoders and settings: pinned, disclosed, and identical across the comparison

An encoder is not one thing — it is a specific build with specific settings, and changing either changes the result. So we pin both and publish both. The exact version of every encoder is recorded: the H.264 encoder x264, the HEVC encoder x265 (4.2 at the time of writing), the AV1 encoder SVT-AV1 (4.x in 2026), and any others a given comparison includes. When these tools release new versions, the numbers can move, which is why the version is part of the result, not a footnote.

Settings matter as much as the build. The single most common way a codec comparison goes wrong is a mismatch in encoder effort — comparing a slow, high-effort preset of one encoder against a fast preset of another, then reporting the quality difference as if the codec caused it. We hold the comparison apples-to-apples: the same rate-control mode, comparable speed presets chosen for the use case, the same group-of-pictures structure where the codecs allow it, and the same handling of resolution. The full command line for every encode is recorded in the provenance manifest, so there is no hidden flag.

We benchmark at a ladder of operating points, not a single bitrate. Each clip is encoded at several quality or rate settings — typically a spread of constant-rate-factor or fixed-quantizer points — so we can draw a rate-quality curve rather than compare two isolated dots. Standard CTCs use four such points (quantizer values 22, 27, 32, 37 in the JVET conditions) or six (the AOM conditions add lower and higher points); we use enough points to trace the curve across the quality range a streaming service actually ships, and we say how many. The reason for a curve rather than a point is the subject of the next two sections — it is what makes a fair codec comparison possible at all.
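A ladder of operating points is easiest to keep consistent when it is generated rather than typed by hand. The sketch below is illustrative only: the file names, CRF values, and preset choices are placeholders, not the settings behind our published numbers, and the FFmpeg options used (-c:v, -preset, -crf, -an) should be checked against the exact build you pin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    name: str           # label used in reports
    ffmpeg_codec: str   # value passed to -c:v
    preset: str         # effort level, chosen to be comparable across encoders
    crf_points: tuple   # operating points that trace the rate-quality curve


# Hypothetical ladder: every value here is a placeholder.
CONFIGS = [
    EncoderConfig("x264", "libx264", "slow", (20, 24, 28, 32)),
    EncoderConfig("x265", "libx265", "slow", (22, 26, 30, 34)),
    EncoderConfig("SVT-AV1", "libsvtav1", "5", (26, 32, 38, 44)),
]


def build_commands(source, configs):
    """Return one full command line per (encoder, operating point)."""
    commands = []
    for cfg in configs:
        for crf in cfg.crf_points:
            output = f"{cfg.name}_crf{crf}.mkv"
            args = [
                "ffmpeg", "-i", source,
                "-c:v", cfg.ffmpeg_codec,
                "-preset", cfg.preset,
                "-crf", str(crf),
                "-an",  # drop audio so the file size reflects video only
                output,
            ]
            commands.append(" ".join(args))
    return commands


commands = build_commands("master.y4m", CONFIGS)
for line in commands:
    print(line)
```

Keeping the command line as a recorded string, rather than rebuilding it later, is what lets it go into the manifest verbatim.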

Note for codec internals. This article is about measuring codecs, not about how they compress. For how H.264, HEVC, and AV1 actually work, and how the encoder implementations differ, see the Video Encoding section's codec comparison and encoder implementations articles. We link to the cause side and stay on the measurement side.

The metrics and models: named, pinned, and honest about blind spots

A quality score means nothing until you name the metric, the model, and the conditions. "VMAF 95" is not a measurement; "VMAF 95 on the default v0.6.1 model, 1080p, mean-pooled" is. So every score we publish carries that full specification.

Our primary perceptual metric is VMAF (Video Multimethod Assessment Fusion), Netflix's metric that fuses several quality features into a 0–100 score trained to predict human opinion. VMAF ships several models, and the model is part of the measurement: the default model predicts quality on a 1080p living-room television, the 4K model predicts it on a 4K television viewed at 1.5 times the screen height, and the phone model predicts it on a handset, where the same encode scores higher because impairments are harder to see on a small screen (Netflix VMAF documentation; models.md). We state which model every number uses, because reporting a phone-model score as if it were a television score overstates quality. The deeper treatment of model selection lives in VMAF in depth; here the rule is simply that we always name the model and version.

We do not report VMAF alone. Alongside it we compute PSNR (Peak Signal-to-Noise Ratio, a measure of pixel error in decibels) and SSIM (Structural Similarity, a 0–1 measure of structural fidelity), because each metric has a different blind spot and seeing them together catches errors a single number hides. Every objective metric is a proxy validated against human scores, and when a metric and a careful human viewing disagree, the viewing wins — that is the whole basis of validating metrics against human scores and of subjective testing as the ground truth. We name where each metric lies for the content under test, because every metric lies somewhere.

One measurement rule is easy to get wrong and quietly invalidates a benchmark: a full-reference metric needs both frames at the same size. When we encode a clip at a lower resolution to test a ladder rung, we decode it and upscale it back to the source resolution before scoring, using a stated upscaler held constant across the whole comparison. Scoring a 540p encode against a 1080p source without upscaling does not measure quality; it measures a size mismatch. The same discipline underlies the convex hull that ties resolution to bitrate.
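As a rough illustration of the upscale-then-score rule, the sketch below builds an FFmpeg command that scales the decoded encode back to the source size before handing both streams to the libvmaf filter. The file names and the bicubic upscaler are assumptions for the example, and the filter options (model=version=..., log_fmt, log_path) follow the FFmpeg filter documentation for builds linked against libvmaf 2.x; confirm them against your own build. Frame-rate and pixel-format alignment are deliberately left out.

```python
def vmaf_command(distorted, reference, width, height,
                 model_version="vmaf_v0.6.1",
                 upscaler="bicubic",
                 log_path="vmaf.json"):
    """Build (not run) an FFmpeg command that upscales, then scores.

    libvmaf takes the distorted stream as its first input and the
    reference as its second, so the scaled encode goes first.
    """
    graph = (
        f"[0:v]scale={width}:{height}:flags={upscaler}[dist];"
        f"[dist][1:v]libvmaf=model=version={model_version}"
        f":log_fmt=json:log_path={log_path}"
    )
    return [
        "ffmpeg",
        "-i", distorted,   # input 0: the encode under test
        "-i", reference,   # input 1: the pristine master
        "-lavfi", graph,
        "-f", "null", "-",
    ]


# Hypothetical file names: a 540p rung scored against a 1080p master.
cmd = vmaf_command("enc_540p.mkv", "master_1080p.y4m", 1920, 1080)
print(" ".join(cmd))
```

The upscaler name appears in the command itself, so recording the command also records the resolution handling.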

What we pin	What we record	Where it lies if you skip it
Source master	Resolution, chroma (4:2:0), bit depth, frame count	A pre-compressed source rewards copying existing artifacts
Encoder build	Exact version (e.g. x265 4.2, SVT-AV1 4.x)	New encoder versions move the numbers silently
Encoder settings	Full command line, rate-control mode, preset	Mismatched presets make a speed gap look like a codec gap
Quality metric	Metric + model + version (e.g. VMAF default v0.6.1)	A phone-model score read as a TV score overstates quality
Resolution handling	Upscaler, applied before scoring	Scoring across sizes measures the mismatch, not quality
Pooling	Mean / harmonic / percentile + confidence interval	The mean hides the worst seconds the viewer remembers

Table 1. The apples-to-apples controls. The left column is what we hold fixed across a comparison; the right column is the error you get if you do not. A benchmark is only valid when every row matches between the encoders being compared.

[Figure: comparison table of the benchmark controls we pin, what we record for each, and the error introduced if the control is skipped]
Figure 2. The same controls as a visual reference: what we pin, what we record, and where each one lies if it is skipped.

Pooling and confidence: one number, plus how sure we are of it

A metric produces a score for every frame, and the summary number you report depends entirely on how you combine them — a step called pooling that is easy to do in a way that flatters the result. The arithmetic mean is the default, and it is also the most forgiving: it lets a few excellent seconds paper over a stretch of bad ones. Because a viewer remembers the worst moment, not the average, we report a low percentile or the harmonic mean alongside the mean, so the worst seconds are visible. The full treatment of why this matters is pooling per-frame scores into one number; the rule here is that we always state the pooling method and never report the mean alone.
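To see how much the pooling choice changes the summary, consider a minimal sketch that pools one list of per-frame scores three ways. The scores are synthetic, the 5th percentile is an assumed choice of "low percentile", and the plain harmonic mean from Python's statistics module stands in for whatever variant a given tool implements.

```python
import math
import statistics


def percentile(values, pct):
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def pool(frame_scores, low_pct=5):
    """Summarise per-frame scores so the worst stretch stays visible."""
    return {
        "mean": statistics.fmean(frame_scores),
        # harmonic_mean rejects negative scores and returns 0 if any score is 0
        "harmonic_mean": statistics.harmonic_mean(frame_scores),
        f"p{low_pct}": percentile(frame_scores, low_pct),
        "min": min(frame_scores),
    }


# Synthetic clip: 90 good frames and a 10-frame bad stretch.
scores = [95.0] * 90 + [60.0] * 10
summary = pool(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On this synthetic clip the mean stays above 91 while the 5th percentile sits at 60, which is the gap the mean alone would hide.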

We also report how sure we are of a VMAF number, because a score trained on human opinion carries uncertainty. VMAF can attach a 95% confidence interval to each prediction, computed by bootstrapping — training many models on resampled training data and measuring how much their predictions vary (Netflix VMAF documentation, conf_interval.md, since v1.3.7, 2018). With a bootstrap model, the 95% interval is roughly the score plus or minus 1.96 times the standard deviation of those predictions. The practical consequence is a discipline we apply without exception: if two encoders' intervals overlap, we do not declare a winner. A 0.4-VMAF difference inside a ±1.5-VMAF interval is noise, and calling it a win is the most common way a benchmark misleads.
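The no-winner rule reduces to a few lines of arithmetic. The sketch below assumes you already have a pooled score and a bootstrap standard deviation for each encoder; the function names and the example numbers are hypothetical.

```python
def confidence_interval(score, bootstrap_std, z=1.96):
    """Approximate 95% interval: score plus or minus z * bootstrap std."""
    half_width = z * bootstrap_std
    return (score - half_width, score + half_width)


def intervals_overlap(a, b):
    """True when two (low, high) intervals share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


def verdict(score_a, std_a, score_b, std_b):
    """Name a winner only when the two intervals are disjoint."""
    ci_a = confidence_interval(score_a, std_a)
    ci_b = confidence_interval(score_b, std_b)
    if intervals_overlap(ci_a, ci_b):
        return "no winner"
    return "A" if score_a > score_b else "B"


# A 0.4-point gap inside intervals of about +/-1.5 points is not a win.
std = 1.5 / 1.96
close_call = verdict(93.0, std, 93.4, std)
clear_call = verdict(90.0, 0.5, 95.0, 0.5)
print(close_call, clear_call)
```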

BD-rate: the right way to say "X% smaller at the same quality"

The headline number in any codec comparison is the bitrate saving at equal quality, and the standard way to express it is BD-rate (Bjøntegaard Delta rate) — the average percentage difference in bitrate between two codecs at matched quality, across a range of quality (Bjøntegaard, VCEG-M33, 2001). The sign convention matters: a BD-rate of −40% means the test codec needs 40% less bitrate than the anchor codec to reach the same quality. It is a saving, not a quality score — a point worth repeating because the two are constantly confused.

The idea is geometric. Plot each codec's rate-quality curve, with bitrate on a logarithmic horizontal axis and quality on the vertical axis, and BD-rate is the average horizontal gap between the two curves over the quality range where they overlap. "Horizontal gap" is the key phrase: you read the bitrate difference at a fixed quality, not the quality difference at a fixed bitrate.

[Figure: two rate-quality curves with the overlapping quality band shaded and horizontal arrows showing the bitrate saving at equal quality that BD-rate averages]
Figure 3. BD-rate as the average horizontal gap. The anchor and test curves are compared only over the quality range both reach (the shaded band); at each quality level the horizontal distance is the bitrate saving, and BD-rate is the average of those savings in the log-bitrate domain.

A worked example

Take one clip, encoded at four operating points with each codec, scored with VMAF on the default model. The numbers below are illustrative, chosen to make the arithmetic legible — the real measured values appear in the codec comparison, not here.

VMAF (matched quality)	Anchor (x264) bitrate	Test (SVT-AV1) bitrate	Saving
82	1000 kbps	500 kbps	50%
90	2000 kbps	1000 kbps	50%
95	4000 kbps	2000 kbps	50%
98	8000 kbps	4000 kbps	50%

Table 2. An illustrative rate-quality comparison. At each matched VMAF level the test codec uses exactly half the bitrate of the anchor. Numbers are synthetic, for demonstration only.

At each matched quality level the test codec uses half the bitrate, so the bitrate ratio is 0.5. BD-rate averages that ratio in the logarithmic domain, then converts back:

log10(0.5)            = -0.301        (the log-domain gap, constant here)
average over overlap  = -0.301        (same at every quality level)
BD-rate = 10^(-0.301) - 1
        = 0.5 - 1
        = -0.50  →  -50%

A BD-rate of −50%: the test codec reaches the same quality at half the bitrate, across the measured range. Real curves are never this clean — the saving varies with quality, the curves are not parallel, and they rarely overlap across the full range — which is exactly why the computation has to be done carefully rather than eyeballed.
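The same arithmetic can be checked in a few lines. This sketch uses the synthetic bitrates from Table 2 and takes a plain mean of the log-domain gaps at the four matched quality levels; that shortcut equals the integral over the overlap only because the gap is constant here, so it is a check on the example, not a general BD-rate routine.

```python
import math

# Synthetic values from Table 2 (kbps at VMAF 82, 90, 95, 98).
anchor_kbps = [1000, 2000, 4000, 8000]
test_kbps = [500, 1000, 2000, 4000]

# Horizontal gap at each matched quality level, in the log-bitrate domain.
log_gaps = [math.log10(t / a) for a, t in zip(anchor_kbps, test_kbps)]

# Average the gap, then convert back from the log domain to a ratio.
mean_log_gap = sum(log_gaps) / len(log_gaps)
bd_rate = 10 ** mean_log_gap - 1

print(f"mean log10 gap: {mean_log_gap:.3f}")
print(f"BD-rate: {bd_rate:.1%}")
```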

Doing the computation right

The original 2001 method fit a single cubic curve through the points. Later work, including the metric author's own follow-up and recent academic analysis, showed that a plain cubic can overshoot between points and distort the result, and that two fixes matter (Bjøntegaard, VCEG-AL22, 2008; Herglotz et al., "The Bjøntegaard Bible," 2023). We apply both:

Use a shape-preserving interpolation, not a plain cubic. A piecewise cubic such as PCHIP, which cannot overshoot its data points, or the similar Akima method, which damps overshoot, avoids the spurious wiggles a single cubic introduces. The Bjøntegaard Bible analysis found a plain cubic spline "can lead to overshoots (Runge's phenomenon)... hence CSI [cubic spline interpolation] should not be used in practice," and recommends PCHIP or Akima instead. We report which interpolation we used.
Integrate in the log-bitrate domain. Averaging the bitrate difference in the logarithm keeps the result from being dominated by the high-bitrate end, where a few hundred kbps swamps everything below it.
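The two fixes above can be sketched together with SciPy's PchipInterpolator. This is a minimal illustration under stated assumptions, not the toolkit's code: it fits log10(bitrate) as a function of quality, integrates both fits over the shared quality range, and omits the log transform for saturating metrics. The function names are hypothetical.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def _fit_log_rate(rates, qualities):
    """Fit log10(bitrate) as a shape-preserving function of quality."""
    q = np.asarray(qualities, dtype=float)
    log_r = np.log10(np.asarray(rates, dtype=float))
    order = np.argsort(q)  # PCHIP needs strictly increasing x values
    return PchipInterpolator(q[order], log_r[order]), q.min(), q.max()


def bd_rate(anchor_rates, anchor_quality, test_rates, test_quality):
    """Average bitrate difference at equal quality, as a fraction."""
    anchor_fit, a_lo, a_hi = _fit_log_rate(anchor_rates, anchor_quality)
    test_fit, t_lo, t_hi = _fit_log_rate(test_rates, test_quality)

    # Integrate only where both curves have data; never extrapolate.
    lo, hi = max(a_lo, t_lo), min(a_hi, t_hi)
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")

    # Mean horizontal gap in the log-bitrate domain over [lo, hi].
    gap_area = test_fit.integrate(lo, hi) - anchor_fit.integrate(lo, hi)
    mean_log_gap = gap_area / (hi - lo)
    return float(10.0 ** mean_log_gap - 1.0)


# Synthetic points from the worked example (Table 2).
quality = [82, 90, 95, 98]
result = bd_rate([1000, 2000, 4000, 8000], quality,
                 [500, 1000, 2000, 4000], quality)
print(f"BD-rate: {result:.1%}")
```

Because the fit is done on log10(bitrate), the integration is in the log domain by construction, and a negative result means the test codec needs less bitrate than the anchor.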

Two more rules guard against the errors that quietly break BD-rate:

Only compare over the overlapping quality range. BD-rate is defined on the quality interval both codecs actually reach; extending it by extrapolation produces large errors. The Bible reports a case where poor overlap turned a true −43% into −49% — a six-point error from a single non-overlapping point. We check the overlap explicitly and report it.
A small BD-rate can be inside the noise. When the measured saving is smaller than the method's own error for that content and metric, we say so rather than report a decisive-sounding number. For saturating metrics like VMAF and SSIM, we also apply the log transform the research recommends, because those metrics flatten out at the top of their range and distort the curve fit otherwise.
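An explicit overlap check can be as simple as comparing the quality ranges the two codecs reached. The sketch below reports the shared interval and an intersection-over-union ratio; the example scores are invented, and any acceptance threshold you apply to the ratio is your own assumption rather than a value taken from this article.

```python
def quality_overlap(anchor_quality, test_quality):
    """Return (low, high, iou) for the quality range both codecs reach."""
    a_lo, a_hi = min(anchor_quality), max(anchor_quality)
    t_lo, t_hi = min(test_quality), max(test_quality)

    lo, hi = max(a_lo, t_lo), min(a_hi, t_hi)       # shared range
    union = max(a_hi, t_hi) - min(a_lo, t_lo)       # range either reaches
    overlap = max(0.0, hi - lo)
    iou = overlap / union if union > 0 else 0.0
    return lo, hi, iou


# Invented scores: the test codec starts and ends higher than the anchor.
anchor_scores = [80, 88, 94, 97]
test_scores = [86, 92, 96, 99]
lo, hi, iou = quality_overlap(anchor_scores, test_scores)
print(f"overlap: {lo}-{hi}, IoU: {iou:.2f}")
```

A low ratio is a signal to add operating points until the curves cover the same quality range, rather than to extrapolate.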

The downloadable toolkit below computes BD-rate this way — shape-preserving interpolation, log-domain integration, an explicit overlap check, and the VMAF log transform — and reproduces the worked example above exactly.

A common mistake: the benchmark that proves whatever you want

The classic bad benchmark is not faked; it is simply unfair in a way the reader cannot see. Someone compares the slow, high-effort preset of their preferred encoder against a fast preset of the competitor, scores with whichever metric flatters their codec, reports a single mean that hides the content where they lose, picks bitrates where the curves barely overlap, and quotes one number with no version and no confidence interval. Every individual choice is defensible in isolation; together they manufacture a conclusion.

The defense is the discipline in this article, and it is worth stating as a checklist a reader can apply to anyone's benchmark, including ours: name the content, pin and disclose the encoder versions and settings, hold the effort level comparable, name the metric and model, pool with the worst frames visible, compute BD-rate over the real overlap with a shape-preserving fit, and refuse to call a winner inside the confidence interval. A benchmark that cannot answer these questions is an opinion wearing a number's clothes. Reading any quality report this way is the skill taught in reading a quality-metric report without fooling yourself.

Provenance: every result carries its own receipt

The single thing that separates a citable benchmark from a forgettable one is provenance — the record that lets someone else reproduce the number. So every result we publish ships with a manifest that states the content set and clip count, the exact encoder versions and full command lines, the metric, model, and version, the pooling method and confidence interval, the resolution handling, the hardware, and the date. The date matters because encoders and metrics both evolve; a benchmark is a snapshot, valid for the versions and settings it names, and re-validated when those change.
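One way to make such a manifest checkable is to store it as structured data and refuse to publish when a field is missing. The field names and the angle-bracket placeholders below are hypothetical, chosen to mirror the list above; they are not the schema our toolkit uses.

```python
import json

# Fields named in the text; the key names themselves are an assumption.
REQUIRED_FIELDS = (
    "content_set", "clip_count", "encoders", "metric",
    "pooling", "resolution_handling", "hardware", "date",
)


def missing_fields(manifest):
    """List required fields that are absent or empty."""
    return [
        field for field in REQUIRED_FIELDS
        if field not in manifest or manifest[field] in ("", None)
    ]


# Placeholder values only: nothing here is a measured result.
manifest = {
    "content_set": "<content category>",
    "clip_count": "<number of clips>",
    "encoders": [
        {"name": "x265", "version": "<exact version>",
         "command": "<full command line>"},
    ],
    "metric": {"name": "VMAF", "model": "<model>", "version": "<version>"},
    "pooling": {"method": "<mean + low percentile>",
                "confidence_interval": "<95% CI>"},
    "resolution_handling": "<upscaler, applied before scoring>",
    "hardware": "<machine description>",
    "date": "<YYYY-MM-DD>",
}

complete = missing_fields(manifest)
incomplete = missing_fields({k: v for k, v in manifest.items() if k != "hardware"})
print(json.dumps(manifest, indent=2))
print("missing:", incomplete)
```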

[Figure: provenance manifest card listing the content set, encoder versions and commands, metric model and version, pooling, hardware, and date that every published result carries]
Figure 4. The provenance record. A benchmark number is citable only when a reader can reconstruct it; this manifest is the receipt attached to every result in the block, and the downloadable toolkit generates and checks it.

Where Fora Soft fits in

Fora Soft has built video streaming, OTT, conferencing, e-learning, telemedicine, and surveillance software since 2005, and every one of those products lives or dies on the encoding and delivery choices behind it. We measure delivered quality the same way we ask you to trust it: on real content, with disclosed encoders and settings, against named metrics and models, with the worst frames and the confidence interval in view. The benchmarks in this block are not a literature review — they are our own measurements on the kind of footage our clients actually ship, which is why they are worth citing and why the method is published in full. When a quality target has to be defended to a CDN budget or a product owner, this is the discipline that makes the number hold up.

References

G. Bjøntegaard. "Calculation of Average PSNR Differences Between RD-Curves." ITU-T SG16 Q.6 VCEG, document VCEG-M33, Austin, TX, USA, April 2001. Tier 1 (metric author's defining document). The original BD-rate / BD-PSNR method: fit a curve through rate-quality points and report the average bitrate difference at equal quality. Basis for the BD-rate definition and sign convention. https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc
G. Bjøntegaard. "Improvements of the BD-PSNR Model." ITU-T SG16 Q.6 VCEG, document VCEG-AL22, Berlin, Germany, July 2008. Tier 1 (metric author follow-up). Refines the original method toward piecewise interpolation with logarithmic bitrate for more stable results. Basis for the log-domain and shape-preserving-fit rules. https://www.itu.int/wftp3/av-arch/video-site/0807_Ber/VCEG-AL22.doc
C. Herglotz, H. Och, A. Meyer, et al. "The Bjøntegaard Bible — Why Your Way of Comparing Video Codecs May Be Wrong." IEEE (accepted December 2023); arXiv:2304.12852. Tier 5 (peer-reviewed). Analyzes BD computation: cubic splines overshoot and should be avoided (use PCHIP/Akima), saturating metrics (VMAF, SSIM) need a log transform, the quality ranges must overlap (check intersection-over-union), and a few supporting points suffice for rough estimates. Basis for the "doing the computation right" section and the overlap-error figure (−43% vs −49%). https://arxiv.org/pdf/2304.12852
Netflix. "VMAF — Confidence Interval" (conf_interval.md), VMAF Development Kit documentation. Accessed 2026-06-25. Tier 1 (metric author documentation). Since v1.3.7 (June 2018) each VMAF prediction can carry a 95% confidence interval computed by bootstrapping; with a bootstrap model the 95% CI is approximately the score ± 1.96 × the bootstrap standard deviation. Basis for the confidence-interval and overlapping-interval rules. https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md
Netflix. "VMAF Models" (models.md) and VMAF FAQ, VMAF Development Kit documentation. Accessed 2026-06-25. Tier 1 (metric author documentation). The default v0.6.1 model targets a 1080p living-room TV; the 4K model targets a 4K TV at 1.5× screen height; the phone model targets handset viewing; the NEG model resists enhancement gaming. Basis for the model-selection rule. https://github.com/Netflix/vmaf/blob/master/resource/doc/models.md
Z. Li, C. Bampis, J. Novak, et al. "VMAF: The Journey Continues." Netflix Technology Blog, 2018. Tier 4 (credible deployer engineering blog). How VMAF fuses features trained on subjective scores, why a single metric needs uncertainty quantification, and how Netflix deploys it at scale. Basis for the VMAF framing. https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12
Alliance for Open Media. "AV1 Common Test Conditions (CTC)." Accessed 2026-06-25. Tier 1 (standards-body methodology). Defines the AV1 benchmark recipe: a fixed clip set across resolutions in 4:2:0, natural-video and screen-content categories, and a set of CRF/QP supporting points {17,22,27,32,37,42} for BD-rate. Basis for the common-test-conditions discipline and the operating-point ladder. https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf
JVET (ITU-T/ISO/IEC). "JVET Common Test Conditions and Software Reference Configurations," document JVET-J1010, 2018. Tier 1 (standards-body methodology). The CTC for VVC-era codec testing: test-sequence classes, random-access GOP-16 configuration, and the four QP points {22,27,32,37}. Basis for the operating-point and apples-to-apples rules. https://www.itu.int/wftp3/av-arch/jvet-site/2018_04_J_SanDiego/JVET-J1010.zip
Recommendation ITU-T P.1401 (07/2020). "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models." International Telecommunication Union, 2020. Tier 1 (official standard). The procedure for evaluating an objective metric against subjective scores — Pearson correlation, Spearman rank correlation, RMSE, and the rule that a difference smaller than the uncertainty is not a difference. Basis for the metric-validation and noise-threshold rules. https://www.itu.int/rec/T-REC-P.1401
Recommendation ITU-R BT.500-15 (2023). "Methodologies for the subjective assessment of the quality of television pictures." International Telecommunication Union, 2023. Tier 1 (official standard). The subjective ground truth every objective metric is validated against; when a metric and a careful viewing disagree, the viewing wins. Basis for the measurement-honest framing. https://www.itu.int/rec/R-REC-BT.500
Recommendation ITU-T P.910 (2023). "Subjective video quality assessment methods for multimedia applications." International Telecommunication Union, 2023. Tier 1 (official standard). Defines the subjective test methods (ACR, DCR, PC) and viewing conditions used to validate objective metrics. Basis for the subjective-validation reference. https://www.itu.int/rec/T-REC-P.910
Y. Reznik, et al. "Revisiting Bjøntegaard Delta Bitrate (BD-BR) Computation for Codec Comparison." ACM Mile-High Video (MHV), 2022. Tier 5 (peer-reviewed). Reviews practical BD-BR computation choices — interpolation, integration domain, and supporting-point placement — for streaming codec comparison. Basis for the BD-rate implementation choices. https://www.reznik.org/papers/MHV22_BD_BR-CameraReady.pdf
FFmpeg. "libvmaf, psnr, and ssim filter documentation." FFmpeg Filters Documentation, accessed 2026-06-25. Tier 3 (first-party tooling). The filters that compute VMAF, PSNR, and SSIM in a reproducible command line, including model selection and the scaling that full-reference metrics require. Basis for the measurement-tooling and upscaling rules. https://ffmpeg.org/ffmpeg-filters.html
