Mode Decision and Rate-Distortion Optimization (RDO)

Why this matters

An encoder preset is mostly a recipe for how aggressively to run RDO, so understanding mode decision is the difference between picking a preset by superstition and picking one because you know what it actually does. A product manager who knows that x265's --rd 5 adds rate-distortion optimized quantization, where --rd 3 does not, will stop assuming that "higher is always better" and start asking whether the extra ~20% CPU buys the few percent of bitrate that matters at their scale. A streaming engineer who can read an encoder log and see how many of the 35 intra-prediction modes were short-listed by SATD before full RDO will know exactly where the speed went. A founder who is comparing AV1 vendors will recognise that two AV1 encoders running at the same preset number can do completely different mode searches and produce different files. This article walks through what mode decision is, why RDO is the mathematically correct way to choose, how lambda is computed from QP, how every shipping encoder approximates the full search to stay fast, and what the practical knobs are.

What "mode" means in a modern codec

A modern video encoder doesn't compress a frame as one blob. It splits the frame into a quadtree of coding blocks (HEVC's CTUs of up to 64×64, AV1's superblocks of 128×128, VVC's CTUs of 128×128) and then, inside each block, it picks coding choices independently. The collection of choices for one block is what we call its mode. A mode is a small bundle of decisions:

How to predict the pixels. Use a neighbour in the same frame (intra), or copy a patch from another frame (inter). Inside intra, pick one of 35 modes in HEVC (33 angular plus planar and DC), one of 56 directional modes plus several non-directional ones in AV1, or one of 67 modes in VVC (65 angular plus planar and DC). Inside inter, pick one or two reference frames, a motion vector for each, and a sub-pixel interpolation filter.
How to split the block. Keep it as one large block, or partition into four smaller ones; HEVC has square partitions only, AV1 has square plus a handful of rectangular ones, VVC has a much richer set including ternary splits.
Which transform to apply to the prediction residual. DCT-II by default, but HEVC offers an alternate transform for 4×4 intra blocks, AV1 offers DCT, ADST and identity in 16 combinations, and VVC offers Multiple Transform Selection (MTS) and Low-Frequency Non-Separable Transform (LFNST) on top.
How aggressively to quantise each transform coefficient. The block QP can be modified relative to the slice QP; RDOQ (rate-distortion optimized quantization) can replace the rounded coefficient with a slightly different value that codes more cheaply.

Every one of those decisions is a discrete choice, and the encoder must make them all before it knows the final bit cost and final reconstruction quality of the block. The number of plausible combinations per block is enormous — in VVC, the raw count of legal modes inside one CTU runs into the hundreds of millions if you really enumerate everything. An encoder that wants to ship a real-time stream cannot test all of them. RDO is the framework that says how to compare any two candidates and pick the better one; the practical encoder is then a careful set of shortcuts that prunes the search down to thousands of candidates per CTU, not hundreds of millions.

Figure 1. Inside a coding block, the encoder tests a fan of candidate modes. Each candidate produces its own (R, D) pair. RDO collapses the pair into a single number J = D + λR. The winner is the candidate with the lowest J.

The rate-distortion problem in one paragraph

Compression always involves a trade. Spend more bits and the reconstruction is closer to the original; spend fewer bits and the reconstruction drifts further away. A plot of distortion against bitrate, called a rate-distortion curve, is monotonically decreasing — quality goes up as bitrate goes up — and a good encoder is one that operates on the lower-left envelope of that curve for any given content. The mode-decision problem is exactly the problem of staying on that envelope: out of all the encodings of one block that the codec syntax allows, pick the one that sits on the envelope rather than above it.

Mathematically this is a constrained optimisation. You want to minimise total distortion D across the picture subject to total bitrate R ≤ R_target. The constrained form is hard because choices in one block constrain choices in another (they share a target). The trick that makes the problem tractable, due to Sullivan and Wiegand in their 1998 IEEE Signal Processing Magazine paper "Rate-distortion optimization for video compression", is to convert the constrained problem into an unconstrained one using a Lagrange multiplier:

J = D + λ · R

J is the rate-distortion cost of one coding choice. D is its distortion (typically sum of squared differences, SSD, between the original block and the reconstructed block). R is its rate in bits. λ is a single positive number — the Lagrange multiplier — that fixes the slope of the trade. When λ is small, the term λR is small, so the encoder mostly minimises D — it spends bits freely to chase quality. When λ is large, the term λR dominates, so the encoder mostly minimises R — it accepts more distortion in exchange for fewer bits. There is one specific λ that, for any given content, lands you on the rate-distortion envelope. Pick λ, then choose the mode that minimises J in every block, and you have the lowest total distortion at the rate that this λ produces.
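
As a minimal sketch of the comparison rule above, the Python snippet below picks the candidate with the lowest J from a list of (distortion, rate) pairs. The candidate names and all numbers are made up for illustration; a real encoder obtains D and R from its own reconstruction and entropy coder.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical label for the coding choice
    distortion: float  # D, e.g. SSD against the original block
    rate_bits: float   # R, bits needed to code this choice


def rd_cost(c: Candidate, lam: float) -> float:
    """Lagrangian cost J = D + lambda * R for one candidate."""
    return c.distortion + lam * c.rate_bits


def best_mode(candidates: list[Candidate], lam: float) -> Candidate:
    """Return the candidate with the lowest J."""
    return min(candidates, key=lambda c: rd_cost(c, lam))


# Illustrative numbers only, not measurements from any encoder.
candidates = [
    Candidate("skip", distortion=9000.0, rate_bits=2.0),
    Candidate("inter_16x16", distortion=4000.0, rate_bits=60.0),
    Candidate("intra_angular", distortion=2500.0, rate_bits=220.0),
]

for lam in (1.0, 30.0, 300.0):
    print(lam, best_mode(candidates, lam).name)
```

With these numbers a small λ selects the low-distortion intra candidate, a mid-range λ selects the inter candidate, and a large λ falls back to skip, which is the behaviour the paragraph describes.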

The deep reason this works is convexity. As you sweep λ from zero to infinity, the optimal operating point traces out the convex hull of the rate-distortion curve. Every point on that hull is reachable by some λ. So instead of running a hard global search, the encoder runs an easy local search at every block — and as long as it picks the right λ globally, the local decisions add up to a globally near-optimal encoding. The construction is exact for convex problems and a very good heuristic for the non-convex real one.
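
The convexity argument can be checked numerically. The sketch below, using a hypothetical set of five (rate, distortion) operating points, sweeps λ and records which point minimises J each time. The point (30, 55) sits above the line joining its neighbours, so no λ ever selects it.

```python
def sweep_lambda(points, lambdas):
    """Return every (rate, distortion) point that minimises J for some lambda."""
    chosen = set()
    for lam in lambdas:
        # J = D + lambda * R, with each point stored as (R, D)
        chosen.add(min(points, key=lambda p: p[1] + lam * p[0]))
    return chosen


# Hypothetical operating points for one block: (rate in bits, distortion).
points = [(10, 100), (20, 60), (30, 55), (40, 30), (60, 25)]

lambdas = [i / 100 for i in range(0, 1001)]  # 0.00, 0.01, ..., 10.00
hull = sorted(sweep_lambda(points, lambdas))
print(hull)
```

The points that survive the sweep are the vertices of the lower convex hull; anything above the hull is unreachable by Lagrangian selection, which is why the method is only a heuristic when the true curve is not convex.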

Lambda — the one number that controls everything

Every modern encoder picks λ as a function of the quantisation parameter QP. The relationship is exponential: as QP goes up by 6, the quantisation step roughly doubles. Distortion is measured as squared error, which scales with the square of the step, so λ roughly quadruples over the same 6 steps (it doubles every 3). The HEVC reference encoder HM uses

λ_mode = α · 2^((QP − 12) / 3)

where α is a factor that depends on the encoder configuration and picture type: 0.85 for the Random Access configuration, lower for I-frames, slightly higher for B-frames at deep temporal levels. The same family of formulas, with different constants, ships in x264, x265, VVenC, libvpx, libaom, SVT-AV1 and VVC's VTM. The Wiegand-Girod paper "Lagrange multiplier selection in hybrid video coder control" (2001) is the classical reference; the Frontiers 2023 survey "The disparity between optimal and practical Lagrangian multiplier estimation in video encoders" is a current restatement of the same story.

A worked example. Suppose you are encoding at QP = 27, which is a common streaming QP for 1080p H.264.

QP − 12 = 15.
(QP − 12) / 3 = 5.
2^5 = 32.
λ_mode = 0.85 × 32 = 27.2.

So at QP = 27, the encoder weighs every bit of rate against about 27 units of squared-error distortion. If candidate A costs 100 bits more than candidate B but reduces SSD by 3000, then for A the extra cost in J is 100 × 27.2 = 2720, the saving in J is 3000, and A wins by 280. If A only reduced SSD by 2500, A would lose by 220 — even though A is the higher-quality block — because the extra 100 bits weren't worth it at this λ.
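
The arithmetic above is easy to reproduce. The following sketch assumes the HM-style formula quoted earlier with α = 0.85; the function names are illustrative and do not come from any encoder's source.

```python
def lambda_mode(qp: float, alpha: float = 0.85) -> float:
    """lambda_mode = alpha * 2^((QP - 12) / 3), as in the formula above."""
    return alpha * 2.0 ** ((qp - 12.0) / 3.0)


def delta_j(extra_bits: float, ssd_saving: float, lam: float) -> float:
    """Change in J when candidate A replaces B: negative means A wins."""
    return extra_bits * lam - ssd_saving


lam = lambda_mode(27)                 # about 27.2
print(lam)
print(delta_j(100, 3000, lam))        # about -280: A wins
print(delta_j(100, 2500, lam))        # about +220: A loses

# The exponent (QP - 12) / 3 means lambda doubles every 3 QP steps
# and quadruples every 6.
print(lambda_mode(33) / lambda_mode(27))
```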

A second formula, λ_motion = √λ_mode, is used during motion estimation, where distortion is measured in sum-of-absolute-differences (SAD) units rather than SSD units; SAD scales roughly like the square root of SSD, so √λ_mode lines the two up. Concretely, λ_motion for the example above is √27.2 ≈ 5.22, so the motion-estimation search will prefer a candidate motion vector that costs 1 extra bit only if it cuts SAD by more than about 5 absolute-difference units.
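
A small sketch of the motion-estimation cost, assuming the simple form SAD + λ_motion · bits. The three motion-vector candidates and their SAD and bit counts are hypothetical.

```python
import math


def lambda_motion(lam_mode: float) -> float:
    """Motion-search multiplier: the square root of lambda_mode."""
    return math.sqrt(lam_mode)


def motion_cost(sad: float, mv_bits: float, lam_motion: float) -> float:
    """Cost used inside the motion search: SAD + lambda_motion * R_mv."""
    return sad + lam_motion * mv_bits


def best_mv(candidates: dict, lam_motion: float) -> str:
    """Return the name of the motion vector with the lowest cost."""
    return min(candidates, key=lambda n: motion_cost(*candidates[n], lam_motion))


lm = lambda_motion(27.2)

# Hypothetical candidates: name -> (SAD, bits to code the vector).
mvs = {"A": (400, 6), "B": (396, 7), "C": (390, 7)}

# B saves 4 SAD units for 1 extra bit, less than lambda_motion, so A still wins.
print(best_mv({"A": mvs["A"], "B": mvs["B"]}, lm))
# C saves 10 SAD units for 1 extra bit, more than lambda_motion, so C wins.
print(best_mv({"A": mvs["A"], "C": mvs["C"]}, lm))
```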

The numerical sensitivity of λ is what makes it a real engineering concern. A vendor who decides to multiply λ by 1.2 across the board ("save bitrate at high QP") will trade away a measurable amount of PSNR; a vendor who divides λ by 1.2 will eat extra bits without buying enough quality. Production encoders ship with carefully tuned λ tables, and most encoder tunings — --tune psnr, --tune ssim, --tune vmaf — are at heart small adjustments to the λ that the encoder uses inside its RDO loops.

Figure 2. The geometric meaning of lambda. On the rate-distortion curve (left), -λ is the slope of the tangent line at the optimal operating point. The relationship between λ and QP (right) is exponential — every increase of 6 in QP roughly quadruples lambda.

The full-RDO loop, step by step

A textbook full-RDO mode decision for a single block runs the following loop. The notation follows the HEVC HM reference encoder, but every modern encoder does some version of this.

1. Enumerate candidate modes. Build a list of all the modes the syntax allows for this block. For a 32×32 inter block in HEVC, the list includes 35 intra modes, two skip candidates, several merge candidates, multiple reference-frame and motion-vector combinations, four partition splits, and so on: hundreds of entries.
2. For each candidate:
   a. Run the actual encoding tools. Generate the prediction. Subtract it from the original to get the residual. Transform the residual (apply DCT-II, or whatever this candidate selects). Quantise the transform coefficients with the current QP.
   b. Apply RDOQ if this RD level allows it: instead of straight rounding, try a few neighbouring integer levels for each significant coefficient and keep the one that minimises J = D + λR_coeff.
   c. Entropy-code the quantised coefficients with the actual context-adaptive coder (CABAC for HEVC and VVC; comparable context-adaptive arithmetic coders for AV1 and AVS3). Count the actual bits R produced.
   d. Inverse-quantise the coefficients, inverse-transform the residual, add the prediction back, and compute the reconstructed block. Measure SSD against the original to get D.
   e. Compute J = D + λ · R.
3. Pick the candidate with the lowest J. Commit that mode, write its bits to the bitstream, and save its reconstruction to the reference buffer.

This loop is exact in the sense that every cost is the real cost: real bits from a real entropy coder, real distortion from a real reconstruction. Pure RDO is what you get when you set x265 to --rd 6 with --no-fast-intra and a tight --rd-refine, or when you run the VVC reference encoder VTM with the default Random Access configuration. It is also what no one ships in production. A single full-RDO 4K encode on a modern CPU can run hundreds of times slower than playback — VTM at default settings encodes 4K video at well under 1 frame per minute on a 16-core workstation. The shortcuts that follow are what makes shipping software possible.
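
The shape of the loop can be sketched in a few lines of Python. This is a toy under loud assumptions: the block is one-dimensional, the transform is the identity, RDOQ is omitted, and the rate comes from a made-up bit-length table rather than a real entropy coder. Every name here is hypothetical; only the structure (predict, quantise, count bits, reconstruct, compute J, keep the minimum) mirrors the steps above.

```python
import math


def quantise(residual, step):
    """Scalar quantisation; halves round toward positive infinity."""
    return [math.floor(r / step + 0.5) for r in residual]


def level_bits(level):
    """Toy rate model (assumption): 1 bit for a zero, longer codes for larger levels."""
    if level == 0:
        return 1
    return 2 * (abs(level).bit_length() - 1) + 3


def evaluate(original, prediction, header_bits, step, lam):
    """Run one candidate through the loop and return (J, D, R)."""
    residual = [o - p for o, p in zip(original, prediction)]
    levels = quantise(residual, step)                 # identity transform assumed
    rate = header_bits + sum(level_bits(l) for l in levels)
    recon = [p + l * step for p, l in zip(prediction, levels)]
    dist = sum((o - r) ** 2 for o, r in zip(original, recon))  # SSD
    return dist + lam * rate, dist, rate


def full_rdo(original, candidates, step, lam):
    """Evaluate every candidate and return (name, J, D, R) of the cheapest."""
    best = None
    for name, (prediction, header_bits) in candidates.items():
        j, dist, rate = evaluate(original, prediction, header_bits, step, lam)
        if best is None or j < best[1]:
            best = (name, j, dist, rate)
    return best


original = [52, 54, 57, 61, 66, 70, 73, 75]
candidates = {
    # name: (predicted samples, bits assumed for signalling the mode)
    "dc": ([63] * 8, 2),
    "ramp": ([50, 54, 58, 62, 66, 70, 74, 78], 4),
}

best = full_rdo(original, candidates, step=4, lam=27.2)
print(best)
```

In this example the ramp predictor leaves a smaller residual, so it wins on both distortion and rate despite the larger mode header. The expensive part in a real encoder is that every call to `evaluate` involves a real transform, a real entropy coder and a real reconstruction.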

How real encoders cheat — fast mode decision

A production encoder has a quality budget (don't lose more than X% BD-rate against full RDO) and a speed budget (run faster than Y times real-time on Z cores). Mode decision is where almost all the speed-vs-quality trade-off lives. Six families of shortcuts dominate.

Pre-filtering candidates with a cheap cost. Before running full RDO on a candidate, the encoder computes a cheap surrogate cost — typically the Sum of Absolute Transformed Differences (SATD), which is the L1 norm of the Hadamard-transformed residual. SATD is roughly 50× cheaper than full RDO and correlates well with the final rate-distortion cost. The encoder ranks all candidates by SATD, keeps the top-K (K is typically 3–8), and only runs full RDO on those. This is the dominant trick in fast intra mode decision: instead of running full RDO on all 35 HEVC intra directions, x265 runs SATD on all 35, keeps the best 3, and runs full RDO only on those 3. Quality loss is typically below 0.2% BD-rate; speedup is roughly 10×.
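
A minimal sketch of the SATD pre-filter on a 4×4 block, in plain Python. The Hadamard transform is left unnormalised, which is fine for ranking; the four predictions and their names are invented for the example and do not correspond to any encoder's actual predictors.

```python
def hadamard4(v):
    """Unnormalised 4-point Hadamard transform of a length-4 sequence."""
    a, b, c, d = v
    s0, s1, d0, d1 = a + b, c + d, a - b, c - d
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


def satd4x4(orig, pred):
    """Sum of absolute values of the 2-D Hadamard-transformed residual."""
    diff = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]
    rows = [hadamard4(r) for r in diff]            # transform each row
    cols = [hadamard4(c) for c in zip(*rows)]      # then each column
    return sum(abs(x) for col in cols for x in col)


def shortlist(orig, predictions, k=3):
    """Rank candidates by SATD and keep the k cheapest for full RDO."""
    ranked = sorted(predictions, key=lambda name: satd4x4(orig, predictions[name]))
    return ranked[:k]


# A block whose columns are constant, so vertical prediction fits exactly.
orig = [[10, 20, 30, 40] for _ in range(4)]
predictions = {
    "vertical": [[10, 20, 30, 40] for _ in range(4)],
    "planar": [[12, 21, 29, 38] for _ in range(4)],
    "dc": [[25] * 4 for _ in range(4)],
    "horizontal": [[12] * 4 for _ in range(4)],
}

print({name: satd4x4(orig, p) for name, p in predictions.items()})
print(shortlist(orig, predictions, k=2))
```

Only the shortlisted names would go on to the full loop; the other candidates are dropped after a handful of additions and absolute values each.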

Early termination on the parent block. When the encoder is deciding whether to split a 64×64 CTU into four 32×32 children, it first encodes the whole 64×64 with the best non-split mode. If the resulting J is "good enough" (below a content-adaptive threshold derived from neighbour costs), the encoder skips the recursive search and commits the 64×64 mode. The same trick applies at every level of the quadtree. x265 calls this early-CU termination; related pruning is controlled by options such as --limit-modes, --limit-refs, and --early-skip.
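
The recursion with early termination can be sketched as follows. The cost function and threshold are stand-ins (a single "hard" pixel makes any block containing it expensive), split-flag bits are ignored, and none of the names come from a real encoder.

```python
def decide_split(block_cost, size, x, y, threshold, min_size=8):
    """Return (J, tree). A leaf is (x, y, size); a split is a list of four subtrees."""
    j_whole = block_cost(x, y, size)
    # Early termination: accept the unsplit block if it is already cheap enough.
    if size <= min_size or j_whole <= threshold(size):
        return j_whole, (x, y, size)
    half = size // 2
    j_split, children = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            j, sub = decide_split(block_cost, half, x + dx, y + dy, threshold, min_size)
            j_split += j
            children.append(sub)
    # Split-flag signalling cost is ignored in this sketch.
    if j_split < j_whole:
        return j_split, children
    return j_whole, (x, y, size)


DETAIL = (40, 8)  # hypothetical position of a hard-to-predict feature


def toy_cost(x, y, size):
    """Assumed cost model: blocks covering DETAIL are 8x more expensive per pixel."""
    has_detail = x <= DETAIL[0] < x + size and y <= DETAIL[1] < y + size
    return size * size * (4.0 if has_detail else 0.5)


def toy_threshold(size):
    """Assumed 'good enough' threshold, proportional to block area."""
    return size * size * 1.0


calls = []


def counted_cost(x, y, size):
    calls.append((x, y, size))
    return toy_cost(x, y, size)


total_j, tree = decide_split(counted_cost, 64, 0, 0, toy_threshold)
print(total_j, len(calls))
```

In this toy the search evaluates 13 blocks, whereas testing every block at every depth from 64×64 down to 8×8 would take 1 + 4 + 16 + 64 = 85 evaluations. The recursion only descends along the one branch that contains the difficult feature.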

Spatial inheritance. Neighbouring blocks are correlated. If the block to the left chose the horizontal intra direction, the current block is much more likely to also choose horizontal. Most encoders bias the search toward neighbours: at speed-optimised presets, they only test the modes that neighbours chose, plus a small handful of "always test" defaults. The bias is unsafe near edges and at scene cuts, so encoders override it under detected scene changes.
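
One possible shape for the neighbour-biased candidate list, as a sketch only. The mode names follow HEVC's numbering (planar, DC, angular 2 to 34), but the function, the choice of "always test" defaults and the scene-cut flag are assumptions for illustration.

```python
ALWAYS_TEST = ("planar", "dc")  # assumed "always test" defaults

ALL_MODES = ["planar", "dc"] + [f"angular_{i}" for i in range(2, 35)]  # 35 modes


def neighbour_candidates(left, above, all_modes=ALL_MODES, scene_cut=False):
    """Return the intra modes to test, biased toward what the neighbours chose."""
    if scene_cut:
        # Neighbour statistics are unreliable here, so fall back to the full list.
        return list(all_modes)
    ordered = [m for m in (left, above) if m is not None] + list(ALWAYS_TEST)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [m for m in dict.fromkeys(ordered) if m in all_modes]


print(neighbour_candidates("angular_10", "angular_10"))
print(neighbour_candidates(None, "dc"))
print(len(neighbour_candidates("angular_10", "angular_26", scene_cut=True)))
```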

Lookahead and complexity classification. A separate pass running tens of frames ahead of the encoder classifies each frame's complexity. Simple flat content with little motion gets a lighter mode search; complex high-motion content gets a heavier one. x264 calls this mb-tree; x265 calls it CU-tree. The classifier is the reason a static webcam stream encodes at a fraction of the cost of a sports broadcast at the same resolution.

Multi-pass tuning. Two-pass and three-pass encoders use the first pass to learn each region's complexity and then run a tighter RDO loop in the final pass, spending more cycles where it matters. Production VOD pipelines (Netflix, YouTube, Disney+) run as many as 6 passes per title; live encoders run a single-pass with aggressive lookahead instead.

Per-block QP modulation. Within RDO, the encoder can adjust QP up or down at the block level (within the limits the syntax allows). A lower block QP means a finer quantisation step and a smaller λ, so more bits are spent on that block; encoders use this on faces, on regions identified as visually important by saliency models, and on the centre of the picture for VR content. This is the foundation of per-title and per-scene encoding used in modern VOD pipelines.

The combined effect is a search that is hundreds of times faster than full RDO and gives up only a few percent of BD-rate. The fast-preset tax on x265 medium (the default) is roughly 4–6% BD-rate against --preset placebo; the tax on x265 ultrafast is roughly 25–30%. Inside SVT-AV1 the same pattern holds: preset 8 is roughly 4% behind preset 4 (Streaming Learning Center benchmarks), and preset 12 is 15–20% behind. Most of that delta lives in mode decision; transform, quantization and entropy coding contribute much less.

Figure 3. Full RDO evaluates every candidate with the actual entropy coder and the actual reconstruction. Production fast RDO uses a cheap SATD pre-filter and aggressive early-termination to evaluate only a handful of candidates per CTU at the cost of a few percent in BD-rate.

RDOQ — bringing RDO down inside the quantiser

A second place where RDO appears, often overlooked, is inside quantisation itself. After the transform, every coefficient is divided by the quantisation step and rounded to an integer. Naive rounding always picks the nearest integer. Rate-distortion optimised quantization (RDOQ) does something different: for every significant coefficient, it considers two or three nearby integer levels and picks the level that minimises J for that one coefficient, taking into account the bit cost of coding it in the entropy coder.

A worked example. Suppose a transform coefficient, after division by the quantisation step, lands at 4.4. Naive rounding gives 4. RDOQ asks: "If I instead encode 3, the distortion goes up by ((4.4 − 3)^2 − (4.4 − 4)^2) × step^2 = ((1.4)^2 − (0.4)^2) × step^2 = 1.8 × step^2 units, but the entropy coder needs about 0.3 fewer bits to code a 3 than a 4 in this context. At λ_coeff = 27, the rate term of J drops by 0.3 × 27 = 8.1 units. If 1.8 × step^2 < 8.1 (i.e. step < 2.12), then 3 wins. Otherwise 4 wins."
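
The same decision as a sketch in Python. It handles a single non-negative coefficient, compares only the nearest level and the one below it, and takes the bit costs from a hypothetical lookup table in which a 3 is 0.3 bits cheaper than a 4; a real RDOQ reads those costs from the entropy coder's current context state.

```python
import math


def rdoq_choose(scaled, step, lam, bits):
    """Pick the level (nearest, or nearest minus 1) with the lowest J.

    scaled: coefficient divided by the quantisation step (non-negative)
    bits:   mapping level -> estimated bits to code it (assumed given)
    """
    nearest = math.floor(scaled + 0.5)
    candidates = [lvl for lvl in (nearest, nearest - 1) if lvl >= 0]

    def cost(level):
        distortion = ((scaled - level) * step) ** 2
        return distortion + lam * bits[level]

    return min(candidates, key=cost)


toy_bits = {3: 4.7, 4: 5.0}  # hypothetical costs: a 3 is 0.3 bits cheaper

# step = 2 is below the 2.12 break-even, so the cheaper level 3 wins.
print(rdoq_choose(4.4, step=2.0, lam=27.0, bits=toy_bits))
# step = 3 is above it, so the extra distortion outweighs the saving and 4 wins.
print(rdoq_choose(4.4, step=3.0, lam=27.0, bits=toy_bits))
```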

RDOQ also drives trellis coefficient zeroing: if the last few non-zero coefficients of a block contribute less to D than they cost in R, zero them. The trellis viewpoint matters because CABAC's context-adaptive coder makes the cost of one coefficient depend on the coefficients around it; the optimal pattern of zeros and nonzeros is found with dynamic programming, not coefficient by coefficient.

RDOQ is what x265 enables at --rd 4 and above and what HEVC HM uses by default. It buys roughly 1–3% BD-rate on top of the surrounding mode-decision RDO. Per-coefficient cost goes up by about 30–40%, which is why disabling RDOQ is the first thing every low-latency encoder does. SVT-AV1 has a similar mechanism it calls trellis quantization; aomenc's term is trellis coefficient optimization, controlled by --disable-trellis-quant. The Ramos et al. 2015 paper "Rate-distortion optimized quantization in HEVC: Performance limitations" is the standard reference for the maths and the limits.

The eight knobs you actually tune

In production, you almost never touch λ directly. You tune the encoder preset and a small number of knobs that change the depth of the mode search. The following table covers the major encoders.

Encoder	Primary depth knob	RDO-quantization knob	Bitrate impact of going one step deeper	Typical CPU impact
x264 (H.264)	`--preset` (ultrafast → placebo)	`--trellis 0/1/2`	0.5–2% per preset step	~1.5× per preset step
x265 (H.265)	`--preset` and `--rd 0–6`	`--rdoq-level 0/1/2`	1–3% per `--rd` step	~1.3× per `--rd` step
libvpx (VP9)	`--cpu-used 0–9`	`--tune-content default/screen/film`	2–5% per `cpu-used` step	~1.4× per `cpu-used` step
libaom (AV1)	`--cpu-used 0–8`	`--disable-trellis-quant` (off = with RDOQ)	2–6% per `cpu-used` step	~1.5× per `cpu-used` step
SVT-AV1 (AV1)	`--preset 0–13`	`--rdoq-level 0/1`	2–5% per preset step in the 4–8 range	~1.4× per preset step
VVenC (VVC)	`--preset faster → slower`	`--rdoq 0/1/2`	3–7% per preset step	~1.7× per preset step
VTM (VVC)	reference encoder, no presets	`RDOQ`, `SignHideFlag` config keys	reference-encoder territory, 10×+ slower than VVenC	n/a

The pattern is the same everywhere: one preset-shaped knob controls the breadth of the search and the depth of RDO, one secondary knob controls RDOQ on or off, and one or two tertiary knobs control tunings like psy-rd (x264, x265) and tune (libvpx, SVT-AV1).

Where it goes wrong — the two common failure modes

Failure mode 1: lambda is wrong for the content. The default λ_mode tables in every encoder were fit on test sets that are heavy on natural content and light on synthetic content. Screen content (presentations, video games, software demos) has very different rate-distortion characteristics: sharp edges, large flat regions, repeated text. The default λ rewards aggressive transform coding on screen content where it should be running palette mode or intra-block-copy. The fix in HEVC and VVC is SCC tools (Screen Content Coding); the fix in encoders is a screen-content tuning such as --tune-content screen (libvpx, libaom) or an equivalent. Without it, screen content at typical QPs ships 15–20% over its lower-envelope bitrate.

Failure mode 2: SATD pre-filter culls the eventual winner. SATD is fast but it ignores entropy coding context, so it can be misled when the actual best candidate has a high SATD score because its residual has an unusual energy distribution but very compressible coefficients. The result is "missed mode" artefacts: the encoder converges on a slightly wrong intra direction on edges, producing visible jagged outlines along high-contrast diagonals. Many encoders expose a knob that widens or disables the fast pre-filter for problematic content (x265's --no-fast-intra is one example); the cost is real (5–15% CPU), but on titles where it matters the quality impact is visible. This is one of the more common reasons studios push back on a "faster preset" mandate from operations.

A pitfall worth calling out: don't disable RDO entirely on a fast preset, but don't compare presets at the same CRF either. RDO and CRF interact — a fast preset at the same CRF runs at a higher bitrate (because the encoder is making worse choices and needing more bits to compensate). Compare presets at the same bitrate or at the same VMAF, not at the same CRF.

Where Fora Soft fits in

We ship video pipelines for streaming, OTT/Internet TV, video surveillance, e-learning, telemedicine, and conferencing. In every one of those verticals the right RDO setting is different. A telemedicine stream that must run on a 4-core ARM box at 30 fps and 1080p needs a fast preset with RDO trimmed to a tight CTU-level mode list and RDOQ on but minimal. An OTT VOD encode for a streaming service runs at SVT-AV1 preset 4 or x265 --rd 5 with full RDOQ and saves 5–8% bandwidth per title — at scale, that is real money. A surveillance recorder writes 24/7, so RDO needs to chase compute efficiency more aggressively than quality. We measure the trade with internal A/B BD-rate harnesses against ground-truth content from each vertical, not against generic UVG clips.

Looking forward — neural RDO and learned mode decision

The next frontier is replacing the inner loops of RDO with neural networks. Two directions are visible in the 2024–2025 research literature. The first, surveyed by Zhang et al. (CVPR 2025, "Balanced Rate-Distortion Optimization in Learned Image Compression"), is balanced RDO in fully learned image and video codecs, where an end-to-end neural codec is trained with a Lagrangian objective and the network learns its own implicit mode decisions; AV2 will have neural-coded paths along these lines (see the future of neural codecs). The second is drop-in neural mode-decision aides for traditional encoders: a small CNN looks at the block and predicts which 3 of the 67 VVC intra modes to short-list for full RDO, replacing the SATD pre-filter. The 2024 ECCV "Learned Rate Control for Frame-Level Adaptive Neural Video Compression" paper reports an average 14.8% BD-rate improvement over the conventional rate-control + RDO pipeline at the frame level; smaller block-level aides typically buy 1–2% on top of a strong baseline.

The classical RDO framework is not going away. Sullivan and Wiegand's λ-cost is still the loss function for the neural variants. What is changing is the search policy — the procedural decision about which candidates to test and in what order — which is exactly the part that fast-preset shortcuts have been approximating for thirty years. Replacing those shortcuts with a learned policy is a high-leverage area for the next codec generation.

References

G. J. Sullivan and T. Wiegand. "Rate-distortion optimization for video compression." IEEE Signal Processing Magazine, Vol. 15, No. 6, November 1998. — The foundational paper that introduced the Lagrangian framework to video coding.
T. Wiegand and B. Girod. "Lagrange multiplier selection in hybrid video coder control." Proc. ICIP 2001. — Original analysis of λ as a function of QP for hybrid codecs.
A. Tourapis et al. "Rate distortion optimization for H.264 interframe coding: a general framework and algorithms." IEEE Trans. Image Processing, 2007. — Comprehensive RDO framework for H.264/AVC inter prediction.
ITU-T Recommendation H.265 / ISO/IEC 23008-2 (HEVC), v8, 2023. — The HEVC specification.
ITU-T Recommendation H.266 / ISO/IEC 23090-3 (VVC), v3, 2024. — The VVC specification.
AOMedia. AV1 Bitstream & Decoding Process Specification, v1.0.0 errata 1, 2019. — AV1 normative reference.
M. Ramos et al. "Rate-distortion optimized quantization in HEVC: Performance limitations." Proc. PCS 2015. — Quantitative analysis of RDOQ bounds.
x265 project. Command Line Options — x265 documentation. https://x265.readthedocs.io/en/master/cli.html (accessed 2026-05-17). — Authoritative reference for --rd, --rdoq-level, presets.
AOMedia SVT-AV1 project. SVT-AV1 documentation, v2.4, 2026. — Preset taxonomy and RDOQ knobs.
"The disparity between optimal and practical Lagrangian multiplier estimation in video encoders." Frontiers in Signal Processing, 2023. — Current restatement of the Sullivan/Wiegand framework.
Y. Zhang et al. "Balanced Rate-Distortion Optimization in Learned Image Compression." Proc. CVPR 2025. — Neural-codec view of the same λ-cost.
"Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network." Proc. ECCV 2024. — Neural rate control on top of traditional RDO.
Streaming Learning Center. "Choosing a Preset for SVT-AV1 and libaom-AV1." 2024. — Production BD-rate-vs-preset benchmarks.
