Why this matters
If you encode video at scale, the bitrate ladder is the single biggest lever you have over both your bandwidth bill and what your viewers actually see. This article is for the streaming or encoding lead, the platform engineer, or the technical product owner who has heard "per-title encoding saves 20%" and wants to understand why from the measurement side — how a quality number, not a guessed bitrate, decides each rung. Get the quality target right and you stop overspending on easy content (a cartoon does not need 6 Mbps) and stop under-serving hard content (a grainy sports clip blocks up at the same 6 Mbps). Get it wrong and you either waste money on quality nobody can see, or ship visible artifacts because a fixed ladder never asked whether the picture was good enough. The goal here is to give you the quality-driven mental model so the convex hull, the quality target, and the CI gate in the rest of Block 5 read as one system.
The one-size-fits-all ladder, and why it fails
Start with the thing per-title encoding replaced. For most of streaming's history, a service picked one bitrate ladder — a fixed list of resolution-and-bitrate pairs, like 1080p at 5800 kbps, 720p at 3000 kbps, down to 240p at 300 kbps — and ran every title through it. The ladder is the set of renditions a player can adaptively switch between as bandwidth changes; the term for that switching machinery is adaptive bitrate (ABR) streaming, and the ladder is its menu.
The fixed ladder makes one quiet assumption: that every video is about equally hard to compress. That assumption is wrong, and it is wrong by a lot. Netflix made the point concrete when it introduced per-title encoding in December 2015: with its old fixed ladder, for some hard-to-compress titles "the highest 5800 kbps stream would still exhibit blockiness," while "for simple content like cartoons, 5800 kbps is far more than needed to produce excellent 1080p encodes" (Netflix Technology Blog, Per-Title Encode Optimization, 2015). The same ladder over-served the cartoon and under-served the action scene. Worse, a viewer on a constrained connection getting 1750 kbps might have been able to watch that cartoon in full HD, but the fixed ladder had pinned 1750 kbps to standard definition for everyone.
How different is "different"? Netflix encoded about 100 of its titles at a constant quality setting (fixed QP) and plotted quality against bitrate. At 2 Mbps, the hardest titles scored around 38 dB PSNR (peak signal-to-noise ratio, the number that compares a compressed frame to the original pixel by pixel, where higher is closer to the original), while the easiest titles topped 48 dB at the same bitrate (Netflix 2015, via Streaming Media, 2016). Ten decibels of PSNR is not a rounding error; it is the difference between visible blocking and a picture indistinguishable from the source. For orientation, Netflix noted that below about 35 dB its content showed encoding artifacts, and above about 45 dB there was no perceptible improvement left to capture. One ladder cannot sit in the right place for titles that span that range.
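For reference, the standard definition of PSNR (for 8-bit video, the peak value MAX is 255) and what a 10 dB gap implies about the underlying pixel error:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2,
\qquad
\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\ \text{dB}

\mathrm{PSNR}_{\text{easy}} - \mathrm{PSNR}_{\text{hard}}
  = 10\log_{10}\frac{\mathrm{MSE}_{\text{hard}}}{\mathrm{MSE}_{\text{easy}}}
  \approx 48 - 38 = 10
\;\Longrightarrow\;
\frac{\mathrm{MSE}_{\text{hard}}}{\mathrm{MSE}_{\text{easy}}} \approx 10
```

In other words, at the same 2 Mbps the hardest titles carried roughly ten times the mean squared pixel error of the easiest ones.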
Figure 1. The reframe at the heart of per-title encoding. A fixed ladder holds the bitrate constant and lets quality float — so an easy title is over-served and a hard title is under-served at the same rung. A quality-target ladder holds the quality constant and lets the bitrate float, giving each title the rungs it actually needs.
The deeper problem is not the specific numbers; it is what the fixed ladder chooses to hold constant. A fixed ladder fixes the bitrate and lets the quality float. It guarantees that the 1080p rung is always 5800 kbps and makes no promise at all about how good that rung looks — which is exactly backwards from what anyone actually cares about. Nobody watching wants a guaranteed bitrate; they want a guaranteed picture. That observation is the whole idea behind per-title encoding, and it is easiest to see if you flip the question around.
The flip: fix the quality, let the bitrate float
Here is the reframe that makes per-title encoding click. Instead of fixing the bitrate and accepting whatever quality falls out, fix the quality and accept whatever bitrate is needed to reach it. Pick a target — say, "the top rung should look essentially indistinguishable from the source" — express it as a number on a perceptual metric, and then let each title's bitrate land wherever it must to hit that number. The cartoon hits the target at 2 Mbps; the sports clip needs 7 Mbps for the same target. Same quality promise, different bitrates, because the content is different.
That is per-title encoding, defined from the quality side: choosing each title's bitrate ladder so that every rung meets a quality goal, rather than a bitrate goal. The metric that makes this practical is VMAF — Video Multi-method Assessment Fusion, the perceptual quality metric Netflix built and open-sourced, scored 0 to 100, where a higher score means the encode looks closer to the original to a human eye (covered in full in VMAF explained). VMAF matters here because it was trained to track human opinion, so "target VMAF 95" is a statement about how the video looks, not about how many bits it spends. PSNR worked for the original 2015 demonstration, but a perceptual metric is what makes a quality target mean what you want it to mean.
This is the single most important sentence in the article, so it is worth stating plainly: in per-title encoding, the quality target is the input and the bitrate is the output. Everything else — how many rungs, which resolution, what CRF value — is machinery for hitting that target efficiently. Once you hold that, the rest of per-title encoding is just careful bookkeeping.
It also draws the boundary for this article. The mechanics of how the encoder produces a given rung — the bitrate-ladder construction, the encoder settings, the per-scene plumbing — belong to the Video Encoding section's per-title article. This article stays on the measurement side: how a quality number decides what the ladder should be. We measure the result; we link out for how to compress.
How a quality target builds a ladder
Turning a quality target into an actual ladder is a three-step process: set the top rung, space the rungs below it, and pick the resolution for each rung. The clearest public walk-through of this is Jan Ozer's, derived from the Netflix method (Streaming Learning Center, 2021); the logic below follows it and names where each number comes from.
Step 1 — Set the top rung from the quality target
The top rung is the highest-quality rendition you offer — the one a viewer with plenty of bandwidth receives. In a quality-driven ladder you do not pick its bitrate; you pick its score and find the lowest bitrate that reaches it. The widely used target band is VMAF 93 to 95, and both ends trace to subjective studies rather than folklore.
The lower end, 93, comes from a RealNetworks study that correlated VMAF against human ratings on 4K clips: at roughly VMAF 93, a service "would be confident of optimally serving the vast majority of their audience with content that is either indistinguishable from original or with noticeable but not annoying distortion" — a mean opinion score between 4 and 5 (Rassool, VMAF Reproducibility, IEEE, 2017). The upper end, 95, comes from a study by researchers at RheinMain University and the streaming service Joyn, who found that VMAF 95 is the lowest score "at which a video signal is on average subjectively indistinguishable from the original video signal" (Kah et al., SPIE, 2021). Pick 93 if you will accept "noticeable but not annoying," 95 if you want "indistinguishable." Either way, the bitrate that achieves it is an output of measuring the title, not a number you set in advance.
Common mistake: a top rung above VMAF 95. If your highest rung scores 97 or 98, you are spending bandwidth on quality no viewer can see. Multiple subjective studies put the ceiling of useful quality at around 95 for large-screen viewing; above it, the extra bits buy a difference below the threshold of human perception. The fix is not to "encode at high quality" — it is to find the lowest bitrate that reaches the target and stop there. On a hard 4K clip that one correction can save a couple of megabits per second on the most-watched rung.
A practical caveat: some titles are so hard that hitting VMAF 95 would blow past a sane bitrate ceiling. Most services therefore set a maximum top-rung bitrate (say 6 Mbps for 1080p H.264) and accept a slightly lower score on the rare clip that needs more. The target is a goal, not a contract you bankrupt yourself honouring.
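Step 1 reduces to a search over measured points. The sketch below is a minimal version. It assumes you already have (bitrate in kbps, VMAF) pairs from test encodes of one title at one resolution, that VMAF rises with bitrate, and that straight-line interpolation between neighbouring encodes is accurate enough. The function name `top_rung_kbps` and every number are hypothetical; they are chosen to mirror the two titles in the worked example below.

```python
def top_rung_kbps(curve, target, cap_kbps=None):
    """Lowest bitrate (kbps) whose interpolated VMAF reaches `target`.

    curve: (bitrate_kbps, vmaf) pairs from test encodes at one resolution.
    Returns (kbps, hit_target). Assumes VMAF rises monotonically with bitrate.
    """
    pts = sorted(curve)
    needed = None
    if pts[0][1] >= target:
        # The cheapest measured encode already meets the target; a denser
        # grid below it might find an even lower bitrate.
        needed = float(pts[0][0])
    else:
        for (b0, v0), (b1, v1) in zip(pts, pts[1:]):
            if v0 < target <= v1:
                # Linear interpolation between the two encodes that bracket the target.
                needed = b0 + (target - v0) * (b1 - b0) / (v1 - v0)
                break
    if needed is None:
        # Target never reached in the measured range: settle for the cap
        # (or the highest measured encode) and report the miss.
        return float(cap_kbps if cap_kbps is not None else pts[-1][0]), False
    if cap_kbps is not None and needed > cap_kbps:
        # Hard title: honour the bitrate ceiling and accept a lower score.
        return float(cap_kbps), False
    return needed, True


# Hypothetical 1080p measurements.
easy_animation = [(1000, 88.0), (1500, 92.5), (2000, 95.0), (3000, 95.8), (5800, 96.3)]
night_football = [(3000, 86.0), (5000, 91.0), (6000, 93.0), (7500, 95.0), (9000, 96.0)]

print(top_rung_kbps(easy_animation, 95))                 # (2000.0, True)
print(top_rung_kbps(night_football, 95))                 # (7500.0, True)
print(top_rung_kbps(night_football, 95, cap_kbps=6000))  # (6000.0, False)
```

The `False` flag on the capped call is the "accept a slightly lower score" case. It tells you which titles shipped below target, so you can review them.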
Step 2 — Space the rungs so switching stays invisible
Below the top rung, each lower rung exists so a player on less bandwidth can still get the best possible picture for its connection. Two rules govern the spacing, and a good ladder respects both.
The quality rule comes from how finely viewers can perceive a VMAF difference. The RheinMain/Joyn study found viewers could not reliably tell two renditions apart when their VMAF scores were within about 2 points; beyond that, differences became visible (Kah et al., 2021). So if you want adaptive switches to be unnoticeable, neighbouring rungs should sit no more than roughly 2 VMAF apart. Set that against a coarser, better-known figure: Netflix's 2017 guidance that about 6 VMAF points make one just-noticeable difference (JND), a change viewers notice more often than not. Two points is conservative (switches stay invisible); six points is the threshold at which a switch becomes noticeable. Pick your spacing inside that band, depending on how much you value smooth switching over a smaller rung count.
The bitrate rule comes from ABR mechanics, and it pulls the other way. Apple's long-standing guidance (Technical Note TN2224, since folded into the HLS Authoring Specification) is that adjacent bitrates should be a factor of 1.5 to 2 apart. Too close together and you add renditions that barely differ, multiplying encoding and storage cost for no benefit; too far apart and a player can get stranded on a low rung when it had bandwidth for much better. A common compromise is to multiply each rung's bitrate by about 0.6 going down, giving roughly a 1.66× step.
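The x0.6 rule of thumb turns easily into a starting ladder. The sketch below is minimal and assumes a hypothetical 6000 kbps top rung and a 300 kbps floor. In a quality-driven ladder, the top would come from Step 1, and each rung would then be checked against the quality rule.

```python
def geometric_ladder(top_kbps, floor_kbps, step_down=0.6):
    """Bitrates (kbps) from the top rung down, each about step_down times the one above."""
    rungs = []
    bitrate = float(top_kbps)
    while bitrate >= floor_kbps:
        rungs.append(round(bitrate))
        bitrate *= step_down
    return rungs


print(geometric_ladder(6000, 300))  # [6000, 3600, 2160, 1296, 778, 467]
step = 1 / 0.6
print(f"step = {step:.3f}x, inside the 1.5 to 2x band: {1.5 <= step <= 2.0}")
```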
These two rules interact. The quality rule says "do not let neighbours drift more than ~2 VMAF apart"; the bitrate rule says "do not pack them closer than ~1.5× in bitrate." On easy content the bitrate rule usually binds (few rungs needed); on hard content the quality rule binds (you need more rungs to keep the steps small). The number of rungs is therefore itself per-title. The same RheinMain/Joyn analysis showed how far this can go: to keep every neighbour within 2 VMAF across a full range, an ideal ladder could need 13 rungs for a paid service (floor VMAF 70) or 21 for a free one (floor VMAF 55) — far more than most services ship, which is a deliberate cost trade-off, not an oversight.
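Checking a candidate ladder against both rules takes one pairwise pass over neighbouring rungs. In the minimal sketch below, the default thresholds are the ~2 VMAF and 1.5 to 2x figures above. The function name is hypothetical, and the VMAF values are invented for illustration.

```python
def spacing_violations(ladder, max_vmaf_gap=2.0, min_ratio=1.5, max_ratio=2.0):
    """Flag neighbouring rungs that break the quality or bitrate spacing rule.

    ladder: (kbps, vmaf) pairs ordered from the top rung down.
    Returns a list of (rule, upper_kbps, lower_kbps) tuples.
    """
    issues = []
    for (hi_kbps, hi_vmaf), (lo_kbps, lo_vmaf) in zip(ladder, ladder[1:]):
        ratio = hi_kbps / lo_kbps
        if hi_vmaf - lo_vmaf > max_vmaf_gap:
            issues.append(("vmaf_gap", hi_kbps, lo_kbps))   # switch may be visible
        if ratio < min_ratio:
            issues.append(("too_close", hi_kbps, lo_kbps))  # near-duplicate rendition
        elif ratio > max_ratio:
            issues.append(("too_far", hi_kbps, lo_kbps))    # player may get stranded
    return issues


# Hypothetical hard title on a x0.6 ladder: one step loses 3.1 VMAF.
candidate = [(6000, 95.0), (3600, 93.2), (2160, 90.1), (1296, 88.5)]
print(spacing_violations(candidate))  # [('vmaf_gap', 3600, 2160)]
```

The usual remedy for the flagged step is an extra rung between 3600 and 2160 kbps. A rung near 2790 kbps, however, would sit only about 1.29x from each neighbour, below the 1.5x floor. That is the tension between the two rules in concrete form.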
Step 3 — Pick the resolution for each rung: the convex hull
The last step answers a question the fixed ladder never asked: at a given bitrate, which resolution looks best? The answer is not always "the highest one." At a low bitrate, a 1080p encode has to spread its few bits across four times as many pixels as a 540p encode, so it often looks worse — blockier, softer — than the lower resolution scaled up to fit the screen. At a high bitrate, the higher resolution pulls ahead because it finally has the bits to render the extra detail.
Plot quality against bitrate for several resolutions and you get one curve per resolution, each crossing the others. The upper envelope of all those curves — the line that, at every bitrate, follows whichever resolution is currently winning — is the convex hull. Each rung of a per-title ladder sits on that hull: you choose the bitrate from steps 1 and 2, then take the resolution whose curve is on top at that bitrate. This is the geometry behind per-title encoding, and it gets its own full treatment with the worked plot in the convex hull article — here it is enough to know that "best resolution per rung" is a measured result, not a guess.
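The hull itself is a small computation over the measured grid. The sketch below uses hypothetical (bitrate, VMAF) points for three resolutions. It computes the upper hull with Andrew's monotone-chain algorithm, which is one standard method; how Netflix's own tooling does it is not described here. A convex hull is slightly stricter than the upper envelope: it drops an encode that lies below the straight line joining its neighbours on the hull, as happens to the 3000 kbps points here.

```python
def upper_hull(curves):
    """Upper convex hull of all (kbps, vmaf) points, labelled by resolution.

    curves: {resolution_label: [(kbps, vmaf), ...]}
    Returns hull points left to right as (kbps, vmaf, label).
    """
    # At each bitrate keep only the best-scoring resolution.
    best = {}
    for label, points in curves.items():
        for kbps, vmaf in points:
            if kbps not in best or vmaf > best[kbps][1]:
                best[kbps] = (kbps, vmaf, label)

    hull = []
    for p in sorted(best.values()):
        # Pop the last hull point while it lies on or below the chord
        # from the point before it to p (a non-right turn).
        while len(hull) >= 2:
            (x1, y1, _), (x2, y2, _) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


# Hypothetical measurements for one title.
curves = {
    "540p":  [(500, 70.0), (1000, 80.0), (2000, 86.0), (4000, 88.0)],
    "720p":  [(500, 64.0), (1000, 78.0), (2000, 88.0), (3000, 90.0), (4000, 92.0)],
    "1080p": [(500, 55.0), (1000, 72.0), (2000, 87.0), (3000, 91.0), (4000, 95.0)],
}
for kbps, vmaf, label in upper_hull(curves):
    print(kbps, vmaf, label)
# 500 70.0 540p
# 1000 80.0 540p
# 2000 88.0 720p
# 4000 95.0 1080p
```

The pattern matches the prose: at low bitrates 540p wins, in the middle 720p wins, and 1080p pulls ahead only once it has the bits.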
Figure 2. From one quality target to a full ladder. The target sets the top rung (lowest bitrate reaching the score); the quality rule (≤ ~2 VMAF between neighbours) and the bitrate rule (1.5–2× apart) together space the lower rungs; and the convex hull picks the best resolution at each rung's bitrate.
A worked example: one target, two very different ladders
Numbers make the saving concrete. Take two titles and give them the same quality promise — a top rung at VMAF 95 — and watch the bitrates diverge.
Title A, an animated short. Flat colour areas, clean edges, little motion: easy to compress. Brute-force encoding finds that 1080p reaches VMAF 95 at about 2.0 Mbps; beyond that the curve is flat, so more bits buy nothing. Top rung: 1080p, 2.0 Mbps.
Title B, a grainy night football clip. Film grain, fast motion, crowds: hard to compress. The same 1080p resolution does not reach VMAF 95 until about 7.5 Mbps. Top rung: 1080p, 7.5 Mbps.
Now compare both to an old fixed ladder that pinned the 1080p rung at 5800 kbps for everyone. For Title A, the fixed ladder spent 5.8 Mbps on a rung that only needed 2.0, wasting 3.8 Mbps. That is about 66% of the bits on the most-watched rendition, or 2.9 times what the content needed, spent on quality past the point any viewer would notice. For Title B, the fixed ladder's 5.8 Mbps could not reach the target at all; it shipped a visibly worse rung than the 7.5 Mbps the content actually demanded. One ladder produces two failures, in opposite directions. A single quality target fixes both, because it asks each title the only question that matters: what does it take to look this good?
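The arithmetic behind both failures fits in a few lines. The minimal sketch below uses the example's numbers: a 5800 kbps fixed rung, and 2.0 and 7.5 Mbps needed. `compare_to_fixed` is a hypothetical helper, not the companion ladder builder.

```python
FIXED_1080P_KBPS = 5800  # the old fixed ladder's 1080p rung


def compare_to_fixed(needed_kbps, fixed_kbps=FIXED_1080P_KBPS):
    """How a fixed rung compares with what a title needs to hit its quality target."""
    if needed_kbps <= fixed_kbps:
        wasted = fixed_kbps - needed_kbps
        return {"status": "over-served",
                "wasted_kbps": wasted,
                "wasted_share": wasted / fixed_kbps,  # fraction of the rung's bits
                "multiple_of_need": fixed_kbps / needed_kbps}
    return {"status": "under-served", "shortfall_kbps": needed_kbps - fixed_kbps}


title_a = compare_to_fixed(2000)  # animated short
title_b = compare_to_fixed(7500)  # grainy night football
print(title_a["wasted_kbps"], f"{title_a['wasted_share']:.0%}", title_a["multiple_of_need"])
# 3800 66% 2.9
print(title_b)  # {'status': 'under-served', 'shortfall_kbps': 1700}
```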
Figure 3. Same quality target, two ladders. Held to a VMAF 95 top rung, an easy animation reaches the target at about 2.0 Mbps while a hard sports clip needs about 7.5 Mbps. A fixed 5.8 Mbps ladder over-serves the first and under-serves the second; the quality target serves both correctly.
The companion quality-target ladder builder runs exactly this calculation on your own measurement grid: feed it the per-resolution VMAF-versus-bitrate points for an asset, give it a target score and a spacing rule, and it returns the per-title ladder and the bitrate saving against a fixed ladder you specify. It reproduces both titles above in its --demo.
Per-shot encoding: the same idea, one level deeper
A whole title is not uniformly hard. A film has a quiet dialogue scene and then an explosion; a match has a static pre-game shot and then a fast break. Per-title encoding picks one ladder for the whole title, which means it compromises — spending the explosion's bitrate on the dialogue, or vice versa. Per-shot encoding removes that compromise by choosing the encoding recipe per shot instead of per title.
The technique grew out of Netflix's Dynamic Optimizer, a shot-based framework it deployed in 2018. The idea rests on one assumption: frames within a single shot share spatio-temporal characteristics (similar motion, similar detail), so a shot is the natural unit to optimize. Dynamic Optimizer encodes each shot at several resolutions and quantization settings and builds a rate-distortion-optimal convex hull per shot rather than per title. It then picks one point from each shot's hull so that the whole title meets, but does not exceed, a target bitrate while maximizing VMAF (Netflix, Dynamic Optimizer, 2018; Katsavounidis et al., 2018). The optimization objective is the perceptual metric itself: the system is literally choosing encoder settings to maximize measured quality within a bitrate budget.
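Combining per-shot hulls into one title that fits a bitrate budget can be framed as a Lagrangian, or equal-slope, selection. The sketch below illustrates that idea; it is not Netflix's implementation. It assumes equal-length shots, candidate encodes already reduced to each shot's hull, mean VMAF as the title-level score, and a coarse, hand-picked grid of slopes `lambdas`. Every name and number is hypothetical.

```python
def allocate_shots(shots, budget_kbits, lambdas):
    """Pick one encode per shot to maximize mean VMAF within a bit budget.

    shots: per shot, a list of (kbits, vmaf) candidates from its convex hull.
    For each trade-off slope lam, every shot independently takes the candidate
    maximizing vmaf - lam * kbits, so all shots operate at the same slope.
    Returns (mean_vmaf, total_kbits, picks) for the best lam that fits the budget.
    """
    best = None
    for lam in lambdas:
        picks = [max(cands, key=lambda c: c[1] - lam * c[0]) for cands in shots]
        total = sum(kbits for kbits, _ in picks)
        mean_vmaf = sum(vmaf for _, vmaf in picks) / len(picks)
        if total <= budget_kbits and (best is None or mean_vmaf > best[0]):
            best = (mean_vmaf, total, picks)
    return best


# Hypothetical equal-length shots: an easy dialogue scene and a hard explosion.
dialogue = [(400, 90.0), (800, 94.0), (1600, 95.5)]
explosion = [(400, 70.0), (800, 82.0), (1600, 91.0), (3200, 95.0)]
print(allocate_shots([dialogue, explosion], budget_kbits=2400,
                     lambdas=[0.0005, 0.001, 0.002, 0.005, 0.02, 0.05]))
# (92.5, 2400, [(800, 94.0), (1600, 91.0)])
```

Here the explosion gets twice the dialogue's bits. A uniform 800 kbits per shot would score only (94 + 82) / 2 = 88, and a uniform 1600 would exceed the budget.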
The payoff is real. Measured at equal VMAF, per-shot dynamic optimization reduced bitrate by about 28% for x264, 34% for x265, and 38% for VP9 versus fixed-quality encoding in Netflix's published comparison (Katsavounidis et al., 2018). That is roughly a 30% saving, compared with the ~20% Netflix reported for per-title encoding against its old fixed ladder. The two figures use different baselines, so they indicate scale rather than simply adding up. The result: the same content and the same perceived quality, with roughly a third fewer bits.
That saving is not free, and the cost is the reason per-shot is not automatically the right answer. Optimizing each shot means encoding each shot many times — at multiple resolutions and quality points — to find its convex hull. Netflix reported that shot-based optimization increased the number of encode units "by more than two orders of magnitude per encode, per title," which exposed real bottlenecks in its parallel encoding pipeline (Netflix, Optimized shot-based encodes, 2018). A hundred-fold increase in encodes is affordable when a title will be watched tens of millions of times; it is harder to justify for a long-tail catalogue watched a handful of times. The decision is an economic one — encoding compute spent now against bandwidth saved over the title's lifetime of views — and it is exactly the kind of trade-off the business case article frames in money.
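The trade-off can be put in rough numbers. The sketch below is minimal and uses entirely hypothetical prices. It also makes a deliberately crude assumption: every view streams the whole title at the top rung. A real catalogue needs per-rung view shares and actual compute and CDN rates.

```python
def break_even_views(extra_encode_cost, top_rung_kbps, duration_s,
                     saving_fraction, cdn_cost_per_gb):
    """Views needed before bandwidth savings repay the extra encoding compute.

    Simplification: every view streams the whole title at the top rung.
    """
    gb_per_view = top_rung_kbps * duration_s / 8 / 1e6  # kilobits -> gigabytes
    saved_per_view = gb_per_view * saving_fraction * cdn_cost_per_gb
    return extra_encode_cost / saved_per_view


# Hypothetical inputs: $500 of extra compute, a 2-hour title at 5000 kbps,
# a 30% per-shot saving, and $0.01 per GB of CDN egress.
views = break_even_views(500, 5000, 2 * 3600, 0.30, 0.01)
print(round(views))  # 37037
```

A title expected to draw far more views than the break-even figure favours per-shot encoding; a long-tail title well below it does not.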
Figure 4. Three granularities of the same idea. A fixed ladder treats every title and every shot alike; per-title gives each title its own quality-driven ladder (~20% bitrate saving); per-shot optimizes each shot (~30% saving) at the cost of two orders of magnitude more encodes per title.
What the quality target does not tell you
A quality-driven ladder is a powerful tool, and like every metric-driven decision it has blind spots. Naming them is the difference between using the number and being fooled by it.
A score is meaningless without its model. "VMAF 95" is not one thing. VMAF ships multiple models — a default model tuned for television viewing, a phone model, a 4K model — and the same encode scores differently under each. A ladder built for living-room TVs using the default model can be wrong for a mobile-only audience, which needs the phone model. Always state the model with the target, and match it to how the content is watched. The model-selection nuance is covered in VMAF in depth.
Mean-pooling hides bad shots. A title's single VMAF score is usually the average of its per-frame scores, and an average can be excellent while a few shots are terrible. A two-hour film at VMAF 95 can still contain a thirty-second grain-heavy scene at VMAF 80 that the mean quietly absorbs. Per-shot encoding helps here precisely because it scores each shot, but if you gate a per-title ladder on the mean alone you can ship a visible failure inside an "excellent" number. Look at the low percentile, not just the mean — the reasoning is in pooling per-frame scores.
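The minimal sketch below shows why the pooling method matters, using hypothetical per-frame scores grouped by shot. It reports the plain mean, two nearest-rank low percentiles, and the worst per-shot mean. Which percentile to gate on is a policy choice. A scene that makes up less than that percentile's share of frames can still slip past it, as the bad scene (about 2.8% of frames) does with the 5th percentile here. A per-shot minimum closes that gap.

```python
import math


def nearest_rank_percentile(values, p):
    """p-th percentile (0 < p <= 100) by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical per-frame VMAF, grouped by shot.
shots = {
    "dialogue": [96.0] * 2000,
    "crowd": [95.0] * 1500,
    "grain_night": [80.0] * 100,  # short, hard scene
}
frames = [v for scores in shots.values() for v in scores]

mean = sum(frames) / len(frames)
p5 = nearest_rank_percentile(frames, 5)
p1 = nearest_rank_percentile(frames, 1)
worst_shot = min(shots, key=lambda name: sum(shots[name]) / len(shots[name]))

print(f"mean {mean:.1f}, p5 {p5}, p1 {p1}, worst shot {worst_shot}")
# mean 95.1, p5 95.0, p1 80.0, worst shot grain_night
```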
Comparing across resolutions is a measurement trap. Every rung in a per-title ladder is at a different resolution, but VMAF is defined against a same-size reference. To compare a 540p rung to the 1080p source you must upscale the decoded 540p frame back to the source resolution before scoring — that is how cross-resolution VMAF is computed. Skip that step and the numbers are not on the same scale, and the convex hull you build from them is wrong.
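The sketch below shows the upscale-then-score step as an FFmpeg command. It assumes an FFmpeg build with libvmaf enabled, and the file names are hypothetical. Bicubic is one common upscaling choice, not a requirement stated here. The command uses the default VMAF model; choose the model to match viewing conditions, as discussed above.

```python
import shlex


def cross_resolution_vmaf_cmd(distorted, reference, ref_width, ref_height):
    """FFmpeg argv that upscales a lower-resolution rung before scoring it.

    Input order follows FFmpeg's libvmaf filter: distorted first, reference second.
    """
    graph = (f"[0:v]scale={ref_width}:{ref_height}:flags=bicubic[dist];"
             f"[dist][1:v]libvmaf")
    return ["ffmpeg", "-i", distorted, "-i", reference,
            "-lavfi", graph, "-f", "null", "-"]


cmd = cross_resolution_vmaf_cmd("rung_540p.mp4", "source_1080p.mp4", 1920, 1080)
print(shlex.join(cmd))  # shell-quoted; or pass `cmd` to subprocess.run
```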
Optimizing for the metric is not optimizing for the viewer. If your ladder rewards VMAF and your encoder can raise VMAF with a sharpening or contrast trick that does not actually improve perceived quality, you will get a higher number and no happier viewer — the metric becomes a target and stops being a good measure. This is common enough that Netflix built a specific countermeasure, the no-enhancement-gain model, covered in VMAF-NEG explained. Treat any quality target as a proxy that can be gamed, and validate it against the eye on a sample. The broader catalogue of where objective metrics mislead is in where objective metrics lie.
CRF, QP, and the knob that actually hits a target
One practical question remains: once you know the bitrate and resolution for a rung, how does the encoder produce it at the right quality? The honest answer is that you rarely command a quality target directly; you command a setting that produces it. The two relevant settings are QP (quantization parameter) and CRF (constant rate factor) — both are constant-quality modes that hold a chosen quality roughly steady and let the bitrate vary with scene difficulty, which is exactly the behaviour a quality target wants. Netflix used constant-QP test encodes to map each title's difficulty in 2015, then noted it had "migrated to CRF" with about the same results (David Ronca, Netflix, 2016).
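The sketch below covers the "setting predicts" half. It assumes a handful of hypothetical test encodes of one title at one resolution, VMAF that falls as CRF rises, and linear interpolation between the two encodes that bracket the target. The predicted value is then encoded and measured, which is the "metric confirms" half.

```python
def predict_crf(samples, target):
    """CRF at which interpolated VMAF falls to `target` (the cheapest setting that still meets it).

    samples: (crf, vmaf) pairs from test encodes of one title at one resolution.
    Assumes VMAF falls as CRF rises. Returns None if the target is out of range.
    """
    pts = sorted(samples)  # ascending CRF, descending VMAF
    for (c0, v0), (c1, v1) in zip(pts, pts[1:]):
        if v0 >= target >= v1:
            if v0 == v1:
                return float(c1)
            # Linear interpolation between the two test encodes that bracket the target.
            return c0 + (v0 - target) * (c1 - c0) / (v0 - v1)
    return None


# Hypothetical test encodes.
samples = [(18, 97.0), (22, 95.0), (26, 91.0), (30, 86.0)]
print(predict_crf(samples, 93))  # 24.0
```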
In a real per-title pipeline, the brute-force analysis predicts the CRF (or QP) value that lands each resolution near its target bitrate, often combined with a bitrate cap so a hard scene cannot run away — an arrangement called capped CRF. The details of these rate-control modes — how CBR, VBR, CRF, and capped CRF differ and when to use each — are codec-side mechanics, and they live in the Video Encoding section's rate-control article. For the measurement view, the thing to remember is that CRF is the knob you turn to approximate a quality target, and VMAF is how you verify you actually hit it. The setting predicts; the metric confirms.
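For orientation only, a capped-CRF rung in FFmpeg with libx264 needs three rate-control options. The minimal sketch below assumes the CRF value comes from your per-title analysis and that the cap is the rung's ceiling (6000 kbps here, matching the hypothetical 1080p cap earlier in the article). The 2x buffer size is a common rule of thumb, not a requirement.

```python
def capped_crf_args(crf, maxrate_kbps, bufsize_multiple=2):
    """FFmpeg/libx264 output options for capped CRF.

    CRF sets the quality level; -maxrate with -bufsize (the VBV) caps how far
    a hard scene can push the bitrate. bufsize_multiple is a rule-of-thumb assumption.
    """
    return ["-c:v", "libx264", "-crf", f"{crf:g}",
            "-maxrate", f"{maxrate_kbps}k",
            "-bufsize", f"{maxrate_kbps * bufsize_multiple}k"]


print(capped_crf_args(24.0, 6000))
# ['-c:v', 'libx264', '-crf', '24', '-maxrate', '6000k', '-bufsize', '12000k']
```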
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming and OTT, video conferencing, e-learning, telemedicine, and surveillance — and the bitrate ladder is one of the first things we tune when a client's bandwidth bill or picture quality becomes a problem. We approach it from the quality side described here: pick a VMAF target appropriate to the platform and content, measure each asset to find the lowest bitrate that reaches it, and build the ladder from the convex hull rather than from a copied list of bitrates. For an OTT or e-learning catalogue the saving is mostly bandwidth and storage; for a conferencing or surveillance product, where content is live and there is no pristine reference to score against, the same quality-target thinking shifts to no-reference methods and a different part of the pipeline. Our Block 7 benchmark methodology applies the same measure-then-decide discipline to our own codec tests, so the ladders we recommend rest on numbers, not habit.
What to read next
- The Convex Hull: Optimal Bitrate-Resolution Points
- Setting a Quality Target and a Quality Budget
- VMAF Explained: Netflix's Perceptual Metric
Call to action
- Talk to a video engineer — book a 30-minute scoping call to talk through your per-title encoding plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- Netflix Technology Blog (A. Aaron, Z. Li, A. Norkin, et al.), "Per-Title Encode Optimization," 2015. Tier 4 (vendor engineering blog — the defining work for per-title encoding from the operator who introduced it). Establishes the premise that each title should get a unique bitrate ladder tailored to its complexity, the fixed-ladder failure modes (5800 kbps blocks on hard content, over-serves cartoons), the constant-QP complexity diversity across ~100 titles, the convex-hull method, and the ~20% bitrate saving. Basis for the fixed-ladder section and the per-title definition. https://netflixtechblog.com/per-title-encode-optimization-7e99442b62a2
- VMAF — Video Multi-method Assessment Fusion (repository and documentation), Netflix (Z. Li, C. Bampis, et al.), with the Netflix Tech Blog VMAF posts. Tier 1 (metric-author defining work). The perceptual full-reference metric used as the quality target and the per-shot optimization objective; source of the model families (default / phone / 4K) and the ~6 VMAF points ≈ 1 JND framing (Netflix 2017 guidance). Basis for using VMAF as the target and for the model/pooling caveats. https://github.com/Netflix/vmaf
- Netflix Technology Blog / I. Katsavounidis, "Dynamic optimizer — a perceptual video encoding optimization framework," and "Optimized shot-based encodes: Now Streaming!," 2018. Tier 4 (vendor engineering blog — the defining work for per-shot/shot-based optimization). Source of the per-shot method (optimal resolution + QP per shot maximizing VMAF at a target bitrate), the rate-distortion convex hull per shot, and the "more than two orders of magnitude" increase in encode units. Basis for the per-shot section. https://netflixtechblog.com/dynamic-optimizer-a-perceptual-video-encoding-optimization-framework-e19f1e3a277f
- I. Katsavounidis and L. Guo, "Video codec comparison using the dynamic optimizer framework," SPIE Applications of Digital Image Processing XLI, 2018 (see also arXiv:1808.03898, "Towards Perceptually Optimized End-to-end Adaptive Video Streaming"). Tier 5 (peer-reviewed). Source of the per-shot bitrate-saving figures at equal VMAF: ~28% (x264), ~34% (x265), ~38% (VP9). Basis for the per-shot savings numbers. https://arxiv.org/pdf/1808.03898
- A. Kah, C. Friedrich, T. Rusert, C. Burgmair, W. Ruppel, M. Narroschke, "Fundamental relationships between subjective quality, user acceptance, and the VMAF metric for a quality-based bit rate ladder design for over-the-top video streaming services," SPIE, 2021. Tier 5 (peer-reviewed). Source of VMAF 95 = "on average subjectively indistinguishable from the original," the ~2-VMAF just-distinguishable threshold for rung spacing, and the 55/70 acceptance floors and resulting rung counts (21 free / 13 paid). Basis for the top-rung ceiling and the quality spacing rule. https://www.hs-rm.de/fileadmin/user_upload/SPIE_11842-38_HSRM.pdf
- R. Rassool, "VMAF Reproducibility: Validating a Perceptual Practical Video Quality Metric," IEEE BMSB, 2017. Tier 5 (peer-reviewed). Source of the VMAF ~93 finding — a service encoding to about VMAF 93 can confidently serve the vast majority of its audience with content "indistinguishable from original or with noticeable but not annoying distortion" (MOS 4–5). Basis for the lower end of the 93–95 top-rung band. https://realnetworks.com/sites/default/files/vmaf_reproducibility_ieee.pdf
- Apple, "Technical Note TN2224 — Best Practices for Creating and Deploying HTTP Live Streaming Media" (since folded into the HLS Authoring Specification for Apple Devices). Tier 3 (first-party platform guidance). Source of the rung-spacing rule that adjacent bitrates should be a factor of 1.5 to 2 apart, with the rationale (too close wastes renditions; too far strands the player). Basis for the bitrate spacing rule. https://developer.apple.com/documentation/http-live-streaming/hls-authoring-specification-for-apple-devices
- Recommendation ITU-R BT.500-15, "Methodologies for the subjective assessment of the quality of television pictures," International Telecommunication Union, 2023. Tier 1 (official standard). The subjective-assessment methodology underlying the VMAF-vs-MOS correlations the top-rung thresholds rest on; cited as the ground truth that every VMAF target is ultimately validated against. Basis for the "a metric is a proxy validated against subjective scores" framing. https://www.itu.int/rec/R-REC-BT.500
- G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," ITU-T VCEG document VCEG-M33, 2001. Tier 1 (metric-author defining work). Defines BD-rate, the bitrate difference at matched quality used to quantify per-title and per-shot savings on the rate-quality curve. Cited for the "saving at equal quality" framing and cross-linked to the BD-rate article. https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc
- J. Ozer, "Formulate the Optimal Encoding Ladder with VMAF" (Streaming Learning Center, 2021) and "Identifying the Top Rung of a Bitrate Ladder" (OTTVerse, 2022). Tier 6 (educational / practitioner). The clearest public step-by-step of the Netflix method — top rung at VMAF 93–95, ×0.6 (≈1.66×) rung spacing, highest-VMAF resolution per rung — and the premium-vs-UGC top-rung survey (premium ~95–96, UGC 84–92). Orientation for the three-step ladder procedure. https://streaminglearningcenter.com/encoding/optimal_encoding_ladder_vmaf.html


