AI Inside the Encoder: ML-Based Mode Decision, Partitioning, Scene-Cut, Perceptual Quantization

Why this matters

If your team has ever debated whether to switch from x264 to SVT-AV1, whether a 30% lower bitrate from a new "AI encoder" is real, or whether per-title encoding will save enough CDN cost to justify the engineering work, you are arguing about the four decisions in this article whether you call them that or not. Every "AI-powered" encoder pitch you will read this year — from Google's Argos ASIC, NETINT's VPUs, Beamr Cloud, Bitmovin, and every academic spinout selling a "neural codec speedup" — is built on the same four levers, and once you can name the levers you can ask the questions that separate a real product from a marketing deck.

This article is for the product manager, founder, video operations lead, or engineering manager who needs to make codec decisions, read encoder vendor pitches without getting bluffed, or argue about bitrate ladders with confidence. We start from the place every classical encoder spends most of its time — the partition tree — and work outward to scene cuts and quantization. Each section names the classical method first in plain language, then shows where the ML model slots in, then states the published results so you know what is real and what is a slide.

What "AI inside the encoder" actually means

The phrase "AI inside the encoder" in this article means one specific thing: a machine-learning model — most often a small convolutional network, a gradient-boosted tree, or a support vector machine — embedded inside a classical video encoder like x265, SVT-AV1, libaom, or VVenC, where the model replaces a slow exhaustive search with a fast prediction. The output of the encoder is still a fully-compliant bitstream that any H.264, HEVC, AV1, or VVC decoder in the world can play. The AI never touches the bitstream; it only touches the decision-making logic that the encoder uses to produce the bitstream.

This is a different family of technology from "end-to-end neural codecs" — research codecs like Google's HiFiC, Microsoft's Neural Video Coding (NVC), and the MPEG-NNVC work that replace the entire pipeline with deep networks and emit a non-standard bitstream that needs a neural decoder. ¹ ² End-to-end neural codecs are an active and exciting research area, and they will get their own article in this Learn section. What they are not, in 2026, is the technology shipping inside the encoders that handle the world's video traffic. Almost every byte you watched yesterday went through a classical encoder, and the AI gains in that encoder are local: faster partition decisions, smarter scene cuts, better quantizer maps.

The distinction matters because every "AI encoder" pitch you read needs to be sorted into one of these two buckets before you can evaluate it. A product that claims "30% bitrate savings using AI" while emitting a standard AV1 bitstream is doing one of the four things in this article. A product that emits a bitstream only its own decoder can read is in the other camp and you should ask very different questions about it — chiefly, who is going to install that decoder on every viewer's device.

The four ML drop-ins in classical encoders share a common shape. Each one targets a single decision the classical encoder spends a lot of compute on. Each one replaces an exhaustive or partially exhaustive search with a model that predicts the answer in microseconds. Each one is judged on the same two-axis scoreboard: encoding time saved and BD-rate cost — short for Bjøntegaard Delta Bitrate, the standard way to express "how many more bits do I need to spend at the same quality". ³ A good ML drop-in saves 30–60% encoding time for a BD-rate cost under 2%; an excellent one saves more and costs less.
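
To make the BD-rate axis of that scoreboard concrete, here is a minimal sketch of the idea: compare two rate-quality curves over the quality range they share and report the average bitrate gap. It is a simplification. The standard Bjøntegaard calculation fits a smooth curve through the measured points, whereas this version assumes piecewise-linear interpolation of log-bitrate between them, and the function names are illustrative rather than taken from any tool.

```python
import math

def _interp_log_rate(points, q):
    """Linearly interpolate log10(bitrate) at quality q from sorted (quality, bitrate) points."""
    for (q0, r0), (q1, r1) in zip(points, points[1:]):
        if q0 <= q <= q1:
            t = (q - q0) / (q1 - q0)
            return math.log10(r0) + t * (math.log10(r1) - math.log10(r0))
    raise ValueError("quality outside the measured range")

def bd_rate_percent(anchor, test, steps=1000):
    """Average bitrate change (%) of `test` versus `anchor` at equal quality.

    Each argument is a list of (quality, bitrate) pairs, e.g. (VMAF, kbps).
    A positive result means the test encoder needs more bits for the same quality.
    Simplification: piecewise-linear curves instead of the usual smooth fit.
    """
    anchor, test = sorted(anchor), sorted(test)
    lo = max(anchor[0][0], test[0][0])    # overlapping quality interval
    hi = min(anchor[-1][0], test[-1][0])
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    total = 0.0
    for i in range(steps):                # midpoint-rule average of the log-rate gap
        q = lo + (i + 0.5) * (hi - lo) / steps
        total += _interp_log_rate(test, q) - _interp_log_rate(anchor, q)
    return (10 ** (total / steps) - 1.0) * 100.0

# Made-up measurements: the "fast" encoder spends 2% more bits at every quality level.
baseline = [(30, 1000), (35, 2000), (40, 4000)]
fast = [(q, r * 1.02) for q, r in baseline]
print(f"BD-rate cost: {bd_rate_percent(baseline, fast):.2f}%")
```

A vendor's "under 2% BD-rate" claim is a number of this kind, which is why it only means something alongside the quality metric and the baseline encoder it was measured against.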

[Image: pipeline diagram showing a classical encoder with four ML drop-in points highlighted: partition decision, mode decision, scene-cut detection, and perceptual quantizer map, each feeding into the standard encode loop that emits a compliant bitstream.]

Figure 1. The four places a machine-learning model lives inside a 2026 classical encoder. The bitstream that comes out is fully standard-compliant; the AI only changes which decisions the encoder explores and how fast it picks the winner.

The partition tree — the largest compute sink in modern codecs

To understand why partitioning is the place every encoder team puts its first ML model, you have to look at where the encoder spends its time. In an HEVC encoder, the coding tree unit (CTU) — the largest block of pixels the encoder treats as a unit — is up to 64×64 samples and can be split recursively into smaller coding units (CUs) down to 8×8. ⁴ In AV1 the unit is the superblock, up to 128×128, splittable into much smaller partitions. In VVC the unit is again called a CTU, and it uses a QT-MTT (Quadtree plus Multi-Type Tree) structure that adds binary and ternary splits to the classical quadtree, multiplying the number of partition shapes by an order of magnitude over HEVC. ⁵

The encoder's job at each level of that tree is to decide whether to split the block further or stop and encode it as one. Stopping early saves bits on the partition syntax but spends bits on a less-optimal prediction. Splitting deeper buys more accurate prediction at the cost of more partition bits and more compute. The classical answer is rate-distortion optimisation (RDO): try both, measure the cost in bits and the cost in distortion, pick the winner. Doing that recursively, on a CTU that can split into hundreds of leaf shapes, is why partition decisions account for 60–80% of the wall-clock time inside a modern HEVC, AV1, or VVC encoder. ⁶ Cut that share in half and you have cut the cost of the entire encode by 30–40%.
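
The "try both, pick the winner" recursion can be sketched in a few lines. This is a toy quadtree-only version under stated assumptions: `evaluate_leaf` is a hypothetical callback standing in for the real prediction, transform, and entropy-coding trial, the split flag is assumed to cost one bit, and the cost is the usual Lagrangian J = D + λR.

```python
def rd_cost(distortion, bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * bits

def best_partition(x, y, size, min_size, lam, evaluate_leaf, split_flag_bits=1):
    """Return (cost, tree) for the cheapest quadtree partition of one block.

    evaluate_leaf(x, y, size) -> (distortion, bits) is a hypothetical stand-in
    for a full encode trial of the block as a single unit.
    `tree` is either the block size (leaf) or a list of four sub-trees.
    """
    d, r = evaluate_leaf(x, y, size)
    if size <= min_size:
        return rd_cost(d, r, lam), size           # smallest block: no split flag needed
    no_split = rd_cost(d, r + split_flag_bits, lam)
    half = size // 2
    split = rd_cost(0, split_flag_bits, lam)      # cost of signalling the split itself
    children = []
    for dy in (0, half):
        for dx in (0, half):
            cost, tree = best_partition(x + dx, y + dy, half, min_size, lam,
                                        evaluate_leaf, split_flag_bits)
            split += cost
            children.append(tree)
    return (split, children) if split < no_split else (no_split, size)

# Toy content: one small detail near the top-left corner, flat everywhere else.
def detail(x, y, size):
    covers = x <= 3 < x + size and y <= 3 < y + size
    return (10000.0 if covers and size > 8 else 0.0), 10

print(best_partition(0, 0, 16, 8, 1.0, detail))
```

Note that the function evaluates every block at every depth before it can compare costs. That exhaustive visit is precisely the work an ML gate tries to avoid.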

This is the place where ML pays for itself fastest, and the published results are striking. A 2023 MDPI Electronics paper on VVenC reported a LightGBM-based fast partitioning scheme that saved between 30.21% and 82.46% of encoding time depending on the preset, at a BD-rate cost of 0.67% to 3.01%. ⁷ A 2024 Journal of Real-Time Image Processing paper on a similar approach for libaom-AV1 — LACCO, a learning-based AV1 complexity controller — reported comparable speedups at HD and UHD resolutions. ⁸ These are not toy numbers from a research lab; they are the published basis of the partition-decision modules that ship in the latest releases of VVenC and reference AV1 implementations.

The mechanics, stripped of the math, work like this. The encoder extracts a small set of features from the current block — variance, average gradient, texture energy, neighbour-block decisions, motion vector magnitude — and feeds them to a pre-trained classifier. The classifier outputs a single number per partition mode: the probability that this mode will win the rate-distortion contest. The encoder then either skips the modes whose probability is below a threshold, or runs RDO on only the top two or three candidates. The unexplored modes are cut from the search tree entirely, and the encoder runs much faster.
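
Those mechanics can be sketched as follows. Everything model-specific here is an assumption for illustration: the two features, the per-mode logistic weights in `WEIGHTS`, the mode names, and the 0.2 threshold are invented, whereas a real encoder would load weights trained offline and use a richer feature set.

```python
import math

def block_features(block):
    """Variance and mean absolute neighbour gradient of a 2-D luma block."""
    h, w = len(block), len(block[0])
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    grad = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                grad += abs(block[y][x + 1] - block[y][x])
            if y + 1 < h:
                grad += abs(block[y + 1][x] - block[y][x])
    return variance, grad / len(pixels)

# Hypothetical per-mode logistic weights: (bias, w_variance, w_gradient).
WEIGHTS = {
    "NONE":  (2.0, -0.004, -0.10),
    "SPLIT": (-2.0, 0.004, 0.10),
    "HORZ":  (-1.5, 0.002, 0.05),
    "VERT":  (-1.5, 0.002, 0.05),
}

def mode_probabilities(features):
    """One independent 'will this mode win?' probability per partition mode."""
    variance, gradient = features
    probs = {}
    for mode, (bias, wv, wg) in WEIGHTS.items():
        z = bias + wv * variance + wg * gradient
        probs[mode] = 1.0 / (1.0 + math.exp(-z))
    return probs

def modes_to_search(block, threshold=0.2):
    """Partition modes that still get a full RDO trial after gating."""
    probs = mode_probabilities(block_features(block))
    keep = [m for m, p in probs.items() if p >= threshold]
    # Never prune everything: fall back to the single most likely mode.
    return keep or [max(probs, key=probs.get)]

flat = [[128] * 8 for _ in range(8)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
print(modes_to_search(flat))      # flat block: only the unsplit mode survives
print(modes_to_search(checker))   # busy block: the unsplit mode is pruned
```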

Three model families appear in production encoders today. Convolutional neural networks are used when the encoder can afford to look at the raw pixels of the block: a small CNN with one or two convolutional layers reads the 64×64 or 128×128 patch and outputs a partition prediction directly. ⁹ CNNs give the best accuracy but cost the most compute per call, so they are reserved for the slowest, highest-quality presets. Gradient-boosted trees like LightGBM and XGBoost cover most of the remaining cases: they read a hand-crafted feature vector and run in tens of nanoseconds, fast enough to call several times per CTU without dominating the encoder's runtime. ⁷ Support vector machines and small fully-connected networks with one or two hidden layers and 16 to 64 nodes per layer are the third family; that is exactly the structure SVT-AV1 ships in production, where the network is called before every partition mode to decide if it can be skipped, and an SVM consults the result of the "None" mode to decide whether to terminate the whole partition search. ¹⁰ ¹¹

The trade-off the encoder team has to manage is the same one a chess engine team manages. A bigger, smarter model makes better partition decisions and so produces smaller bitstreams at the same quality — but it also costs more compute per call. Past a certain point the model itself becomes the new bottleneck. The reason SVT-AV1's networks are tiny (16–64 nodes, one or two hidden layers) is that the encoder is making millions of partition decisions per second on a 4K stream, and a 100-million-parameter model called that often would simply stall. The art is in finding a model just big enough to beat the classical heuristic and small enough not to become the new constraint.

[Image: diagram showing a 128×128 AV1 superblock being recursively split into smaller partitions, with a small neural-network ML gate at each branch deciding whether to split further or skip the search, and a histogram showing the encoder time saved at each tree depth.]

Figure 2. A typical AV1 superblock partition tree with ML gating. At each split decision, a tiny classifier predicts whether full RDO is worth running. Most blocks settle on a coarse partition early and skip the deeper search entirely; the encoder's wall-clock cost collapses to roughly the cost of evaluating the winning branch.

A small example fixes the orders of magnitude. The classical HEVC encoder evaluates the 64×64 CTU at one depth, then the four 32×32 CUs at the next, then the sixteen 16×16 at the next, then the sixty-four 8×8 at the deepest, and at each block it tests several intra-prediction modes and an inter-prediction mode for both reference lists. A naive count is 1 + 4 + 16 + 64 = 85 blocks, each tested with maybe 30 + 1 modes, so roughly 2,635 RDO evaluations per CTU. A 4K HEVC frame has (3840 × 2160) ÷ (64 × 64) ≈ 2,025 CTUs, so a single frame at the slowest preset is on the order of 5.3 million RDO evaluations. A typical fast-partitioning ML model cuts that count by 60%, leaving roughly 2.1 million. At 60 fps over a 10-minute clip (36,000 frames), the saved evaluations, about 3.2 million per frame, add up to more than 100 billion. That is the entire performance story.
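
The same back-of-the-envelope count, written out so the inputs are easy to change. The figures are the ones used in the paragraph above (31 modes per block, a 60% reduction, an area-based CTU count); none of them is measured from a real encoder. Carried through to the full clip, the saving comes to roughly 1.15 × 10^11 evaluations.

```python
# Blocks visited per 64x64 CTU across depths 0..3: 1 + 4 + 16 + 64.
blocks_per_ctu = sum(4 ** depth for depth in range(4))
modes_per_block = 30 + 1                       # rough intra + inter count from the text
rdo_per_ctu = blocks_per_ctu * modes_per_block

# Area-based CTU count for a 3840x2160 frame, as in the text.
ctus_per_frame = (3840 * 2160) // (64 * 64)
rdo_per_frame = rdo_per_ctu * ctus_per_frame

ml_reduction = 0.60                            # assumed share of evaluations skipped
saved_per_frame = ml_reduction * rdo_per_frame
remaining_per_frame = rdo_per_frame - saved_per_frame

frames = 60 * 60 * 10                          # 60 fps for 10 minutes
saved_per_clip = saved_per_frame * frames

print(f"RDO evaluations per CTU:    {rdo_per_ctu:,}")
print(f"RDO evaluations per frame:  {rdo_per_frame:,}")
print(f"Remaining per frame (ML):   {remaining_per_frame:,.0f}")
print(f"Saved over the whole clip:  {saved_per_clip:.3e}")
```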

The catch, and the place every encoder team gets bitten, is training data. A partition model trained on a particular content distribution (say, animation, or sports, or talking-head conferencing) will generalise badly to content it was not trained on. In production this is mitigated by training on a deliberately diverse corpus that covers screen content, animation, live sports, fiction, and user-generated camera footage, and by keeping the model conservative: a model that errs toward "don't skip" loses speed but does not lose quality. A model that errs toward "skip" keeps its speed but pays for every wrong skip in BD-rate. SVT-AV1's networks are trained on hours of curated content drawn from Netflix's internal pool, which is one of the reasons the encoder generalises as well as it does. ¹⁰ If a vendor claims a "20% encoding speedup with no quality loss" and cannot tell you what content their model was trained on, you have learned something useful about the vendor.

Mode decision — picking which prediction wins

Once the partition is decided, the encoder still has to pick the prediction mode for each leaf block. In HEVC, an intra-coded block can use one of 35 intra modes: 33 angular directions plus DC and planar. In AV1 the number is similar but with different angles and an additional set of recursive intra modes. In VVC the total grew to 67 intra modes (65 angular plus DC and planar), alongside a half-dozen new tools like multiple reference lines, intra sub-partitioning, and matrix-based intra prediction (MIP). ¹² Picking the right mode is again an RDO problem (evaluate each candidate, measure cost-in-bits versus distortion, take the winner) and again it is expensive.

ML drop-ins here look a lot like the partition models, with one important architectural choice: gating versus ranking. A gating model takes the block's features and outputs, for each candidate mode, a probability of being the winner; modes below a threshold are dropped. A ranking model takes the same features and outputs an ordered list of the top-K most-likely winners; the encoder runs full RDO only on the top K. In practice almost every production deployment uses ranking, because gating has a long tail of rare but expensive failures: when the true winner sits in the dropped set, the resulting block is poorly predicted and the BD-rate cost spikes for that block specifically. Ranking lets the encoder cap K at 3 or 5, accept a small constant cost per block, and still capture more than 95% of the speedup. ¹³
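
The difference between the two designs fits in a few lines. This is only a sketch: the mode names and probabilities are invented, and a real encoder would follow either step with full RDO on the surviving candidates.

```python
def gate(probabilities, threshold):
    """Gating: keep every mode whose predicted win probability clears the threshold."""
    return [m for m, p in probabilities.items() if p >= threshold]

def rank_top_k(probabilities, k):
    """Ranking: keep the k most likely modes, whatever their absolute probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:k]

# A block where the model is confident, and one where it is not (made-up numbers).
confident = {"DC": 0.70, "PLANAR": 0.15, "ANG_10": 0.10, "ANG_26": 0.05}
ambiguous = {"DC": 0.27, "PLANAR": 0.26, "ANG_10": 0.24, "ANG_26": 0.23}

for name, probs in (("confident", confident), ("ambiguous", ambiguous)):
    print(name, "gate:", gate(probs, 0.3), "rank:", rank_top_k(probs, 3))
```

On the ambiguous block the gate leaves nothing to search at this threshold, so the encoder needs a fallback and the true winner is easily lost. The ranker always returns exactly K candidates, which is what makes its per-block cost constant and predictable.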

The published numbers are again in the same band as partition models. A 2021 paper from Circuits, Systems, and Signal Processing on Adaptive CU Mode Selection in HEVC Intra Prediction with a deep-learning approach reported encoder time reductions up to 66.89% with a BD-rate loss of 1.31% over the state-of-the-art classical baseline, on HEVC at the slowest preset. ¹³ A 2024 PMC review of CNN-based intra-prediction approaches for HEVC and VVC summarised typical results in the 40–70% time-saving and 0.5–2.5% BD-rate-loss ranges, depending on resolution and content. ¹⁴

A useful pattern from the mode-decision literature is early-exit cascading. The encoder runs a cheap classical heuristic first, and only calls the ML model if the heuristic's confidence is low. For blocks where the variance and motion are both small, the planar or DC intra mode is almost always correct, and the encoder takes it directly. For blocks where the gradient is strong and oriented, the encoder tests the dominant angle plus its neighbours and stops. The ML model is reserved for the ambiguous middle — blocks where the classical heuristic cannot pick a clear winner and the encoder would otherwise have to fall back on full RDO. This pattern is why the average speedup of an ML drop-in is much larger than the cost of the model itself: most blocks never call the model at all.
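
A minimal sketch of that cascade, assuming HEVC-style angular mode numbers (2 to 34). The thresholds and the `ml_rank` callback are hypothetical; the point is only the order of the checks and the fact that the model is the last resort.

```python
def choose_candidates(variance, motion, gradient_strength, dominant_angle, ml_rank,
                      flat_var=25.0, still_motion=0.5, strong_grad=40.0):
    """Return (candidate_modes, source) for one block.

    ml_rank is a hypothetical zero-argument callable that runs the ML ranker;
    it is only invoked when both cheap heuristics decline to answer.
    All thresholds are illustrative, not taken from any encoder.
    """
    # 1. Flat and still: planar or DC is almost always right.
    if variance < flat_var and motion < still_motion:
        return ["PLANAR", "DC"], "heuristic-flat"
    # 2. Strong, oriented gradient: test the dominant angle and its neighbours.
    if gradient_strength > strong_grad:
        angles = (dominant_angle - 1, dominant_angle, dominant_angle + 1)
        return [f"ANG_{a}" for a in angles if 2 <= a <= 34], "heuristic-edge"
    # 3. Ambiguous middle: only now pay for the model.
    return ml_rank(), "model"

model_calls = []
def fake_ranker():
    model_calls.append(1)
    return ["ANG_18", "DC", "ANG_10"]

print(choose_candidates(4.0, 0.1, 5.0, 10, fake_ranker))
print(choose_candidates(900.0, 0.1, 80.0, 26, fake_ranker))
print(choose_candidates(300.0, 2.0, 10.0, 26, fake_ranker))
print("model calls:", len(model_calls))
```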

Common mistake: assuming "AI mode decision" means a giant network

A specific pitfall sits at the intersection of partition and mode decision, and product teams new to the space fall into it routinely. They read an "AI encoder" pitch, assume the AI is a hundred-million-parameter transformer, and either reject the technology as too heavy to deploy at scale or assume it will be transformative.

Neither reaction is right. The networks that ship in production encoders are tiny by modern ML standards — typically thousands to low tens of thousands of parameters, one or two hidden layers, a handful of inputs. SVT-AV1's partition networks fit comfortably in a CPU cache. ¹⁰ They are not "AI" in the consumer sense of generative models; they are classical pattern classifiers, trained once offline on a large content corpus and called millions of times per second in production. They do not consume meaningful additional CPU or memory at runtime, and they do not require a GPU. A vendor who tells you their encoder needs an H100 to run is either using ML in a fundamentally different place (e.g., for upscaling or denoising before the encoder) or has built something that will not scale.

The correct test for an "AI inside the encoder" claim is therefore: what specific decision does the model make, how big is it, and what's the BD-rate trade-off in their own published numbers? If a vendor cannot answer those three questions concretely, the AI in the pitch is marketing.

Scene-cut detection — where the I-frames go

The third place AI lives in the encoder is scene-cut detection, also called scene change detection or scenecut. The goal is to put a fresh I-frame — a frame that does not depend on any other frame for decoding — at every place the content changes enough that a P-frame or B-frame could not predict the new content from the old one. Get the placement right and the encoder spends bits efficiently and the stream supports clean seeking; get it wrong and you either spend too many bits (an I-frame inside a static shot) or too few (no I-frame at a sharp cut, so the next several frames are badly predicted and look bad). ¹⁵

The classical algorithm in x264, x265, and libaom is a simple frame-difference heuristic. For each frame in the lookahead window, the encoder computes a metric — usually a sum of absolute differences between the current frame and the previous one, weighted by a low-pass-filtered version of the frame to suppress noise — and compares it to a threshold. ¹⁵ In x264 the threshold is exposed as --scenecut and defaults to 40; values higher than 40 make the encoder more eager to declare a cut and place an I-frame, values of 0 disable cut detection entirely. The same logic with slightly different constants runs in x265 and SVT-AV1.

The classical heuristic works most of the time, and it is fast enough that the savings from replacing it with an ML model are smaller in absolute terms than for partition or mode decision. But the errors the heuristic makes are systematic and expensive. It miscounts cuts in three places where humans never would: fades and dissolves (slow scene transitions where consecutive frame differences are small but the content is in fact changing), fast pans and high motion (where consecutive frame differences are huge but the scene has not actually changed), and content with deliberate sharp transitions inside a single shot like flashes, lightning, or strobing — where the classical detector inserts an I-frame for a single bright frame and then immediately needs another one when the scene returns to normal. ¹⁵
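
A toy threshold detector shows two of these failure modes directly. This is not the code of x264 or any other encoder: frames are flat lists of luma samples, the metric is a plain mean absolute difference, and the threshold of 30 is an arbitrary value on that scale, not the `--scenecut` scale.

```python
def mean_abs_diff(prev, cur):
    """Mean absolute luma difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)

def detect_cuts(frames, threshold=30.0):
    """Indices of frames that differ from their predecessor by more than the threshold."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) > threshold]

dark, bright = [10] * 16, [200] * 16

hard_cut = [dark, dark, dark, bright, bright]
flash = [dark, dark, bright, dark, dark]              # one bright frame inside a shot
fade = [[10 + 10 * i] * 16 for i in range(20)]        # slow transition from dark to bright

print("hard cut:", detect_cuts(hard_cut))
print("flash:   ", detect_cuts(flash))
print("fade:    ", detect_cuts(fade))
```

The hard cut is found once, as intended. The flash produces two back-to-back cuts, and the fade produces none even though the first and last frames have nothing in common.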

An ML scene-cut detector replaces the hand-coded threshold with a small classifier trained on labelled cuts. The classifier sees the same frame-difference metric plus a handful of others — colour histogram distance, luminance histogram distance, motion-vector statistics, edge density, the same low-pass-filtered residual the classical detector uses — and outputs a probability that this position is a true scene boundary. Adaptive group-of-pictures (GOP) algorithms then use that probability to decide whether to start a new GOP at this frame, extend the current GOP, or insert an open-GOP boundary that is cheaper than a closed I-frame. ¹⁶

Production deployments are split. Open-source encoders — x264, x265, SVT-AV1, libaom — still ship classical detectors with hand-tuned thresholds, partly because the BD-rate gain from an ML detector is modest (research papers report 1–3% on long content with many cuts, lower on continuous content) and partly because retraining and shipping a model is more operational overhead than tuning a constant. Commercial cloud encoders — Bitmovin, AWS Elemental, Beamr, NETINT — increasingly use ML scene-cut as part of broader content-aware encoding pipelines, where the detector's output also drives per-shot bitrate ladders and adaptive lookahead lengths. ¹⁷ ¹⁸ The latter is where ML scene-cut earns its keep: not in the BD-rate of a single GOP, but in feeding the per-shot encoder downstream that does its own per-segment rate-distortion optimisation. ¹⁹

Perceptual quantization — the deepest old-school AI

The fourth and oldest member of this family is perceptual quantization — the part of the encoder that decides how many bits to spend on each block by predicting where the human visual system will and will not notice the loss. The "AI" here pre-dates the current ML wave by twenty years, but the underlying logic is the same: a model of human perception trained on data, used at runtime to redistribute bits inside the frame.

The classical quantizer is one number per slice — a QP, or quantization parameter — that controls how aggressively the encoder rounds away the transform coefficients in every block of that slice. ²⁰ A perceptual quantizer adjusts the QP per block: it spends more bits (lower QP) on blocks where the eye will notice the loss and saves bits (higher QP) on blocks where the eye will not. The savings inside the frame can be redistributed to harder regions, so the total bitrate stays the same and the perceived quality improves.

Four mechanisms in modern open-source encoders implement this in slightly different ways, and you should know all four because every "perceptual encoding" pitch sits on top of one of them.

Variance-based adaptive quantization (the original x264 aq-mode 1, ported to x265 and SVT-AV1) computes the variance of each macroblock and lowers QP for low-variance (flat) regions while raising QP for high-variance (textured) regions. ²⁰ The motivation: the eye is sensitive to banding and contour artefacts in smooth regions but tolerant of detail loss in textured ones. The relation used is qscale_new = qscale × variance^C, where C is a small positive constant tuned through experiment, so higher variance means a coarser quantizer. ²⁰ The mode has shipped in x264 since 2009 and remains the default in most slow-preset configurations because it consistently improves perceived quality at no measurable bitrate cost.
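
The qscale relation can be turned into per-block QP offsets with a short sketch. Several things here are assumptions: the exponent value 0.1 is invented, its sign is chosen so that textured blocks get a higher QP and flat blocks a lower one, the conversion relies on the H.264/HEVC convention that the quantizer step doubles every 6 QP, and the re-centring step is a simplification to keep the frame's average QP unchanged.

```python
import math

def aq_qp_offsets(block_variances, c=0.1):
    """Per-block QP offsets from qscale_new = qscale * variance**c.

    With the quantizer step doubling every 6 QP (H.264/HEVC convention),
    scaling qscale by variance**c shifts QP by 6 * c * log2(variance).
    The value of c is a hypothetical constant, not an encoder default.
    """
    raw = [6.0 * c * math.log2(max(v, 1.0)) for v in block_variances]
    mean = sum(raw) / len(raw)
    # Re-centre so bits are redistributed inside the frame, not added or removed overall.
    return [r - mean for r in raw]

# Flat sky, ordinary detail, dense foliage (made-up variances).
for variance, offset in zip([1.0, 100.0, 10000.0], aq_qp_offsets([1.0, 100.0, 10000.0])):
    print(f"variance {variance:>8.0f} -> QP offset {offset:+.2f}")
```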

Macroblock-tree (the x264 --mbtree option) is a more sophisticated version that tracks how each macroblock's quality propagates forward in time through inter-frame prediction. ²¹ Blocks that get referenced by many later P- and B-frames receive proportionally more bits, because their distortion will be inherited by everything that predicts from them. Blocks that are referenced rarely or not at all (the rightmost moving objects, blocks at the edge of a scene) receive proportionally fewer. The result is "adaptive B-frame quantizer offset" — in static scenes, B-frames end up with QPs 4–6 points higher than the surrounding P-frames; in fast-motion scenes the difference collapses to near zero. ²¹ The visible effect is sharper backgrounds and slightly softer fast-moving foreground, which most viewers prefer.
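
The propagation idea can be sketched on a toy sequence. This is a simplified illustration in the spirit of macroblock-tree, not x264's implementation: it assumes zero motion (each block predicts only from the same position in the previous frame), takes the intra and inter cost estimates as given inputs, and uses a hypothetical `strength` constant.

```python
import math

def mbtree_qp_offsets(intra_costs, inter_costs, strength=2.0):
    """QP offsets per [frame][block]; more negative means more bits.

    intra_costs[t][b] and inter_costs[t][b] are estimated coding costs of
    block b in frame t. Assumes block b of frame t references block b of
    frame t-1, and that every intra cost is positive.
    """
    n_frames, n_blocks = len(intra_costs), len(intra_costs[0])
    propagate = [[0.0] * n_blocks for _ in range(n_frames)]
    # Walk backwards in time, pushing each block's importance onto its reference.
    for t in range(n_frames - 1, 0, -1):
        for b in range(n_blocks):
            intra = intra_costs[t][b]
            inter = min(inter_costs[t][b], intra)
            inherited = (intra - inter) / intra    # share of information taken from the reference
            propagate[t - 1][b] += (intra + propagate[t][b]) * inherited
    return [[-strength * math.log2((intra_costs[t][b] + propagate[t][b]) / intra_costs[t][b])
             for b in range(n_blocks)]
            for t in range(n_frames)]

# Block 0 is a static background (cheap to predict); block 1 changes every frame.
intra = [[100.0, 100.0]] * 3
inter = [[10.0, 100.0]] * 3
for t, row in enumerate(mbtree_qp_offsets(intra, inter)):
    print(f"frame {t}: " + ", ".join(f"{q:+.2f}" for q in row))
```

The static block's offset is most negative in the first frame, because every later frame inherits from it, while the constantly changing block gets no adjustment at all.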

Psy-RD (x265's --psy-rd and --psy-rdoq, which default to 2.0 and 0.0 respectively, with --psy-rdoq raised to 1.0 in the slow and slower presets) directly modifies the rate-distortion cost function so that the encoder penalises reconstructions whose energy (high-frequency content) differs from the source. ²² In a normal RDO, two candidates with the same rate and the same SSE (sum of squared errors) get the same cost. With Psy-RD, the candidate that better preserves the texture of the source wins, even at the cost of a higher SSE. The user-visible effect is that compressed video keeps its grain and film texture instead of being smoothed into glossy plastic, which dramatically improves perceived quality on cinematic content. By default, x265 always tunes for highest perceived visual quality, which means these psycho-visual optimisations are enabled by default; users who want maximum PSNR (for benchmarks or transcoding to a different codec) have to explicitly turn them off. ²²
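
A toy cost function shows how an energy term changes which candidate wins. The energy measure used here (sum of absolute deviations from the block mean) is a stand-in chosen for brevity, not the measure x265 uses, and `psy_strength` is a hypothetical weight rather than the `--psy-rd` scale.

```python
def sse(src, rec):
    """Sum of squared errors between source and reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(src, rec))

def ac_energy(block):
    """Stand-in texture measure: total absolute deviation from the block mean."""
    mean = sum(block) / len(block)
    return sum(abs(p - mean) for p in block)

def psy_rd_cost(src, rec, bits, lam, psy_strength):
    """J = SSE + lambda * R + psy_strength * |energy(src) - energy(rec)|."""
    psy_penalty = abs(ac_energy(src) - ac_energy(rec))
    return sse(src, rec) + lam * bits + psy_strength * psy_penalty

src = [100, 120, 100, 120]         # fine texture
blurred = [110, 110, 110, 110]     # low SSE, texture gone
shifted = [100, 120, 120, 100]     # higher SSE, texture energy preserved

for strength in (0.0, 20.0):
    costs = {name: psy_rd_cost(src, rec, bits=8, lam=1.0, psy_strength=strength)
             for name, rec in (("blurred", blurred), ("shifted", shifted))}
    print(f"psy_strength={strength}: winner is {min(costs, key=costs.get)} {costs}")
```

With the psy term switched off, the blurred candidate wins on SSE alone. With it switched on, the candidate that keeps the texture wins despite its higher SSE, which is the behaviour the paragraph describes.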

Just-noticeable-distortion (JND) and visual-attention-guided quantization are the modern AI additions. A JND model uses a CNN to predict, per region, the smallest distortion the average viewer would notice; a saliency model — usually a small CNN trained on eye-tracking data — predicts which regions of the frame viewers will actually look at, and bumps the bit allocation there. ²³ A 2023 KTH thesis on visual-attention-guided adaptive quantization for x265 reported BD-rate improvements on the order of 3–8% on perceptual metrics like VMAF, at no extra cost in PSNR. Saliency models are the place where modern deep learning has the largest impact on encoder output for an end-viewer's perceived quality, and they are increasingly the default in cloud encoder offerings like Bitmovin's per-title pipeline and AWS Elemental's MediaConvert. ¹⁷

The pitfall here is the easiest of all four to fall into: never benchmark a perceptual encoder against PSNR. Every one of the four mechanisms above will raise PSNR-measured distortion in some blocks on purpose, because the bits saved there are spent elsewhere to improve perceived quality. A vendor who runs their encoder against PSNR or its variants and reports a small loss is doing it right; a vendor who reports a PSNR gain has probably turned off the perceptual mode for the benchmark. The right metrics for perceptual encoders are VMAF, SSIMULACRA2, and human MOS — not PSNR. See our article on Objective Quality Metrics for why these metrics exist and when to use which.

[Image: bar chart comparing the four perceptual quantization mechanisms in modern encoders: variance-AQ, mbtree, psy-rd, and AI-saliency, with three columns each showing (1) typical BD-rate improvement on VMAF, (2) compute overhead, and (3) which open-source encoder ships it.]

Figure 3. The four perceptual quantization mechanisms in 2026 open-source encoders. Variance-AQ and mbtree are the classical baseline that ships with x264, x265, and SVT-AV1. Psy-RD is x265's signature feature. AI saliency is the modern addition, available in commercial cloud encoders and in research forks of the open-source projects.

How these four work together inside a real encoder

In a 2026 production encoder, all four ML drop-ins run at different points in the pipeline and feed each other.

The scene-cut detector runs first, in the lookahead pass, before the encoder commits to any block-level decisions. It marks frame boundaries and assigns each new shot a complexity score based on the lookahead's analysis of motion and texture inside the shot. The score becomes the input to two downstream systems: the per-shot bitrate ladder, and the scene-aware GOP structure.

The partition model runs next, at the start of every CTU or superblock in the frame. It reads the block's features plus the scene-complexity score from the lookahead and outputs the set of partition shapes worth exploring. In SVT-AV1 the same network is called recursively at every split level, and the SVM gates the entire search when the "None" mode already explains the block well. ¹⁰

The mode-decision model runs at every leaf of the partition tree. It ranks the candidate prediction modes (intra angles, inter references, transform types) and the encoder runs full RDO on the top K. The exact K is a preset-level decision: in fast presets, K is 1 or 2; in slow presets, K is 5 or 6.

The perceptual quantizer runs after the mode decision but before the actual bit emission, and decides how much QP offset to apply to each block based on the mix of variance-AQ, mbtree, Psy-RD, and (in commercial encoders) saliency. The QP map feeds back into rate control, which adjusts the overall bit budget to keep the buffer model on target.

All four together turn an encoder that would otherwise spend 100% of its time on exhaustive search into one that finishes in maybe 30% of that time, at a BD-rate cost typically under 3%, often under 1.5%. That is the entire encoder-side story of the last decade, and 2026's "AI encoding" headlines are almost always one or more of these four levers being incrementally improved.

[Image: pipeline diagram showing a 4K frame entering an SVT-AV1-style encoder, with stages labelled lookahead and scene-cut, partition gate, mode-decision rank, perceptual QP map, and entropy coder, and arrows showing the ML decisions feeding into rate control.]

Figure 4. The data flow of a modern encoder with all four ML drop-ins active. The scene-cut detector and partition model do the heavy lifting on encoder speed; the mode-decision and perceptual quantizer do the heavy lifting on encoder output quality.

Where the hardware story fits in

Two separate hardware threads run alongside the software story above, and they are easy to confuse.

The first is encoder ASICs, purpose-built silicon that runs the encoding pipeline directly in hardware. Google's Argos Video Coding Unit (VCU) is the clearest public example: each VCU card carries two Argos ASICs, each with 10 encoder cores, and a single 20-VCU machine replaces multiple racks of CPU-only systems for VP9 encoding. ²⁴ ²⁵ Google reports a 20–33× compute-efficiency improvement over the previous CPU-based system. ²⁴ The original Argos shipped with H.264 and VP9 only; the follow-up generation adds AV1 and ML inference hardware. ²⁶ On the merchant-silicon side, NETINT VPUs (Video Processing Units) have shipped more than 200,000 units in production and encoded more than 1 trillion minutes of video, with a single Quadra T2A card processing 320+ live 1080p streams in a 1RU server. ¹⁸ NETINT's public roadmap statements through 2025 confirm that the next VPU generation adds on-chip AI upscaling and content analysis. ²⁷

The interesting question for product teams is what an encoder ASIC actually does with the ML drop-ins from this article. The answer is: it implements them either as fixed-function pipelines or as small ML accelerators alongside the encoder cores. The partition decision, mode decision, scene-cut detection, and perceptual quantization all run in silicon, but they implement the same algorithms the software encoders use. An ASIC is not a different encoding paradigm; it is the same algorithms in a different package, with a 20–30× density and energy advantage. That advantage matters enormously at hyperscale — YouTube processes the equivalent of millions of CPU-hours per day — but it does not change the encoder's quality envelope.

The second thread is AI-assisted pre-processing — denoising, grain modelling, super-resolution, content analysis — that runs before the encoder. This belongs in the upcoming AI Video knowledge base, not in this article, but it is worth flagging because vendor pitches frequently mix the two threads. A claim like "our AI encoder gives 40% bitrate reduction" needs to be checked: is the AI inside the encoder doing one of the four things in this article, or is the AI in a pre-processor cleaning up the source so the encoder has an easier job? Both are legitimate, but they are very different products with very different operational implications.

Where Fora Soft fits in

Fora Soft has been building video streaming, conferencing, surveillance, e-learning, telemedicine, OTT, and AR/VR systems since 2005. The four ML drop-ins in this article live below the API surface of every modern open-source and commercial encoder we integrate into customer products. We do not train custom partition models or ship our own VPU silicon. What we do is build the systems that use the encoders — choosing the right preset, the right perceptual mode, the right bitrate ladder, the right CDN, and the right player — so that the underlying ML inside the encoder turns into a measurable quality and cost result for the end user. When a customer comes to us with "our encoder is too slow" or "our bitrate ladder is too expensive", the first question we ask is which of the four levers above is set wrong.

A worked example: SVT-AV1 medium-vs-slow preset

A small concrete example shows how the four levers combine. SVT-AV1 ships with presets numbered 0 (slowest, highest quality) to 13 (fastest); -preset 8 is the rough equivalent of x264's medium, and -preset 4 is roughly slow. The difference between the two presets is almost entirely a different configuration of the four levers in this article:

Partition model gating is more aggressive in preset 8 (more blocks skip the full RDO search) and less aggressive in preset 4. The partition network has to be more confident before it skips a search in preset 4, so fewer blocks are pruned.
Mode-decision ranking evaluates only the top 2 candidates in preset 8 and the top 5 in preset 4.
Scene-cut detection uses the same algorithm in both presets but with a longer lookahead in preset 4 (more frames are scored before commitment).
Perceptual quantization is on in both presets, with the variance-AQ and mbtree-equivalent terms tuned more aggressively in preset 4.

A 4K, 60 fps, 10-minute Big Buck Bunny encode at the same VBV-budget on a typical server CPU runs roughly 4× slower at preset 4 than at preset 8 — and produces a stream roughly 6–8% smaller at the same VMAF. That 6–8% is the headline result every "AI encoder" marketing slide is comparing itself to. A vendor claiming "30% bitrate savings using AI" on the same content has to either deliver something genuinely better than slow-preset SVT-AV1 (rare, and easy to test) or be comparing themselves to a baseline you would not actually deploy in production (common).

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your AI video encoding plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the AI Inside the Encoder — Cheat Sheet — One-page reference for the four ML drop-ins (partition, mode, scene-cut, perceptual quant), the encoders that ship them, published BD-rate numbers, and the four vendor-pitch questions.

Related glossary terms


AI in encoding

Banding

Bit allocation

Bitrate

Block

Codec