Hybrid Video Codec Architecture

Why this matters

If you build, buy, or commission video infrastructure, the hybrid architecture is the mental model that makes every other decision tractable. It explains why bitrate ladders work, why CBR and CRF behave differently, why some streams handle motion better than others, and why a hardware encoder costs what it costs. You do not need to write a codec to benefit from understanding the diagram; you need to understand the diagram before you can read a vendor data sheet, scope a transcoding bill, or talk to an encoder engineer without nodding politely at words you do not know.

The one diagram you need

The block-based hybrid codec is the architecture that won. It powers H.261, MPEG-1, MPEG-2, H.263, H.264/AVC, H.265/HEVC, H.266/VVC, VP8, VP9, AV1, and the still-being-finalised AV2. Every standards body has tried other paths — wavelet coders, model-based coders, learned end-to-end neural coders — and every standards body has come back to the same shape, because the shape is hard to beat on rate-distortion versus complexity.

The shape has two ideas glued together. First, the prediction idea: most of a video frame can be guessed from pixels you have already seen, either inside the same frame (the wall next to a doorway looks like the wall already drawn) or in a previous frame (a face moves three pixels to the left, but the face itself is the same). Second, the transform idea: whatever the prediction got wrong, called the residual, lives on a small block of pixels where most of the energy concentrates into a handful of low-frequency coefficients. You then throw away the coefficients the eye is least likely to notice.

That is the hybrid in "hybrid". Prediction handles correlation between samples. Transform handles correlation that remains inside the residual. Compress each in the way it compresses best, instead of trying one tool against the whole signal. The name has been in use since the late 1980s, when the H.261 working group built its recommendation, first approved in 1988, on this design.

Figure 1. The canonical block diagram of a hybrid video encoder. Every codec from MPEG-2 to AV1 fits inside this shape — only the contents of each box change between generations.

The nine stages, walked end to end

A hybrid encoder is nine stages in series, with a feedback loop. Read it left to right. The names are stable across codecs; what differs is how clever the contents of each box are.

Stage 1 — Partitioning

The encoder receives a frame of pixels — say, 1,920 by 1,080 luma samples plus two chroma planes — and chops it into rectangles called blocks. MPEG-2 used a fixed 16×16 macroblock. H.264 added 8×8 and 4×4 sub-blocks. HEVC introduced the Coding Tree Unit, a 64×64 root that recursively splits into smaller squares. AV1 uses a 128×128 superblock that splits into rectangles. The pattern is the same: large blocks for flat regions, small blocks for detail.

Why bother with variable size? Because prediction works better when a block contains one thing instead of three. A 4×4 block on an eyelash and a 64×64 block on a sky both get good predictions; a single 16×16 block straddling both gets a bad one. Variable partitioning lets the encoder choose locally.
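A minimal sketch of that local choice, assuming a simple variance threshold stands in for the real decision (production encoders choose splits through rate-distortion search, not variance). The function name split_block, the threshold of 25, and the toy 16×16 frame are illustrative assumptions, not part of any codec.

```python
from statistics import pvariance

def split_block(frame, x, y, size, min_size=4, var_threshold=25.0):
    """Return a list of (x, y, size) leaf blocks covering the square at (x, y)."""
    samples = [frame[y + j][x + i] for j in range(size) for i in range(size)]
    # Flat region, or already at the smallest allowed size: keep it as one block.
    if size <= min_size or pvariance(samples) <= var_threshold:
        return [(x, y, size)]
    # Otherwise split into four quadrants and recurse (a quad-tree).
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_block(frame, x + dx, y + dy, half, min_size, var_threshold)
    return leaves

# Toy frame: flat "sky" (value 100) with a textured 4x4 patch in the top-left corner.
frame = [[100] * 16 for _ in range(16)]
for j in range(4):
    for i in range(4):
        frame[j][i] = 100 + 40 * ((i + j) % 2)

leaves = split_block(frame, 0, 0, 16)
for leaf in leaves:
    print(leaf)
```

On this toy frame the three flat quadrants stay as 8×8 blocks, while the quadrant containing the textured patch is cut down to 4×4 blocks.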

Stage 2 — Prediction

For each block the encoder picks one of two prediction strategies. Intra-prediction synthesises the block from neighbouring samples in the same frame: copy a row of pixels from the top, slide them down, and you have a guess for everything below. Inter-prediction copies a block from a previously coded frame and shifts it by a motion vector: "this block looks like the block 7 pixels right and 3 pixels down in the frame from 40 milliseconds ago".

The encoder tries many prediction modes, scores each one, and keeps the winner. Modern codecs offer dozens of intra-modes (AV1 has 56 directional angles plus non-directional predictors such as DC, smooth, and Paeth) and dozens of inter-modes with sub-pixel motion vectors. The decoder is told which mode won, so it can reproduce the same prediction.
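The inter-prediction search can be sketched as an exhaustive block match. This is a simplified illustration: it scores candidates with the sum of absolute differences (SAD) at integer-pixel positions only, whereas real encoders use faster search patterns, sub-pixel refinement, and rate-aware costs. The texture function, frame size, and search range are made up for the demo.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block and a shifted reference block."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(size) for i in range(size)
    )

def full_search(cur, ref, bx, by, size, search_range):
    """Exhaustive integer-pel motion search; returns (best_sad, (dx, dy))."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if bx + dx < 0 or bx + dx + size > width:
                continue
            if by + dy < 0 or by + dy + size > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

def texture(x, y):
    """Hypothetical non-repeating texture used to fill the toy frames."""
    return (3 * x * x + 5 * y * y + 7 * x * y) % 256

# The current frame shows the same texture displaced by (2, 1) relative to the reference.
ref = [[texture(x, y) for x in range(16)] for y in range(16)]
cur = [[texture(x + 2, y + 1) for x in range(16)] for y in range(16)]

best_cost, motion_vector = full_search(cur, ref, bx=4, by=4, size=4, search_range=3)
print(best_cost, motion_vector)
```

The motion vector is the only thing the decoder needs in order to repeat the copy; the search that found it stays on the encoder side.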

Stage 3 — Residual

The residual is the prediction error. Subtract the predicted block from the original block, sample by sample. If the prediction was perfect — for a still sky, often it is — the residual is a block of zeros. If the prediction was bad, the residual still has less energy than the original, because the encoder picked the best prediction it could find.

Most of the bits in a compressed video file describe residuals, not predictions. A typical H.264 sequence spends 60 to 80 percent of its bits on quantized residual coefficients, depending on bitrate. The smaller you make the residual through better prediction, the fewer bits you spend.

Stage 4 — Transform

The residual block is fed through a mathematical transform that maps spatial samples into frequency coefficients. The most common is a discrete cosine transform (DCT) approximation. The transform does not throw away information; it rotates the residual so that energy concentrates in the top-left corner, where the low frequencies live. The bottom-right corner — high frequencies — is usually near zero.

This step is invertible. If you knew every coefficient exactly, you could reconstruct the residual exactly. The transform on its own does not compress. It re-organises.
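To make "invertible" and "energy concentrates" concrete, here is a small floating-point sketch of an orthonormal 2-D DCT-II and its inverse. Real codecs use integer approximations of this transform; the 4×4 ramp block is an invented example of a smooth residual.

```python
import math

def dct_basis(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    return [[(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def dct2(block):
    n, b = len(block), dct_basis(len(block))
    return [[sum(b[u][j] * block[j][i] * b[v][i] for j in range(n) for i in range(n))
             for v in range(n)] for u in range(n)]

def idct2(coeffs):
    n, b = len(coeffs), dct_basis(len(coeffs))
    return [[sum(b[u][j] * coeffs[u][v] * b[v][i] for u in range(n) for v in range(n))
             for i in range(n)] for j in range(n)]

# Smooth diagonal ramp: sample value = column index + row index.
block = [[i + j for i in range(4)] for j in range(4)]
coeffs = dct2(block)
restored = idct2(coeffs)

total_energy = sum(c * c for row in coeffs for c in row)
low_freq_energy = sum(coeffs[u][v] ** 2 for u in range(2) for v in range(2))
max_roundtrip_error = max(abs(restored[j][i] - block[j][i])
                          for j in range(4) for i in range(4))

print(f"share of energy in the top-left 2x2: {low_freq_energy / total_energy:.4f}")
print(f"largest round-trip error: {max_roundtrip_error:.2e}")
```

The round trip recovers the block up to floating-point noise, and nearly all of the energy lands in the four lowest-frequency coefficients, which is what makes the next stage effective.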

Stage 5 — Quantization

Quantization is where the codec finally throws information away. Each transform coefficient is divided by a step size and rounded to the nearest integer. A coefficient of 137 divided by step size 8 becomes 17. The decoder will multiply 17 back by 8 and reconstruct 136 instead of 137. The error you just introduced is permanent and is the source of every visible compression artifact in the world.

Step size is controlled by the quantization parameter, usually called QP. Low QP, small step size, high quality, lots of bits. High QP, big step size, low quality, few bits. Every rate-control mode you know — constant bitrate (CBR), variable bitrate (VBR), constant rate factor (CRF) — is a different policy for picking QP block by block, frame by frame.
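The divide-and-round step from the example above, as a minimal sketch. It assumes plain round-to-nearest; real encoders often use a smaller rounding offset (a "dead zone") and integer arithmetic. The function names are illustrative.

```python
def quantize(coeff, step):
    """Divide by the step size and round to the nearest integer level."""
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / step + 0.5)

def dequantize(level, step):
    """What the decoder does: multiply the level back by the step size."""
    return level * step

for step in (8, 32):
    level = quantize(137, step)
    recon = dequantize(level, step)
    print(f"step={step}: level={level}, reconstructed={recon}, error={137 - recon}")
```

A larger step means fewer distinct levels to code and a larger permanent error, which is the whole QP trade-off in two lines of arithmetic.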

Stage 6 — Entropy coding

After quantization, most coefficients are zero or small integers. Entropy coding packs that distribution into the smallest possible bitstream by giving short codes to common symbols and long codes to rare ones. H.264 introduced Context-Adaptive Binary Arithmetic Coding (CABAC), which typically saves roughly 9 to 14 percent of the bits compared with CAVLC, the simpler variable-length coder the standard also offers. AV1 uses a multi-symbol arithmetic coder. The principle is the same: this is the lossless half of the compression budget, taken straight from information theory.

Entropy coding is lossless and reversible. The decoder unpacks exactly the symbols the encoder packed in.
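As a hedged illustration of "short codes for common symbols", the sketch below uses Exp-Golomb codes, a simple variable-length code that H.264 uses for many header syntax elements. It is not CABAC or AV1's arithmetic coder, and the list of levels is invented.

```python
def ue_encode(value):
    """Unsigned Exp-Golomb: (n - 1) zeros followed by the n-bit binary form of value + 1."""
    bits = bin(value + 1)[2:]
    return "0" * (len(bits) - 1) + bits

def se_to_ue(k):
    """Map signed levels 0, 1, -1, 2, -2, ... to unsigned indices 0, 1, 2, 3, 4, ..."""
    return 2 * k - 1 if k > 0 else -2 * k

def ue_to_se(u):
    return (u + 1) // 2 if u % 2 else -(u // 2)

def encode(levels):
    return "".join(ue_encode(se_to_ue(k)) for k in levels)

def decode(bitstring):
    levels, pos = [], 0
    while pos < len(bitstring):
        zeros = 0
        while bitstring[pos] == "0":   # count the zero prefix
            zeros += 1
            pos += 1
        value = int(bitstring[pos:pos + zeros + 1], 2) - 1
        pos += zeros + 1
        levels.append(ue_to_se(value))
    return levels

levels = [0, 0, 1, 0, -1, 0, 0, 3, 0, 0]   # mostly zeros, as after quantization
bitstring = encode(levels)
print(bitstring, len(bitstring), "bits versus", 8 * len(levels), "at a fixed 8 bits each")
```

Zero costs a single bit and larger magnitudes cost progressively more, and decoding recovers the original list exactly.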

Stage 7 — Inverse quantization and inverse transform (inside the encoder)

This is the part that surprises everyone reading the diagram for the first time. The encoder runs the decoder in parallel inside itself. It takes the quantized coefficients, multiplies them back by the step size (inverse quantization), runs the inverse transform, and adds the result to the prediction. The encoder now holds a reconstructed block that is bit-identical to what the decoder will compute on the receiving end.

Why bother? Because the next block over may want to predict from this reconstructed block, and if the encoder predicts from the original pixels while the decoder predicts from the reconstructed pixels, the two will drift apart. The encoder must use what the decoder will see. That feedback loop, sometimes called the reconstruction loop or the coding loop, is what stops compression error from compounding across thousands of frames.
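Drift is easy to reproduce in one dimension. The sketch below is a toy DPCM coder (each sample predicted from the previous one), not a video codec; the ramp signal and the step size of 8 are arbitrary assumptions. The open-loop encoder predicts from original samples, the closed-loop encoder predicts from what the decoder will actually hold.

```python
def quantize_residual(value, step):
    """Quantize then dequantize: the residual as the decoder will see it."""
    sign = -1 if value < 0 else 1
    return sign * int(abs(value) / step + 0.5) * step

def encode(samples, step, closed_loop):
    residuals = []
    prev_original = 0   # last source sample (what an open-loop encoder predicts from)
    prev_recon = 0      # last reconstructed sample (what the decoder actually holds)
    for s in samples:
        prediction = prev_recon if closed_loop else prev_original
        r = quantize_residual(s - prediction, step)
        residuals.append(r)
        prev_original = s
        prev_recon += r   # mirror the decoder, which always predicts from its own output
    return residuals

def decode(residuals):
    recon, prev = [], 0
    for r in residuals:
        prev += r
        recon.append(prev)
    return recon

samples = [3 * (n + 1) for n in range(100)]   # slow ramp: 3, 6, ..., 300
step = 8

def max_error(closed_loop):
    recon = decode(encode(samples, step, closed_loop))
    return max(abs(s - r) for s, r in zip(samples, recon))

open_loop_error = max_error(closed_loop=False)
closed_loop_error = max_error(closed_loop=True)
print("open loop:", open_loop_error, "closed loop:", closed_loop_error)
```

In the open-loop case every per-sample residual rounds to zero and the decoder never moves, so the error keeps growing; in the closed-loop case the error stays within half a quantization step.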

Stage 8 — In-loop filtering

The reconstructed blocks have visible seams along their edges, called blocking artifacts, because each block was quantized independently. A deblocking filter smooths those seams. HEVC added a Sample Adaptive Offset (SAO) filter that nudges pixel values to undo bias. AV1 added the Constrained Directional Enhancement Filter (CDEF) and a self-guided Loop Restoration filter. VVC added the Adaptive Loop Filter (ALF). These filters all run inside the loop — on both encoder and decoder — so the next frame predicts from filtered pixels.

In-loop filters do not magically make video look better; they make the reference that future frames predict from look better, which compounds across a Group of Pictures (GOP) and reduces overall bitrate at matched quality by roughly 5 to 15 percent depending on content.
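The core idea of a deblocking filter can be sketched on a single row of samples: soften a small step at a block boundary, because it is probably a quantization seam, and leave a large step alone, because it is probably a real edge. This is a toy rule with an assumed threshold and filter strength, not the filter defined by H.264 or any other standard.

```python
def deblock_row(row, boundary, threshold):
    """Soften the step between row[boundary - 1] and row[boundary] if it looks like a seam."""
    out = list(row)
    p0, q0 = row[boundary - 1], row[boundary]
    step = q0 - p0
    if 0 < abs(step) < threshold:
        # Small step: pull both neighbours a quarter of the way toward each other.
        delta = step / 4
        out[boundary - 1] = p0 + delta
        out[boundary] = q0 - delta
    return out

seam = [100] * 4 + [108] * 4        # two flat blocks that quantized to slightly different levels
real_edge = [100] * 4 + [200] * 4   # genuine edge in the picture

print(deblock_row(seam, boundary=4, threshold=16))
print(deblock_row(real_edge, boundary=4, threshold=16))
```

Because the same rule runs in both encoder and decoder, the smoothed samples are what later frames predict from.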

Stage 9 — Reference frame buffer

Filtered reconstructed frames are stored in a Decoded Picture Buffer, the codec's short-term memory. Inter-prediction in future frames will reach back into this buffer to find a block that matches. H.264 and HEVC allow up to 16 reference frames; AV1 keeps 8 reference slots, of which a single frame can use up to 7. The buffer is bounded; older frames get evicted.

The buffer plus the prediction stage is the architecture's superpower. A scene of someone talking against a still background only needs to send the bits for the head moving, frame after frame, because every other block can be predicted as "the same as the block in the previous frame", residual near zero.
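A bounded buffer with eviction can be sketched in a few lines. This assumes the simplest possible policy, first in, first out; real codecs signal reference management explicitly in the bitstream. The class and method names are hypothetical.

```python
from collections import deque

class DecodedPictureBuffer:
    """Bounded store of reconstructed frames; the oldest frame is evicted first."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)   # deque drops the oldest entry when full

    def store(self, frame_id, pixels):
        self._frames.append((frame_id, pixels))

    def reference(self, frames_back=1):
        """Return (frame_id, pixels) for the frame stored `frames_back` insertions ago."""
        return self._frames[-frames_back]

    def frame_ids(self):
        return [frame_id for frame_id, _ in self._frames]

dpb = DecodedPictureBuffer(capacity=3)
for frame_id in range(5):
    dpb.store(frame_id, pixels=[[frame_id]])   # placeholder pixel data

print(dpb.frame_ids())        # frames still available as references
print(dpb.reference(1)[0])    # most recent frame
```

Encoder and decoder must apply the same policy, otherwise a motion vector could point at a frame the decoder has already discarded.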

A concrete numeric example

Take a single 8×8 luma block from a flat blue sky. After intra-prediction the residual is mostly small numbers: a scattering of +1 and −1 values caused by camera sensor noise, the rest zeros. We will work through the loop with rounded numbers.

Residual (after intra-prediction subtraction), 8×8 block:

 1  0  0 -1  0  0  1  0
 0  1 -1  0  0  1  0  0
 0  0  1  0 -1  0  0  1
-1  0  0  0  0  0  1  0
 0  0  1  0  0 -1  0  0
 1  0  0 -1  0  0  0  1
 0  1  0  0  1  0 -1  0
 0  0  0  1  0  0  0  0

Sum of squared values across the 64 samples = 20. Mean energy per sample ≈ 0.31.

The encoder runs a forward 2-D DCT. Because the residual is near-random noise, most DCT coefficients are small but non-zero. The DC coefficient (top-left) is proportional to the block's mean: the 64 samples sum to 6, so with orthonormal scaling the DC sits at +0.75. The remaining 63 AC coefficients share less than 20 units of energy between them, so none can exceed about 4.5 in magnitude and most sit inside ±1.

Now quantize. With QP set to 27, a typical mid-bitrate H.264 quantization step for AC coefficients is about 14 (the relationship is Qstep ≈ 2^((QP−4)/6), so Qstep(27) ≈ 14.3). Every coefficient is well below half the step, so each one divides by 14, rounds, and becomes 0. The DC at +0.75 also rounds to 0 because absolute values below half the step go to zero.

Quantized block: all 64 entries are 0.

Entropy coding represents 64 zeros with a tiny end-of-block (or "no coefficients") signal plus a bit or two for the prediction mode. The total cost is about 6 bits.

A naïve uncompressed encoding of that same 8-bit 8×8 luma block would have cost 64 × 8 = 512 bits. We compressed it by a factor of about 85:1, and on the decoder side it will reconstruct as a perfectly flat block — visually identical to the source because the human eye was never going to see those sub-quantum variations anyway.
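The arithmetic for the flat-sky block can be checked with a short script. This is a sketch under stated assumptions: a floating-point orthonormal DCT instead of H.264's integer transform, the approximate Qstep formula quoted in the text, and plain round-to-nearest quantization. It prints the block energy, the DC coefficient, the step size at QP 27, and how many quantized levels survive.

```python
import math

RESIDUAL = [
    [ 1, 0,  0, -1,  0,  0,  1, 0],
    [ 0, 1, -1,  0,  0,  1,  0, 0],
    [ 0, 0,  1,  0, -1,  0,  0, 1],
    [-1, 0,  0,  0,  0,  0,  1, 0],
    [ 0, 0,  1,  0,  0, -1,  0, 0],
    [ 1, 0,  0, -1,  0,  0,  0, 1],
    [ 0, 1,  0,  0,  1,  0, -1, 0],
    [ 0, 0,  0,  1,  0,  0,  0, 0],
]

def dct2(block):
    """Orthonormal 2-D DCT-II (floating point, for illustration only)."""
    n = len(block)
    basis = [[(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
              * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i in range(n)] for k in range(n)]
    return [[sum(basis[u][j] * block[j][i] * basis[v][i]
                 for j in range(n) for i in range(n))
             for v in range(n)] for u in range(n)]

def qstep(qp):
    """Approximate H.264 step size from the text: doubles every 6 QP."""
    return 2 ** ((qp - 4) / 6)

def quantize(coeff, step):
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / step + 0.5)

energy = sum(v * v for row in RESIDUAL for v in row)
coeffs = dct2(RESIDUAL)
step = qstep(27)
levels = [[quantize(c, step) for c in row] for row in coeffs]
nonzero = sum(1 for row in levels for lv in row if lv != 0)

print(f"energy={energy}, DC={coeffs[0][0]:.2f}, "
      f"Qstep(27)={step:.2f}, nonzero levels={nonzero}")
```

Lowering the QP passed to qstep shows the other side of the trade-off: at a small enough step size some of the noise coefficients start to survive and cost bits.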

Now imagine the same block in detailed grass. The residual after prediction is far from zero. After DCT the energy spreads across many coefficients. After quantization at QP 27, perhaps 12 to 20 coefficients survive as non-zero integers. Entropy coding spends maybe 60 to 90 bits. Compression ratio: about 6:1 to 8:1. Same architecture, same QP, very different bit cost — because the content was harder.

This is why bitrate fluctuates frame by frame in VBR and why per-title encoding works: the architecture spends bits where the content has bits to spend.

The decoder is the encoder, minus four boxes

A decoder is simpler than an encoder. It does not search prediction modes; it is told which one to use. It does not pick QP; it reads it from the bitstream. It does not run rate control or motion estimation. The decoder just reverses the bottom half of the encoder's loop — entropy decode, dequantize, inverse transform, add prediction, in-loop filter — and writes the result into the same Decoded Picture Buffer the encoder maintains.
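The per-block work of a decoder can be sketched as a single function: dequantize, inverse transform, add the prediction, clip. This is a simplified floating-point illustration with an assumed flat prediction and a single non-zero level; entropy decoding and in-loop filtering are left out, and the names are hypothetical.

```python
import math

def idct2(coeffs):
    """Inverse of an orthonormal 2-D DCT-II (floating point, for illustration only)."""
    n = len(coeffs)
    basis = [[(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
              * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i in range(n)] for k in range(n)]
    return [[sum(basis[u][j] * coeffs[u][v] * basis[v][i]
                 for u in range(n) for v in range(n))
             for i in range(n)] for j in range(n)]

def decode_block(levels, step, prediction):
    """Dequantize, inverse-transform, add the prediction, clip to the 8-bit range."""
    coeffs = [[lv * step for lv in row] for row in levels]
    residual = idct2(coeffs)
    return [[min(255, max(0, round(p + r))) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]

prediction = [[120] * 4 for _ in range(4)]            # prediction the bitstream told us to use
dc_only = [[2, 0, 0, 0]] + [[0] * 4 for _ in range(3)]
all_zero = [[0] * 4 for _ in range(4)]

print(decode_block(dc_only, step=10, prediction=prediction))
print(decode_block(all_zero, step=10, prediction=prediction))
```

There is no search anywhere in this path: every input is read from the bitstream or from previously decoded samples, which is why decoders are cheap.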

[Image: block diagram of a hybrid video decoder showing entropy decoder, inverse quantizer, inverse transform, adder fed by the same prediction unit type as the encoder, in-loop filter, and decoded picture buffer feeding back into prediction.]

Figure 2. The decoder is the encoder's lower path. No mode decision, no rate control, no motion estimation. Simpler, faster, easier to put on a phone.

The asymmetry is on purpose. A live encoder may run at 1× real time on a heavy server-class GPU or a NETINT Quadra ASIC. The decoder running on a $20 set-top box must keep up with that same stream. Standards bodies therefore keep decoder complexity on a tight budget at every generation and let encoder complexity grow much faster. AV1's reference encoder is roughly 100 times slower than its decoder; HEVC's reference encoder is roughly 50 times slower. This gap is structural, not a bug.

Why every modern codec uses the same shape

The hybrid block-based architecture became the default because the alternatives lost in real-world rate-distortion testing.

Wavelet coders (Motion JPEG 2000, Dirac, SMPTE VC-2) handle smooth gradients well and avoid blocking artifacts, but their motion-compensation story is awkward — wavelets are not naturally block-aligned, and inter-frame coding has to fight the transform. Wavelet coders survive in production niches (broadcast contribution at very high bitrates), not in delivery.

Model-based coders describe a scene as objects with parameters — a head model, a lighting model, a body pose. At extreme low bitrates they outperform hybrid coders for video calls. Outside that niche they collapse when the scene contains anything the model has not been trained on.

End-to-end neural codecs replace the entire pipeline with a learned autoencoder, often a variational autoencoder with hyperprior entropy coding. Recent papers (Mentzer et al. at Google, Yang et al., Lu et al.) match HEVC or even VVC on PSNR and beat it on subjective MOS in controlled evaluations. They lose on three practical axes that matter to deployment: deterministic decoder behaviour across implementations, computational cost on existing silicon, and predictable failure modes on unseen content. They are the most exciting research direction in the field and not yet a credible replacement for the hybrid architecture in commercial deployment.

The hybrid architecture, meanwhile, has cut the typical bitrate for 1080p30 from 8–15 Mbps in MPEG-2 to roughly 0.8–1.6 Mbps in AV1, about a tenfold improvement in under 25 years, while the diagram itself has not changed. Each generation refined the contents of the boxes.

How each codec refines the boxes

Here is the same diagram, four times, across the codec generations most people deploy. The shape is identical. What changed?

Stage	MPEG-2 (1994)	H.264 / AVC (2003)	H.265 / HEVC (2013)	AV1 (2018)
Partition	Fixed 16×16 macroblock	16×16 macroblock + 8×8, 4×4 sub-blocks	CTU 64×64 → quad-tree to 8×8	Superblock 128×128 → rect + AB partitions
Intra-pred	DC only (I-frame)	9 modes (4×4), 4 modes (16×16)	35 modes, angular	56 directional angles + DC, smooth, Paeth, CfL
Inter-pred	Half-pel, 1 ref	Quarter-pel, up to 16 refs, B-frames as refs	Quarter-pel, up to 16 refs, weighted prediction	Eighth-pel, 7 refs, compound prediction, warped motion, OBMC
Transform	8×8 DCT	4×4 / 8×8 integer DCT	4×4 to 32×32 DCT + DST	4×4 to 64×64 DCT/ADST/flipADST/Identity
Quantization	Linear, frequency-weighted	Linear, frequency-weighted matrix	Same, with custom matrices	Linear, broad range, per-plane
Entropy	VLC	CAVLC or CABAC	CABAC only	Multi-symbol arithmetic with explicit probability updates
In-loop filter	None	Deblocking	Deblocking + SAO	Deblocking + CDEF + Loop Restoration
Reference buffer	1 or 2	Up to 16	Up to 16	Up to 8 (with versatile reference management)
Bitrate, 1080p30 typical	8–15 Mbps	3–5 Mbps	1.5–3 Mbps	0.8–1.6 Mbps

The pattern is consistent. Each generation adds finer block partitioning, more prediction modes, more transform options, smarter in-loop filters, and richer entropy contexts. Decoder complexity grows linearly with these additions. Encoder complexity grows much faster because mode decision must search a larger space. Bitrate, at matched quality, halves roughly every 7 to 10 years.

[Image: side-by-side comparison of MPEG-2, H.264, HEVC, and AV1 showing the same nine-stage block diagram with the contents of each stage labelled. Stage shapes are identical across columns; cell text differs to reflect the increasing toolset per generation.]

Figure 3. The shape never moved. The contents of the boxes got richer.

The encoder's secret: rate-distortion optimisation

So far we have described the architecture without saying how the encoder actually picks which prediction mode, which block size, which transform, which QP. The mechanism is called Rate-Distortion Optimisation (RDO), and it is the place modern encoders spend most of their CPU.

For each block the encoder considers many combinations of choices. For each candidate it estimates the cost in bits (rate, R) and the quality loss in some distortion metric (distortion, D), then computes a Lagrangian cost:

J = D + λ · R

Here λ (lambda) is a Lagrange multiplier the encoder picks based on the target QP. Low QP → small λ → distortion is the dominant term → encoder picks high-quality modes. High QP → large λ → bits dominate → encoder picks cheap modes.
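The effect of λ on the winner can be sketched with three candidate modes. The distortion and rate figures below are invented for illustration; a real encoder measures them by actually coding each candidate or by estimating the cost.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """J = D + lambda * R."""
    return distortion + lam * rate_bits

def choose_mode(candidates, lam):
    """candidates: list of (name, distortion, rate_bits). Returns the name with the lowest J."""
    return min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))[0]

# Hypothetical candidates for one block: (mode name, distortion, bits).
candidates = [
    ("skip",        400.0,   1),
    ("inter_16x16", 120.0,  40),
    ("intra_4x4",    30.0, 180),
]

for lam in (0.5, 2.0, 20.0):
    print(f"lambda={lam}: {choose_mode(candidates, lam)}")
```

With a small λ the expensive, accurate mode wins; as λ grows the choice moves to the cheaper modes, ending at skip. Nothing about the candidates changed, only the exchange rate between bits and distortion.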

A typical H.264 encoder may evaluate 50 to 200 mode candidates per 16×16 macroblock; an HEVC encoder, 500 to 2,000 per 64×64 CTU; an AV1 encoder, 5,000 to 50,000 per 128×128 superblock. The vast majority of an encoder's runtime is spent inside this search. This is why a software AV1 encoder such as libaom at cpu-used=0 runs at roughly 0.01× real time on a fast CPU, and why SVT-AV1 at preset 13 is on the order of 1,000× faster but 30 to 50 percent less efficient on bitrate at the same quality.

The whole point of the hybrid architecture, from an engineering perspective, is that it gives the encoder a well-defined search space where each tool has predictable behaviour. You can swap heuristics in and out, tune for content type, accelerate on a GPU or ASIC, and you still produce a standard-compliant bitstream that any decoder can play.

Common mistake: confusing the architecture with the standard

The diagram above is the architecture. The standard — H.264, H.265, AV1, whatever — is the bitstream syntax that a compliant decoder must accept. Two things follow from that distinction.

First, the standard does not tell you how to build an encoder. H.264 specifies what a bitstream looks like; it does not specify how the encoder must decide which modes to pick. This is why different H.264 encoders (x264, OpenH264, Apple's VideoToolbox, NVIDIA NVENC) produce wildly different files for the same input at the same target bitrate, even though they are all "H.264". The architecture they use internally is the hybrid loop; the policy they apply inside the boxes differs.

Second, you can ship a new encoder for an old standard and still ship gains. x264 in 2026 is roughly 20 percent more bit-efficient at matched quality than x264 in 2014, with no change to the H.264 specification. The encoder learned to use the standard's tools more cleverly. This is also why SVT-AV1 (exposed in ffmpeg as libsvtav1) picked up large efficiency gains between 2022 and 2025 without a new standard: the gains came from encoder mode-decision refinements alone.

Where Fora Soft fits in

We build video pipelines for video streaming, OTT, video surveillance, video conferencing, e-learning, and telemedicine. In every one of those verticals, an architectural understanding of how codecs work decides what your bill, your latency, and your viewing quality look like. We pick codecs and tune encoder presets based on the content (conferencing favours low-latency, low-complexity profiles; OTT favours high-efficiency multi-pass), and we lean on hardware acceleration when the encoder cost would otherwise dominate the unit economics. Surveillance specifically benefits from intra-refresh patterns inside the hybrid loop that conferencing tools also use to recover from packet loss.

A practical reading order for the rest of Block 4

This article is the map. The remaining articles in Block 4 zoom into individual boxes:

Intra-frame coding walks the prediction stage when there is no previous frame to lean on.
Inter-frame coding and motion estimation walks the prediction stage when there is.
GOP structure: I, P, B-frames, open vs closed GOP explains how the reference frame buffer is organised in time.
Block-based prediction: MBs, CTUs, SBs and superblocks zooms into the partitioning stage.
Transform coding and Quantization cover the two stages where lossy compression actually happens.
Entropy coding in detail explains the lossless final pack.
In-loop filtering covers the post-reconstruction smoothing.
Mode decision and RDO and Rate control are the policy layer on top of the architecture.

Read them in any order once you have the diagram from this article in your head.

References

Sullivan, G. J. and Wiegand, T. Video Compression — From Concepts to the H.264/AVC Standard. Proceedings of the IEEE, 93(1), 18–31, 2005. https://ieeexplore.ieee.org/document/1369691
Wiegand, T., Sullivan, G. J., Bjøntegaard, G. and Luthra, A. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), 560–576, 2003. https://ieeexplore.ieee.org/document/1218189
Sullivan, G. J., Ohm, J.-R., Han, W.-J. and Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE TCSVT, 22(12), 1649–1668, 2012. https://ieeexplore.ieee.org/document/6316136
Bross, B. et al. Overview of the Versatile Video Coding (VVC) Standard and its Applications. IEEE TCSVT, 31(10), 3736–3764, 2021. https://ieeexplore.ieee.org/document/9503377
Chen, Y. et al. An Overview of the Coding Tools in AV1: the First Video Codec from the Alliance for Open Media. APSIPA Transactions on Signal and Information Processing, 9, e6, 2020. https://www.nowpublishers.com/article/Details/SIP-2020-0014
ITU-T Recommendation H.261 (1988). Video codec for audiovisual services at p × 64 kbit/s. https://www.itu.int/rec/T-REC-H.261
ITU-T Recommendation H.264 (V14, August 2021). Advanced video coding for generic audiovisual services. https://www.itu.int/rec/T-REC-H.264
ITU-T Recommendation H.265 (V8, 2024). High efficiency video coding. https://www.itu.int/rec/T-REC-H.265
ITU-T Recommendation H.266 (V3, 2023). Versatile video coding. https://www.itu.int/rec/T-REC-H.266
Alliance for Open Media. AV1 Bitstream and Decoding Process Specification, v1.0.0 Errata 1. https://aomediacodec.github.io/av1-spec/
Wien, M. High Efficiency Video Coding — Coding Tools and Specification. Springer, 2015. ISBN 978-3-662-44275-3.
Richardson, I. E. The H.264 Advanced Video Compression Standard. Wiley, 2nd ed., 2010. ISBN 978-0-470-51692-8.
Bjøntegaard, G. Calculation of Average PSNR Differences Between RD Curves. ITU-T VCEG-M33, Austin, 2001. https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc
Ohm, J.-R. et al. Comparison of the Coding Efficiency of Video Coding Standards — Including High Efficiency Video Coding (HEVC). IEEE TCSVT, 22(12), 1669–1684, 2012. https://ieeexplore.ieee.org/document/6317156
Netflix Technology Blog. Per-Title Encode Optimization, 2015 (still the canonical statement of per-title bitrate ladders on a hybrid encoder). https://netflixtechblog.com/per-title-encode-optimization-7e99442b62a2
Bross, B., Wang, Y.-K., Ye, Y., Liu, S., Chen, J., Sullivan, G. J. and Ohm, J.-R. Developments in International Video Coding Standardization After AVC, With an Overview of Versatile Video Coding (VVC). Proceedings of the IEEE, 109(9), 1463–1493, 2021. https://ieeexplore.ieee.org/document/9499317
Mentzer, F., Toderici, G., Minnen, D. et al. Neural Video Compression — A Survey. Foundations and Trends in Signal Processing, 16(3), 2024. (Context on neural alternatives to the hybrid architecture.)
Han, J. et al. A Technical Overview of AV1. Proceedings of the IEEE, 109(9), 1435–1462, 2021. https://ieeexplore.ieee.org/document/9363033
Bitmovin Video Developer Report 2025. https://bitmovin.com/video-developer-report-2025/
NETINT State of Video Encoding 2025 white paper. https://netint.com/
