Published 2026-05-17 · 18 min read · By Nikolay Sapunov, CEO at Fora Soft

Why this matters

If you run a video service, the single biggest reason a 4K stream now costs half what it did in 2019 is the transform stage of the codec — not the network, not the player, not the screens. A product manager who can explain "the encoder spreads the error across a few large coefficients and throws away the small ones" can read a codec datasheet and ask the right question: what transform sizes does it use, what transform types, is there a secondary transform. A founder who can challenge a vendor's "30% bitrate saving" claim by asking which transforms they tested on which content saves a quarter on the storage and CDN bill. A technical lead who understands why VVC uses DCT-II, DCT-VIII, and DST-VII can build a transcoder that beats a vendor box at half the cost.

The block that needs to leave the encoder

Every modern codec works on small rectangles of pixels — we covered them in our article on block-based prediction. For each block, the encoder first guesses the pixels from neighbours inside the same frame or from a similar block in an earlier frame; we covered those steps in intra-frame coding and inter-frame coding and motion estimation. The guess is almost never perfect. The pixel-by-pixel difference between the guess and the truth is called the residual.

The residual is a small image of mistakes. On a typical sports clip, the residual block has values clustered around zero with occasional spikes wherever the guess missed an edge or a moving object. Sending the residual as raw numbers is wasteful — most of its energy lives in slow, smooth gradients, and a handful of coefficients can describe that energy if you change your point of view from "pixels on a grid" to "weights of pre-defined patterns".

That change of point of view is what a transform does. It takes the residual block and re-expresses it as a sum of fixed wave-shaped patterns called basis functions. Each basis function has a fixed frequency — slow, medium, fast — and a fixed direction — flat, vertical, diagonal. The output of the transform is one number per basis function, telling the encoder "how much" of that pattern is in the block. We call those numbers coefficients.

The trick is that real-world residuals look mostly like a few low-frequency patterns plus a long tail of near-zero high-frequency patterns. A well-chosen transform pushes nearly all of the energy into the top-left corner of the coefficient grid — the low-frequency, smooth-pattern corner — and leaves the rest of the grid full of small numbers that quantization (covered in the next article in this block, quantization) can throw away with little visible damage. This concentration of energy into a few coefficients is called energy compaction, and it is the entire reason transforms exist.

Diagram showing a residual block of 4 by 4 pixels with mixed values, transformed into a coefficient block where almost all the energy is in the top-left corner and the rest is near zero. An arrow labelled 'transform' connects them; a second arrow labelled 'quantize and code' leaves from the coefficient block toward a stylised bitstream.

Figure 1. The transform takes a pixel-domain residual and re-expresses it as transform coefficients. Energy concentrates in the top-left corner, which is what makes the next stage, quantization, cheap.

What the DCT actually is

The Discrete Cosine Transform, or DCT, is a recipe for breaking any block of numbers into a sum of cosine waves at fixed frequencies. Nasir Ahmed, an electrical engineer then at Kansas State University, proposed the idea in 1972; he, his doctoral student T. Raj Natarajan, and his colleague K. R. Rao of the University of Texas at Arlington published the formal algorithm in a January 1974 paper. The first video application followed in 1975, when John A. Roese and Guner S. Robinson used the DCT inside a motion-compensated inter-frame coder and reported it could squeeze image data down to 0.25 bits per pixel. Fifty years later, that pairing of motion compensation with a DCT-style transform still shapes the design of every mainstream codec on the internet.

The DCT works one row at a time and then one column at a time. For a 4×4 block, the transform writes each row as a weighted sum of four cosine waves: one wave that does not change at all (the average), one wave that goes from positive to negative across the row (the first cosine), one wave with two zero crossings, and one wave with three. The same recipe runs on columns. The output is a 4×4 grid of coefficients where the top-left number is the DC coefficient — the average pixel value, named after the direct-current term in electrical engineering — and the rest are AC coefficients at progressively higher frequencies.
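For readers who prefer symbols, the recipe above can be written in one common orthonormal form. The names x_n (input samples), X_k (coefficients), and α_k (scale factors) are notation introduced here for illustration; real codecs use scaled integer approximations of these values rather than this exact formula.

```latex
X_k = \alpha_k \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{\pi\,(2n+1)\,k}{2N}\right),
\qquad
\alpha_0 = \sqrt{\tfrac{1}{N}}, \quad
\alpha_k = \sqrt{\tfrac{2}{N}} \ \ (k \ge 1)

% Separable 2D form for an N x N block X, with C_{k,n} = \alpha_k \cos(\pi (2n+1) k / 2N):
Y = C \, X \, C^{\mathsf{T}}
```

With N = 4, k = 0 gives the flat wave (the DC term), and k = 1, 2, 3 give the waves with one, two, and three zero crossings.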

Why cosines and not some other set of waves? Because for the kinds of signals you find in natural images and video residuals, where neighbouring samples are strongly correlated, cosines come very close to the best possible transform. That ideal is the Karhunen-Loève Transform, or KLT, which is provably optimal for energy compaction but is derived from the statistics of the signal itself, so it would have to be re-computed as the content changes, which is too slow for a video encoder. The DCT is a fixed substitute that is good enough for natural content and cheap enough to run on every block of every frame on a phone.

The DCT comes in eight numbered variants in the mathematical literature, called DCT-I through DCT-VIII. The variant used in video is DCT-II, the same recipe JPEG has used since 1992. Modern codecs add a second variant, DCT-VIII, which we'll meet later when we get to VVC.

Diagram showing 16 small 4 by 4 thumbnails arranged in a 4 by 4 grid. Each thumbnail visualises one DCT basis function: top-left is a flat grey square, top-right is a horizontal striped pattern, bottom-left is a vertical striped pattern, bottom-right is a fine checker pattern. Frequency increases left to right and top to bottom. A caption underneath reads 'Any 4 by 4 block of pixels can be written as a weighted sum of these 16 patterns.'

Figure 2. The 16 basis functions of a 4×4 DCT-II. Every residual block is expressed as a weighted sum of these 16 patterns; the weights are the coefficients the codec actually transmits.

A worked example — one block, one transform

To make the math visible, take a 4×4 residual block where every value is 8:

8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8

The block has no detail — only an average. If we feed it through a 4×4 DCT-II, only one coefficient is non-zero: the DC coefficient, with a value proportional to the block average times the block size. Every other coefficient is exactly zero. We started with 16 numbers of equal weight and ended with one number that carries all the energy. The encoder transmits one number (a few bits), the decoder rebuilds the block, and the picture is exact.

Now mix the residual with a vertical edge:

 8  8 -8 -8
 8  8 -8 -8
 8  8 -8 -8
 8  8 -8 -8

The DCT-II puts all the energy into two coefficients. The DC term is now zero because the average is zero. The coefficient at row 0, column 1 (first row, second column, the horizontal-frequency-one pattern) carries most of the edge, and the coefficient at row 0, column 3 carries the remainder, because a hard step is sharper than a single cosine. That makes two non-zero coefficients out of sixteen; the other fourteen are exactly zero, so the quantizer has almost nothing to code.

Now a noisy block — the residual after a missed motion prediction:

 4 -3  2  1
-2  5 -1  3
 1 -2  4 -1
-3  2 -1  5

The DCT-II output spreads across many coefficients, with the largest ones still clustered in the top-left. Energy compaction is weaker for noise than for smooth signal — which is exactly the right behaviour, because random noise has nowhere to hide and the encoder has to spend bits.

The reason every codec uses the DCT for the "default" transform is that natural-image residuals look like the first two examples far more often than they look like the third.
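The three blocks above can be checked with a few lines of Python. This is a minimal sketch that assumes an orthonormal floating-point DCT-II (real codecs use integer approximations); the function names are ours, chosen for illustration.

```python
import math

def dct2_matrix(n):
    """Orthonormal DCT-II matrix: row k holds the k-th cosine basis function."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct2_2d(block):
    """Separable 2D DCT-II, computed as C * X * C^T."""
    c = dct2_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def significant(coeffs, threshold=1e-9):
    """(row, column) positions whose magnitude is above floating-point noise."""
    return [(r, c) for r, row in enumerate(coeffs)
            for c, v in enumerate(row) if abs(v) > threshold]

flat = [[8] * 4 for _ in range(4)]
edge = [[8, 8, -8, -8] for _ in range(4)]
noisy = [[4, -3, 2, 1], [-2, 5, -1, 3], [1, -2, 4, -1], [-3, 2, -1, 5]]

for name, block in [("flat", flat), ("edge", edge), ("noisy", noisy)]:
    coeffs = dct2_2d(block)
    print(name, "non-zero coefficients at:", significant(coeffs))
```

With this scaling the flat block collapses to a single DC value of 32, and the edge block lands entirely in row 0 at columns 1 and 3. The noisy block spreads across most of the grid, although the total energy (the sum of squares) is the same before and after the transform.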

Integer transforms — why decoders never disagree

A textbook DCT-II uses irrational cosine values, and implementations that approximate them in floating-point or fixed-point math are not bit-exact across different hardware. A phone, a TV, and a browser running on a laptop must all decode the same bitstream to the same pixels, frame after frame, for hours. If any of them computed the inverse DCT slightly differently, even at the last decimal, the small error would accumulate frame after frame and the decoders would drift apart. This was a real problem in early MPEG codecs, known as IDCT mismatch or inverse transform drift, and it caused colour banding and ghosting on long sequences.

H.264, finalised in 2003, fixed it for good by replacing the floating-point DCT with an integer transform: a scaled approximation of the DCT-II that uses only addition, subtraction, and bit shifts. The matrix entries are small whole numbers (mostly 1 and 2) chosen so the transform behaves like the real DCT-II to within a fraction of a coefficient, but with every step exactly defined in integer arithmetic. The price is a slight loss of energy compaction (the integer transform is about 0.1 dB worse than a true DCT-II on average); the gain is that every conforming decoder produces identical output, bit for bit.

H.264's transform layer specifies four pieces. The 4×4 core transform is the integer DCT-II approximation we just described. The 4×4 Hadamard transform uses only add and subtract — even simpler than the core — and is applied to the DC coefficients of 16×16 intra-predicted luma blocks to squeeze one more percent out of flat regions. The 2×2 Hadamard transform does the same for chroma DC. And the 8×8 integer transform, added in the High profile in 2005, is a larger version of the core transform for blocks with smoother content where 4×4 is too small to be efficient.

A second design choice in H.264 is that the transform is split into a core part and a scaling part. The core part runs at encode and decode; the scaling part is folded into the quantiser. This split keeps both the encoder and decoder simple and is why H.264 transforms can be done with just adds, subtracts, and shifts — multiplications are absorbed elsewhere.
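The core-plus-scaling split can be sketched in Python. The matrix `CF` is the well-known H.264 4×4 forward core matrix; the function names and the floating-point `scale_to_dct` helper are ours, added only to show what the quantiser-side scaling achieves. A real encoder never performs that division separately.

```python
import math

# Forward core matrix of the H.264 4x4 integer transform.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def core_1d(x0, x1, x2, x3):
    """One 4-point pass in butterfly form: adds, subtracts and shifts only."""
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

def forward_core(block):
    """Rows first, then columns; equals CF * X * CF^T with integer output."""
    rows = [core_1d(*row) for row in block]
    cols = [core_1d(*col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def forward_core_matrix(block):
    """Reference version written as explicit matrix products."""
    tmp = [[sum(CF[u][i] * block[i][j] for i in range(4)) for j in range(4)]
           for u in range(4)]
    return [[sum(tmp[u][j] * CF[v][j] for j in range(4)) for v in range(4)]
            for u in range(4)]

def scale_to_dct(core):
    """Illustrative post-scale (folded into the quantiser in a real codec)."""
    norm = [math.sqrt(sum(v * v for v in row)) for row in CF]  # 2, sqrt(10), 2, sqrt(10)
    return [[core[u][v] / (norm[u] * norm[v]) for v in range(4)]
            for u in range(4)]

edge = [[8, 8, -8, -8] for _ in range(4)]
core = forward_core(edge)
print(core[0])                                   # [0, 192, 0, -64]
print([round(v, 2) for v in scale_to_dct(core)[0]])
```

The forward pass doubles values with a left shift; the matching inverse transform halves them with a right shift. After scaling, the edge block gives coefficients close to, but not identical to, those of the true DCT-II, which is the small compaction loss described above.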

Every codec since H.264 has followed the same pattern: define integer matrices, prove the inverse round-trips exactly, fold the scaling into the quantiser. The numbers in the matrices differ, but the discipline is universal.

Pipeline diagram, left to right. Stage 1: a 4 by 4 grid of pixel residual values. Stage 2: the integer transform shown as a matrix multiplication block, with the matrix entries 1, 2, 1, 1 visible. Stage 3: an intermediate grid. Stage 4: the same integer transform applied to columns. Stage 5: a coefficient grid with most energy in the top-left. Beneath the pipeline, a small note reads 'All operations are addition, subtraction, and right-shift only. No multiplications. Every decoder gets bit-exact identical output.'

Figure 3. H.264's integer transform pipeline. The 2D transform is two 1D passes, rows then columns, each using a tiny integer matrix. The split removes inverse-transform drift across decoders forever.

DST and ADST — better tools for edges

The DCT-II is good for smooth content where the residual decays gently away from the centre. It is not as good for residual blocks that look like a one-sided slope — pixels small on one side, large on the other — which is exactly the shape an intra-prediction error takes. Intra prediction copies pixels from the top and left edges of the block; the predictor is nearly perfect right next to those edges and gets progressively worse as you move away. The residual is small near the top-left and large near the bottom-right.

For that shape, the Discrete Sine Transform, or DST, is mathematically a better fit. The DST is the same idea as the DCT but uses sine waves that vanish at one end of the block instead of cosines that are flat at one end. Its basis functions are tilted in a way that compacts an asymmetric residual into fewer coefficients.

HEVC, finalised in 2013, was the first standard to ship a DST in production. HEVC defines a 4×4 integer DST-VII transform — the seventh of the DST family — that the encoder must apply, by rule, to 4×4 luma residual blocks inside intra-predicted regions. The HEVC authors restricted DST-VII to 4×4 luma intra because that is where the win is biggest; on larger blocks or chroma the gain shrank below the cost of supporting two transform types side by side. The Joint Collaborative Team on Video Coding measured a roughly 1% bitrate reduction at constant quality from this rule alone (Sze et al., 2014).

The next codec, VP9, finalised by Google in 2013, took the same observation and went further. VP9 introduced the Asymmetric Discrete Sine Transform, or ADST, as a sibling to the DCT. ADST is closely related to DST-VII but is implemented in a way that maps cleanly to integer arithmetic and integrates with VP9's intra-prediction modes. The VP9 encoder uses ADST on intra residuals that come from a directional predictor, and DCT on inter residuals and on flat directions. The decoder needs no dedicated transform flag: it infers the transform type from the signalled intra prediction mode.

The reason ADST helps is the same reason DST-VII helps in HEVC — its basis functions tilt in the same direction as the typical intra residual, so the energy compaction is tighter. The reason VP9 calls it "asymmetric" is that the basis functions are not symmetric around the centre of the block; they grow toward one side, matching the asymmetric shape of the residual.
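A quick way to see the effect is to feed a one-sided ramp, the idealised shape of an intra residual, through both transforms and compare how much energy the single largest coefficient captures. This sketch uses floating-point orthonormal DCT-II and DST-VII definitions rather than the integer matrices of HEVC or VP9, and the ramp values are invented for illustration.

```python
import math

def dct2_basis(n):
    """Orthonormal DCT-II basis, one row per frequency."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def dst7_basis(n):
    """Orthonormal DST-VII basis: sines that start near zero and grow."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * k + 1) * (i + 1) / (2 * n + 1))
             for i in range(n)] for k in range(n)]

def transform(basis, signal):
    """Project the signal onto every basis function."""
    return [sum(b * x for b, x in zip(row, signal)) for row in basis]

def top_share(coeffs):
    """Fraction of total energy held by the largest coefficient."""
    energy = [c * c for c in coeffs]
    return max(energy) / sum(energy)

# Small near the predicted edge, large far from it (hypothetical values).
ramp = [1.0, 2.0, 3.0, 4.0]

print("DCT-II  top coefficient share:", round(top_share(transform(dct2_basis(4), ramp)), 3))
print("DST-VII top coefficient share:", round(top_share(transform(dst7_basis(4), ramp)), 3))
```

On this ramp the largest DST-VII coefficient holds a noticeably larger share of the energy than the largest DCT-II coefficient, which is the tighter compaction the paragraph above describes.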

AV1 — sixteen transforms and a learned chooser

AV1, finalised by the Alliance for Open Media in March 2018, kept the idea of multiple transform types and pushed it to its logical limit. AV1 defines four 1D transform kernels: DCT, ADST, FLIPADST, and IDTX.

DCT is the standard cosine transform we already met. ADST is the asymmetric sine transform inherited and refined from VP9. FLIPADST is the flipped version of ADST — same basis functions, run from bottom to top instead of top to bottom — designed to fit residuals whose energy slopes the other way. IDTX, short for identity transform, leaves the input unchanged. IDTX is useful for screen content (text, line art, sharp computer-generated graphics) where the residual already looks like a small set of isolated spikes and a real frequency transform would smear them.

AV1 pairs these four kernels in both the horizontal and vertical directions, giving up to 16 distinct 2D transforms per block. The full menu applies only to the smaller transform sizes; larger blocks use a reduced set to keep the rate-distortion search tractable. ADST and FLIPADST kernels are defined only up to 16 points and IDTX up to 32, so 64×64 blocks are DCT-only. The transform type is signalled in the bitstream per block, which is one of the reasons AV1 bitstreams have so many flags per block compared to H.264.

The cost of having sixteen transforms is encoder complexity. The encoder has to try each candidate, run quantization, estimate the bit cost, and pick the winner — a process called rate-distortion optimization, or RDO, that we cover in mode decision and RDO. The benefit is roughly 5–7% bitrate reduction at the same quality on natural content compared to a single-transform baseline, climbing to 15–20% on screen content where IDTX shines.
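The search loop can be sketched as follows. This is a toy model, not libaom's algorithm: the ADST kernel is a floating-point DST-VII stand-in, the identity kernel is unscaled, the rate is approximated by the number of non-zero quantised coefficients, and the step size and lambda are arbitrary values chosen for illustration.

```python
import math
from itertools import product

N = 4

def dct_kernel(n):
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def adst_kernel(n):
    """Stand-in for ADST: orthonormal DST-VII (an assumption of this sketch)."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * k + 1) * (i + 1) / (2 * n + 1))
             for i in range(n)] for k in range(n)]

KERNELS = {
    "DCT": dct_kernel(N),
    "ADST": adst_kernel(N),
    "FLIPADST": [row[::-1] for row in adst_kernel(N)],  # same waves, reversed
    "IDTX": [[1.0 if r == c else 0.0 for c in range(N)] for r in range(N)],
}

def transform_2d(block, vert, horiz):
    """Apply the `vert` kernel down columns and the `horiz` kernel along rows."""
    v, h = KERNELS[vert], KERNELS[horiz]
    tmp = [[sum(v[u][i] * block[i][j] for i in range(N)) for j in range(N)]
           for u in range(N)]
    return [[sum(tmp[u][j] * h[w][j] for j in range(N)) for w in range(N)]
            for u in range(N)]

def rd_cost(block, vert, horiz, step=10.0, lam=50.0):
    """Toy cost J = D + lambda * R with R = count of non-zero levels."""
    distortion, rate = 0.0, 0
    for row in transform_2d(block, vert, horiz):
        for c in row:
            level = math.floor(abs(c) / step + 0.5)   # quantised magnitude
            distortion += (abs(c) - level * step) ** 2
            rate += 1 if level else 0
    return distortion + lam * rate

def best_transform(block):
    """Try all 16 (vertical, horizontal) pairs and keep the cheapest."""
    return min(product(KERNELS, repeat=2), key=lambda pair: rd_cost(block, *pair))

flat = [[8] * N for _ in range(N)]
spike = [[0] * N for _ in range(N)]
spike[1][2] = 40   # one isolated spike, as in screen content

print("flat block  ->", best_transform(flat))
print("spike block ->", best_transform(spike))
```

Because every kernel here is orthonormal, distortion measured on the coefficients equals distortion measured on the pixels, which keeps the sketch short. In this toy model the flat block selects DCT in both directions and the isolated spike selects the identity transform in both, matching the screen-content behaviour described above.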

| Transform | Best for | Available in | Block sizes |
|---|---|---|---|
| DCT-II | Smooth residuals, inter | H.261, MPEG-2, H.264, HEVC, VP9, AV1, VVC | 4×4 to 64×64 |
| DST-VII | 4×4 intra luma | HEVC, VVC | 4×4 (HEVC), 4×4 to 32×32 (VVC) |
| ADST | Directional intra | VP9, AV1 | 4×4 to 16×16 |
| FLIPADST | Reverse-slope intra | AV1 | 4×4 to 16×16 |
| IDTX (identity) | Screen content, isolated spikes | AV1 (and VVC TS mode) | 4×4 to 32×32 |
| DCT-VIII | Smaller intra blocks | VVC | 4×4 to 32×32 |

Table 1. The transform menu of modern codecs. Each codec generation adds either a new kernel or a new block size — the underlying idea is unchanged since 1972.

VVC — multiple transform selection and a secondary stage

VVC, ratified as ITU-T H.266 in July 2020, redesigned the transform stage with two parallel innovations: Multiple Transform Selection (MTS) and the Low-Frequency Non-Separable Transform (LFNST).

MTS gives the VVC encoder three transform types — DCT-II, DCT-VIII, DST-VII — and lets it pick the best one for each block independently in the horizontal and vertical directions. Block sizes go from 4×4 to 64×64 for DCT-II and up to 32×32 for DCT-VIII and DST-VII. That is the same idea AV1 ships, with a slightly different menu of kernels.
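The two extra kernels are closely related: DCT-VIII is DST-VII read backwards, with the sign of every odd-numbered basis function flipped. The sketch below checks that identity numerically using floating-point orthonormal definitions; VVC itself specifies integer matrices, and the function names are ours.

```python
import math

def dst7(n):
    """Orthonormal DST-VII: basis functions start near zero and grow."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * k + 1) * (i + 1) / (2 * n + 1))
             for i in range(n)] for k in range(n)]

def dct8(n):
    """Orthonormal DCT-VIII: basis functions start large and decay."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.cos(math.pi * (2 * k + 1) * (2 * i + 1) / (4 * n + 2))
             for i in range(n)] for k in range(n)]

def flipped_with_signs(basis):
    """Reverse each basis function and negate the odd-indexed ones."""
    return [[(-1) ** k * v for v in reversed(row)]
            for k, row in enumerate(basis)]

def max_difference(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

for n in (4, 8, 16, 32):
    print(n, max_difference(dct8(n), flipped_with_signs(dst7(n))))
```

The printed differences sit at floating-point noise level. In practical terms, DST-VII suits residuals that grow away from the block edge and DCT-VIII suits residuals that shrink, so together they cover both slopes in each direction.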

LFNST is the part that makes VVC different. It runs after the primary transform and before quantization, on the low-frequency coefficients only (the top-left 4×4 or 8×8 corner of the primary transform output). LFNST is non-separable: it does not split into a row pass and a column pass; it applies a single matrix to a flattened coefficient vector, 16 elements in the 4×4 case and 48 in the 8×8 case, where the final standard leaves out the bottom-right 4×4 quadrant of the corner. Non-separable transforms can capture correlations the row-then-column structure of a separable DCT misses, particularly in directional intra residuals where energy lies along a diagonal.

The price is that non-separable transforms cost more arithmetic. LFNST limits the damage by running only on the low-frequency corner, at most 48 input coefficients per block, and by storing only a handful of small matrices selected by the intra prediction mode. The Joint Video Experts Team measured an additional ~1–2% bitrate reduction from LFNST alone on intra content (Wang et al., 2021).

LFNST has one important interaction: when LFNST is on, MTS is forced to DCT-II. The two innovations are not stacked — they alternate, chosen per block.
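To make "non-separable" concrete: any separable 2D transform can be rewritten as one large matrix applied to the flattened block, and that matrix always has a rigid Kronecker-product structure. A non-separable transform is free to use any matrix of that size. The sketch below builds the Kronecker form of a 4×4 DCT-II and counts multiplications for both styles; it does not contain any real LFNST kernel, since those are trained matrices defined in the standard.

```python
import math

def dct2_matrix(n):
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def kron(a, b):
    """Kronecker product of two square matrices."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def apply_flat(matrix, block):
    """Non-separable style: one matrix times the row-major flattened block."""
    vec = [v for row in block for v in row]
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def separable_flat(block):
    """Row pass then column pass (C * X * C^T), flattened for comparison."""
    n = len(block)
    c = dct2_matrix(n)
    tmp = [[sum(c[u][i] * block[i][j] for i in range(n)) for j in range(n)]
           for u in range(n)]
    out = [[sum(tmp[u][j] * c[v][j] for j in range(n)) for v in range(n)]
           for u in range(n)]
    return [v for row in out for v in row]

def multiplies(n):
    """Multiplications for an n x n block, ignoring fast algorithms."""
    return {"separable": 2 * n ** 3, "non_separable": n ** 4}

block = [[4, -3, 2, 1], [-2, 5, -1, 3], [1, -2, 4, -1], [-3, 2, -1, 5]]
big = kron(dct2_matrix(4), dct2_matrix(4))          # 16 x 16 matrix
same = max(abs(a - b) for a, b in zip(separable_flat(block), apply_flat(big, block)))
print("max difference:", same)
print(multiplies(4), multiplies(8))
```

The multiplication counts show why the non-separable stage is confined to a small low-frequency corner: the cost grows with the fourth power of the block width, against the third power for the separable form.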

Three-stage flow diagram for VVC. Stage 1, labelled 'Residual', shows a 16 by 16 grid of pixel-domain residual values. Stage 2, labelled 'Primary transform (MTS)', shows the encoder picking between DCT-II, DCT-VIII, and DST-VII for both rows and columns. Stage 3 has two branches: branch A goes directly to quantization; branch B routes the top-left 8 by 8 corner through an LFNST box labelled 'Non-separable, intra only, picks from a small library' and then to quantization. Underneath, a footnote reads 'When LFNST is on, the primary transform must be DCT-II. The encoder chooses A or B per block.'

Figure 4. The VVC transform stage. Multiple Transform Selection picks the primary kernel; the optional Low-Frequency Non-Separable Transform decorrelates the low-frequency coefficients further before quantization.

AV2 — data-driven kernels and intra-inter secondary transforms

AV2, AOMedia's successor to AV1, is in active draft as of May 2026. The transform stage is one of the areas with the biggest planned changes. The published technical reports point to four directions.

First, AV2 redesigns the primary transform kernels for DCT, DST, and ADST so the integer matrices line up better with the residual statistics measured on large modern video corpora. Second, AV2 introduces data-driven transforms (DDTs): kernels trained offline on real video and embedded in the standard, similar in spirit to the learned components of a neural codec but encoded as fixed matrices so decoders remain deterministic. Third, AV2 adds intra/inter secondary transforms (IST), a non-separable second stage analogous to VVC's LFNST but available for both intra and inter blocks. Fourth, AV2 expands the transform partitioning framework so the encoder can choose finer transform-block splits inside a coded block, matching the transform size to the local residual structure (Nadir et al., arXiv:2601.02712, 2026).

The expected combined gain on the transform stage alone is in the 3–6% bitrate reduction range over AV1 at constant quality on natural content, with larger wins on screen content. The total AV2 vs AV1 target, including all coding tools, is 30–40% bitrate reduction — a moving figure as the standard is not yet frozen.

How the transform interacts with everything else

The transform stage does not stand alone. Three downstream stages depend on its choices.

Quantization divides each coefficient by a step size and rounds. Larger step sizes mean fewer surviving non-zero coefficients, which means fewer bits. The transform decides what the input to quantization looks like; the better the energy compaction, the more coefficients can be safely zeroed out. We unpack this in the next article: quantization.
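The divide-and-round step is short enough to write out. This sketch uses plain round-to-nearest on the magnitude; production encoders typically add a dead zone and per-frequency weighting, which are left out here. The coefficient grid is hypothetical, invented to have the usual shape of a big top-left corner and a small tail.

```python
import math

def quantise(coeff, step):
    """Divide by the step size and round the magnitude to the nearest level."""
    return int(math.copysign(math.floor(abs(coeff) / step + 0.5), coeff))

def dequantise(level, step):
    """Decoder side: multiply the level back up by the step size."""
    return level * step

def surviving(coeffs, step):
    """How many coefficients are still non-zero after quantisation."""
    return sum(1 for row in coeffs for c in row if quantise(c, step) != 0)

# Hypothetical transform output, not taken from a real encoder.
coeffs = [[120.0, -31.0, 9.0, 2.0],
          [-26.0, 11.0, -4.0, 1.0],
          [7.0, -3.0, 2.0, 0.0],
          [2.0, 1.0, 0.0, -1.0]]

for step in (4, 16, 64):
    print("step", step, "->", surviving(coeffs, step), "non-zero coefficients")

print(quantise(-31.0, 16), dequantise(quantise(-31.0, 16), 16))   # -2 -32
```

For this grid, steps of 4, 16, and 64 leave 11, 5, and 1 non-zero coefficients respectively. The last line shows where the loss comes from: -31 is sent as level -2 and comes back as -32.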

Reordering and run-length coding scan the coefficient grid in a zig-zag pattern that follows the rising frequency from the top-left corner outward. The aim is to collect the long tail of zeros into one run, which the entropy coder then compresses to almost nothing. Different transforms produce different zig-zag patterns; for ADST and FLIPADST, the scan order is rotated to match the basis functions' asymmetry. We covered this in reordering, zig-zag, and run-length coding.
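The classic zig-zag order can be generated rather than stored as a table. The sketch below produces the traditional JPEG and H.264 frame-scan order for a square block and then trims the trailing zeros; newer codecs use other, transform-dependent scans that are not modelled here. The quantised levels are hypothetical.

```python
def zigzag_order(n):
    """(row, column) positions in classic zig-zag order for an n x n block."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Walk anti-diagonals (constant r + c), alternating direction each time.
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def scan(levels):
    """Read a 2D grid of quantised levels out in zig-zag order."""
    return [levels[r][c] for r, c in zigzag_order(len(levels))]

def trim_trailing_zeros(seq):
    """Split a scan into its useful prefix and the length of the zero tail."""
    end = len(seq)
    while end > 0 and seq[end - 1] == 0:
        end -= 1
    return seq[:end], len(seq) - end

# Hypothetical quantised levels: energy in the top-left, zeros elsewhere.
levels = [[7, -2, 1, 0],
          [-2, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]

prefix, zero_run = trim_trailing_zeros(scan(levels))
print(prefix, "followed by", zero_run, "zeros")
```

Here the scan yields six values to code and one run of ten zeros, which the entropy coder can signal with a single end-of-block style symbol instead of ten separate numbers.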

Entropy coding (CABAC in H.264/HEVC/VVC, a multi-symbol adaptive arithmetic coder in AV1) converts the quantized coefficients and their positions into the final bits. Compact, mostly-zero coefficient grids feed the entropy coder content it can compress aggressively; spread-out grids do not. We covered this in entropy coding in detail.

The four stages (transform, quantization, reordering, entropy coding) are designed as a chain. Changing one without retuning the others forfeits much of the gain; that is why each new codec generation redesigns all four together rather than tweaking a single block.

Linear pipeline diagram with five rectangles connected by arrows. Box 1: 'Residual', a small 8 by 8 grid of mixed numbers. Box 2: 'Transform', with sub-labels DCT/DST/ADST. Box 3: 'Quantize', with a small divide-and-round icon. Box 4: 'Reorder (zig-zag)', with a serpentine arrow. Box 5: 'Entropy code', terminating in a 'bitstream' symbol. Underneath each box, a one-line caption explains what the box buys: 'concentrate energy', 'drop the small stuff', 'collect the zeros', 'spend bits where they matter'.

Figure 5. The coding chain in every modern codec. Transform, quantize, reorder, entropy: four stages that have to be redesigned together to extract a real coding gain.

A common mistake — using a high QP and blaming the transform

Engineers new to video tuning sometimes see banding or blockiness in a transcode and conclude the transform is to blame. It almost never is. The transform itself is invertible: an orthonormal DCT, or a codec's integer transform paired with its matching inverse and scaling, reconstructs the residual exactly or to within rounding, so no picture information is discarded at this stage. Loss happens at the next stage, quantization, when the encoder divides coefficients by a step size and rounds.
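The claim is easy to test: run a block through the transform and straight back, then repeat with a quantiser in the middle. This sketch uses a floating-point orthonormal DCT-II and an arbitrary step size of 4, both chosen for illustration; the block is the noisy residual from the worked example.

```python
import math

def dct2_matrix(n):
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def forward(block):
    """2D DCT-II: C * X * C^T."""
    c = dct2_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def inverse(coeffs):
    """Inverse of the orthonormal transform: C^T * Y * C."""
    c = dct2_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)

def quantise_grid(coeffs, step):
    """Round every coefficient to the nearest multiple of the step size."""
    return [[math.copysign(math.floor(abs(v) / step + 0.5), v) * step for v in row]
            for row in coeffs]

def max_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

noisy = [[4, -3, 2, 1], [-2, 5, -1, 3], [1, -2, 4, -1], [-3, 2, -1, 5]]

transform_only = inverse(forward(noisy))
with_quantiser = inverse(quantise_grid(forward(noisy), 4.0))

print("transform only :", max_error(noisy, transform_only))
print("with quantiser :", max_error(noisy, with_quantiser))
```

The transform-only round trip returns the block to within floating-point noise, while the quantised path shows a clearly visible error. All of the loss enters at the rounding step.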

If the picture looks blocky, the transform did its job and the quantizer was set too aggressively. If you see ringing around edges, the transform produced legitimate high-frequency coefficients and the quantizer killed them. That is the Gibbs phenomenon; it is predictable, and the fix is to lower QP or raise the bit budget on edges. If you see colour banding, the transform handed clean coefficients to a quantizer with too few output levels; fix it with a higher bit depth, such as a 10-bit encode. The transform is the photographer; the quantizer is the bin that throws away the bad shots. Aim your fixes at the bin.

Where Fora Soft fits in

Across the streaming, surveillance, conferencing, and AR/VR products we ship, the transform stage choice shapes a third of the picture-quality result and a quarter of the encode-time budget. In live OTT pipelines we lean on HEVC's DST-VII for intra-heavy content and on AV1's full sixteen-transform menu for premium VOD where encode time is paid for once and recouped over millions of plays. In WebRTC SFUs we keep the H.264 4×4 integer transform because its bit-exact decoders and tiny per-block cost matter more than the last few percent of compression. In surveillance composites we use VVC's MTS with LFNST disabled because the gain from non-separable transforms collapses on cropped sub-frames. The point is that the transform is a knob — set per workload — not a default to inherit from a vendor preset.

References

  1. Ahmed, N., Natarajan, T. R., Rao, K. R. "Discrete Cosine Transform." IEEE Transactions on Computers, January 1974. https://www.cse.iitd.ac.in/~pkalra/col783-2017/DCT-History.pdf
  2. Roese, J. A., Robinson, G. S. "Combined spatial and temporal coding of digital image sequences." Proceedings of SPIE, 1975. (First DCT-based inter-frame video coder.)
  3. ITU-T. "Recommendation H.264: Advanced video coding for generic audiovisual services." Edition 14, 2024. Sections 8.5 (Transform), 8.6 (Scaling).
  4. Vcodex / Richardson, I. "H.264/AVC 4x4 Transform and Quantization." Vcodex, 2024. https://www.vcodex.com/h264avc-4x4-transform-and-quantization
  5. Sze, V., Budagavi, M., Sullivan, G. (eds.). "High Efficiency Video Coding (HEVC): Algorithms and Architectures." Chapter "Transform and Quantization." Springer, 2014.
  6. Han, J. et al. "A Technical Overview of AV1." arXiv:2008.06091, 2020. https://arxiv.org/pdf/2008.06091
  7. AOMedia. "AV1 Bitstream & Decoding Process Specification." Section on transforms. https://aomediacodec.github.io/av1-spec/
  8. Mukherjee, D. et al. "A Butterfly Structured Design of the Hybrid Transform Coding Scheme." Google Research, 2013. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41418.pdf
  9. OTTVerse. "Explanation of the Block-based Transforms in VVC." 2023. https://ottverse.com/block-based-transforms-in-vvc-versatile-video-coding/
  10. Suverov, K. "Low-Frequency Non-Separable Transform (LFNST) in VVC." Vicuesoft, 2021. https://medium.com/vicuesoft-techblog/low-frequency-non-separable-transform-lfnst-in-vvc-6facc822be8f
  11. Nadir, Z. et al. "Transform and Entropy Coding in AV2." arXiv:2601.02712, 2026. https://arxiv.org/abs/2601.02712
  12. ITU-T. "Recommendation H.266: Versatile video coding." Edition 2, 2022. Sections on primary transform, MTS, and LFNST.
  13. Nasir Ahmed. "How I Came Up with the Discrete Cosine Transform." Digital Signal Processing, 1991. https://www.cse.iitd.ac.in/~pkalra/col783-2017/DCT-History.pdf
  14. IEEE Spectrum. "Nasir Ahmed Pioneered Digital Compression Algorithms." 2018. https://spectrum.ieee.org/compression-algorithms