Published 2026-05-16 · 16 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
Every codec decision your team makes — keyframe interval, bitrate ceiling, encoder preset, hardware versus software, even which codec to ship — is partly a bet on how well that codec will exploit the redundancy inside each frame. If you understand spatial redundancy, the rest of the encoder stops being a mystery: you can read a quality complaint and know whether the fix lives in intra coding or somewhere else, you can argue for or against AV1 over H.264 in concrete numbers rather than vibes, and you can talk to engineers about banding, blocking, and mosquito noise without sounding like you read the words off a slide. The audience for this article is the founder, product manager, marketing lead, or operations person who has zero prior knowledge of how compression actually works and needs to know enough to make decisions and ship product. By the end you will be able to explain spatial redundancy to a colleague, point at a frame and tell them which areas the codec will love and which it will hate, and trace a one-line connection from "the sky in this frame is mostly the same shade of blue" all the way to "we save five megabytes per second of bandwidth at the same picture quality".
The repetition that pays for video
Before any codec ever runs, a video frame already wastes most of its space. A 1920×1080 frame in standard 4:2:0 chroma subsampling carries roughly 1920 × 1080 × 1.5 = 3.11 million samples per frame, and at 8 bits per sample that is 24.9 megabits, or about 3.1 megabytes for one still frame. At 24 frames per second, uncompressed 1080p flows at roughly 600 megabits per second. (See bitrate math: uncompressed vs compressed for the full arithmetic.) Nothing about that picture earns the 3.1 megabytes. Most of those samples are repetitions of their neighbours.
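The arithmetic above is easy to reproduce. The sketch below is only a calculator for the figures in this paragraph; the function name and the `chroma_factor` parameter are illustrative choices, with 1.5 standing in for 4:2:0 subsampling.

```python
def raw_frame_bits(width, height, bits_per_sample=8, chroma_factor=1.5):
    """Bits in one uncompressed frame; chroma_factor=1.5 models 4:2:0."""
    samples = width * height * chroma_factor
    return samples * bits_per_sample

frame_bits = raw_frame_bits(1920, 1080)
print(f"{frame_bits / 1e6:.1f} Mbit per frame")      # 24.9 Mbit per frame
print(f"{frame_bits / 8 / 1e6:.1f} MB per frame")    # 3.1 MB per frame
print(f"{frame_bits * 24 / 1e6:.0f} Mbit/s at 24 fps")  # 597 Mbit/s at 24 fps
```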
A useful analogy: imagine you photograph an empty conference room. The wall behind the whiteboard is a single colour for thousands of pixels in a row. Your phone, dutifully, recorded that colour thousands of times. A compressor that notices the wall is one colour can record the colour once and the size of the wall, and reconstruct the whole patch from those two numbers. That is the entire idea of spatial redundancy, and almost everything a codec does within a single frame is a more sophisticated version of it.
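The "record the colour once, plus the size of the wall" idea is run-length encoding in its simplest form. The sketch below applies it to one hypothetical row of pixels; real video codecs do not run-length encode raw pixels like this, so treat it purely as an illustration of the analogy.

```python
from itertools import groupby

def run_length_encode(pixels):
    """Collapse each run of identical values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def run_length_decode(runs):
    """Expand (value, count) pairs back into the original pixel row."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical row: grey wall, a dark whiteboard frame, more grey wall.
row = [200] * 1000 + [35] * 20 + [200] * 900
runs = run_length_encode(row)
print(runs)  # [(200, 1000), (35, 20), (200, 900)]
```

Decoding the three pairs gives back all 1,920 values exactly, which is the whole point: the repetition carried no information.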
The technical name is spatial redundancy: the correlation between pixels that are close to one another inside the same frame. 1 When we remove that redundancy inside a single frame, the operation is called intra-frame compression or simply spatial compression. "Intra" is Latin for "within"; the codec is staying inside one frame and looking only at the picture in front of it. The opposite is inter-frame compression, which looks across frames and is the subject of its own article on temporal pixel correlation.
How much redundancy is there really? An influential measurement by Petrov and Zhaoping (2003) studied a large set of natural photographs and found that two-pixel correlations alone account for about 50% of the information redundancy in natural images; adding three-pixel correlations only buys another 4%. 2 In other words, the dominant signal in any photograph is "this pixel looks like the one next to it", and the second-order signal is "this pair of pixels looks like the next pair". Codec engineers have known this for decades, and intra coding is the result.
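To get a feel for "this pixel looks like the one next to it", you can measure the correlation between horizontally adjacent pixels. The sketch below does that on two synthetic images, a smooth gradient and pure noise; both images and the function name are assumptions made for illustration, and this is not the method used in the cited study.

```python
import random

def neighbour_correlation(image):
    """Pearson correlation between each pixel and its right-hand neighbour."""
    left = [row[x] for row in image for x in range(len(row) - 1)]
    right = [row[x + 1] for row in image for x in range(len(row) - 1)]
    n = len(left)
    mean_l, mean_r = sum(left) / n, sum(right) / n
    cov = sum((a - mean_l) * (b - mean_r) for a, b in zip(left, right))
    var_l = sum((a - mean_l) ** 2 for a in left)
    var_r = sum((b - mean_r) ** 2 for b in right)
    return cov / (var_l * var_r) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
size = 64
# Smooth gradient with a little sensor-like noise: neighbours are nearly equal.
gradient = [[2 * x + y + rng.randint(-2, 2) for x in range(size)] for y in range(size)]
# Pure noise: neighbours tell you nothing about each other.
noise = [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]

print(f"gradient: {neighbour_correlation(gradient):.3f}")
print(f"noise:    {neighbour_correlation(noise):.3f}")
```

The gradient scores close to 1 and the noise close to 0. Natural photographs sit much nearer the gradient end, which is why intra coding pays off.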
Figure 1. Spatial redundancy on a real frame. Flat regions and smooth gradients dominate the area; high-information blocks cluster around object edges and fine textures.
How the codec actually removes the redundancy
A modern intra encoder removes spatial redundancy in three stages. Stage one is block partitioning: the frame is sliced into small squares so that each square can be handled independently. Stage two is intra prediction: the encoder guesses what each block looks like from the blocks already encoded above and to its left, then stores only the difference between guess and reality (the "residual"). Stage three is transform and quantisation: the residual is converted to frequency coefficients (via the discrete cosine transform or a near relative), and the small high-frequency coefficients are rounded heavily or to zero. What gets written to disk is the index of the prediction mode plus a short list of non-zero coefficients.
Each of those stages is covered in more depth in its own dedicated article: hybrid codec architecture, intra-frame coding, transform coding, and quantization. The job of this article is to make all three concrete enough that you understand what a codec is doing when you watch its encoder log scroll past.
Stage 1 — Block partitioning
Every modern codec starts by cutting the frame into a grid of square blocks. The size of the largest block has grown with every codec generation. H.264 uses blocks called macroblocks of 16×16 pixels (and sub-divides them down to 4×4). H.265 / HEVC introduced the Coding Tree Unit (CTU), up to 64×64 pixels. H.266 / VVC pushes the CTU to 128×128. AV1 uses superblocks of up to 128×128 pixels. 3 Bigger blocks help in the flat regions — a large patch of sky can be predicted in one shot — while smaller blocks help around edges and texture, where the picture changes quickly and a coarse prediction would be wrong.
Inside each block, the codec decides recursively whether to split further. Think of a sheet of newspaper laid over the frame: where the picture is plain, the encoder leaves one big sheet; where it is busy, it tears the sheet into smaller pieces. The split decisions themselves cost bits to signal, so the encoder is always balancing "smaller blocks predict the picture better" against "smaller blocks take more bits to describe".
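The recursive split can be sketched as a quadtree. In the example below the split test is simply "is the range of pixel values above a threshold", which is a stand-in assumption; a real encoder weighs prediction quality against signalling cost. The names `partition`, `min_size` and `max_range` are illustrative.

```python
def partition(block, x=0, y=0, min_size=4, max_range=8):
    """Recursively split a square block until each leaf is nearly flat.

    Returns a list of (x, y, size) leaves. The flatness test (max - min)
    is a simplified stand-in for an encoder's rate-distortion decision.
    """
    size = len(block)
    values = [v for row in block for v in row]
    if size <= min_size or max(values) - min(values) <= max_range:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = [row[dx:dx + half] for row in block[dy:dy + half]]
            leaves += partition(sub, x + dx, y + dy, min_size, max_range)
    return leaves

# 16x16 region: flat "sky" with one bright 4x4 detail in the bottom-right corner.
region = [[100] * 16 for _ in range(16)]
for yy in range(12, 16):
    for xx in range(12, 16):
        region[yy][xx] = 220

leaves = partition(region)
print(leaves)
```

Three quadrants stay as single 8×8 leaves, and only the quadrant containing the detail is torn into 4×4 pieces: seven leaves in total instead of sixteen.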
Stage 2 — Intra prediction (the workhorse)
Once a block is chosen, the encoder looks at the strip of pixels along the top edge and along the left edge of that block — the reference pixels — and uses them to guess the contents of the block. The reference pixels are already decoded by the time the current block is processed, because the decoder will scan the frame in the same order (top-left to bottom-right), so the same reference pixels will be available on the decoding side. Whatever rule the encoder used to guess can be sent to the decoder as a small index, and the decoder will reproduce the same guess from the same references. Only the difference between guess and truth needs to travel.
Almost every codec offers three families of prediction:
- DC mode — fill the whole block with a single number, usually the average of the reference pixels. Useful for blocks that really are one flat shade.
- Planar / smooth mode — fill the block with a smooth gradient interpolated from the references. Useful for sky, walls in soft lighting, skin tones.
- Directional / angular modes — pretend the block has a strong directional texture (vertical edge, 45° edge, horizontal grain) and replicate the reference pixels along that direction into the block. Useful for fences, brick walls, hair, grass.
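A minimal sketch of the idea, assuming a 4×4 block and only three modes (DC, pure vertical, pure horizontal). Planar and the angled modes are left out to keep it short, and the cost measure (sum of absolute differences) is a simplification of what real encoders use. All function names are illustrative.

```python
def predict_dc(top, left):
    """DC mode: fill the block with the rounded mean of the reference pixels."""
    n = len(top)
    dc = round((sum(top) + sum(left)) / (2 * n))
    return [[dc] * n for _ in range(n)]

def predict_vertical(top, left):
    """Vertical mode: copy the row above straight down."""
    return [list(top) for _ in top]

def predict_horizontal(top, left):
    """Horizontal mode: copy the left-hand column straight across."""
    return [[value] * len(left) for value in left]

MODES = {"DC": predict_dc, "vertical": predict_vertical, "horizontal": predict_horizontal}

def choose_mode(block, top, left):
    """Return (mode name, residual) for the mode with the smallest absolute error."""
    def cost(pred):
        return sum(abs(b - p) for rb, rp in zip(block, pred) for b, p in zip(rb, rp))
    name = min(MODES, key=lambda m: cost(MODES[m](top, left)))
    pred = MODES[name](top, left)
    residual = [[b - p for b, p in zip(rb, rp)] for rb, rp in zip(block, pred)]
    return name, residual

def reconstruct(name, residual, top, left):
    """Decoder side: rebuild the same prediction, then add the residual."""
    pred = MODES[name](top, left)
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(pred, residual)]

# A block of vertical stripes, like a fence; one column is slightly off.
top = [10, 200, 10, 200]
left = [10, 10, 10, 10]
block = [[10, 200, 12, 200] for _ in range(4)]

name, residual = choose_mode(block, top, left)
print(name, residual)  # vertical [[0, 0, 2, 0], [0, 0, 2, 0], [0, 0, 2, 0], [0, 0, 2, 0]]
```

The decoder never sees the block itself. It receives the mode name and the small residual, rebuilds the same prediction from the same reference pixels, and gets the original back.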
The number of directional modes has exploded over time. H.264 has 9 prediction modes for a 4×4 block (8 angles plus DC) and 4 modes for a 16×16 block. 4 H.265 / HEVC has 35 modes: 33 angular plus DC plus planar. 5 AV1 has 56 directional modes (8 nominal directions, each with 7 fine-tuned offsets at 3° steps from −9° to +9°) plus DC, three Smooth modes (AV1's counterpart to planar), the Paeth predictor, and Chroma-from-Luma, which comes to 62 intra modes in total, though the figure depends on how you count. 6 H.266 / VVC has 67 directional/planar/DC modes, plus Matrix-based Intra Prediction (MIP), Multi-Reference Line (MRL), Intra Sub-Partitions (ISP), and Cross-Component Linear Model (CCLM) chroma prediction. 7
Each step up the codec ladder is partly an answer to one question: given a block of pixels in a natural photograph, can we describe it with fewer bits if we let the encoder choose from more prediction shapes? The answer turned out to be yes, again and again. The 56 directions in AV1 versus 33 in HEVC are a direct response to evidence that natural images contain a richer mix of edge orientations than HEVC's coarse grid could capture.
Figure 2. The main families of intra prediction. A modern codec tries every applicable mode on every block and keeps the one that produces the smallest residual at the lowest signalling cost.
Stage 3 — Transform and quantisation (the cleanup)
After prediction, the encoder is left with a small block of leftover error: the residual. If the prediction was good, the residual is mostly zeros with a few non-zero numbers near object edges. If the prediction was bad, the residual carries most of the original block's energy. Either way, the codec applies a mathematical transform (almost always the discrete cosine transform, DCT, or one of its cousins) that re-expresses the 8×8 or 16×16 residual block as 64 or 256 frequency coefficients. The top-left coefficient is the DC value (proportional to the block's average, which for a residual is the mean prediction error); the rest are AC coefficients ordered from low frequency to high.
Why bother? Because natural picture content has another statistical bias: its energy concentrates in the low-frequency coefficients. In a typical 8×8 DCT block, around 90% of the energy ends up in the first 25 coefficients, and the remaining 39 coefficients are near zero. 8 Quantisation rounds those tiny high-frequency coefficients to zero, the remaining non-zeros are written in zig-zag order, and a long run of trailing zeros gets compressed into almost nothing by the entropy coder. (Entropy coding is the topic of entropy coding intro.)
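The quantise-then-scan step can be sketched in a few lines. The 4×4 coefficient values and the quantisation step below are made up for illustration, and real codecs use more elaborate scan orders and quantiser tables; the point is only to show how a block collapses into a short list once the trailing zeros are dropped.

```python
def zigzag_order(n):
    """(row, col) pairs of an n x n block in zig-zag order, low to high frequency."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk the anti-diagonals in turn, alternating direction on each one.
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )

def quantise_and_scan(coeffs, step):
    """Divide by the step, round, read in zig-zag order, drop trailing zeros."""
    scanned = [round(coeffs[r][c] / step) for r, c in zigzag_order(len(coeffs))]
    while scanned and scanned[-1] == 0:
        scanned.pop()
    return scanned

# Hypothetical transform output: energy piled into the top-left corner.
coeffs = [
    [52.0, 9.0, 1.2, 0.3],
    [-7.5, 3.1, 0.4, 0.0],
    [1.0, 0.5, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.0],
]
print(quantise_and_scan(coeffs, step=4))  # [13, 2, -2, 0, 1]
```

Sixteen coefficients become five small integers, and the eleven trailing zeros cost almost nothing to signal.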
A short example makes it concrete. Suppose the residual block looks like this (numbers represent the deviation between predicted and real pixel intensities, on a roughly −128 to +127 scale):
12 8 4 2 1 0 0 0
8 6 3 1 0 0 0 0
4 3 2 1 0 0 0 0
2 1 1 1 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
After DCT, the energy concentrates into the top-left corner. Quantising the resulting coefficients with a typical quality scale rounds most of them to zero, leaving perhaps 4 non-zero values: the DC coefficient (about 7.6 before quantisation with an orthonormal DCT, since the block sums to 61 and 61 ÷ 8 ≈ 7.6) plus three small AC coefficients. The original 64-pixel block (512 bits if encoded raw) is now described by a prediction-mode index, four small numbers, and a "rest are zero" flag, usually fewer than 30 bits. The compression ratio for this single block is roughly 17:1, and it came entirely from spatial redundancy, before any other frame was consulted.
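If you want to check the example yourself, the sketch below runs a plain floating-point DCT over the block shown above. It assumes orthonormal scaling, under which the DC coefficient is the block sum divided by 8; real codecs use integer approximations with their own scaling, so the absolute numbers will differ. The quantisation step of 8 is an arbitrary illustrative value.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (direct, unoptimised form)."""
    n = len(block)

    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out

residual = [
    [12, 8, 4, 2, 1, 0, 0, 0],
    [8, 6, 3, 1, 0, 0, 0, 0],
    [4, 3, 2, 1, 0, 0, 0, 0],
    [2, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

coeffs = dct_2d(residual)
step = 8  # illustrative quantisation step, not taken from any codec
quantised = [[round(c / step) for c in row] for row in coeffs]
nonzero = sum(1 for row in quantised for q in row if q != 0)
print(f"DC before quantisation: {coeffs[0][0]:.3f}")
print(f"non-zero coefficients after quantisation: {nonzero} of 64")
```

Changing `step` shows the tradeoff directly: a larger step leaves fewer non-zero coefficients and a coarser reconstruction.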
How many bits each tool actually saves
If you read the encoder logs from a real H.265 or AV1 run, a rough rule of thumb is:
| Source of saving on an I-frame | Approximate share of compression earned |
|---|---|
| Intra prediction (mode selection + residual) | 50–65% |
| Transform + quantisation (energy compaction) | 25–35% |
| Entropy coding (CABAC, range coding) | 8–15% |
| Block-size adaptation and partition | 5–10% |
The numbers shift with content (flat content leans on intra prediction, textured content leans on the transform), but on average, intra prediction is the workhorse and the transform is the cleanup. Signalling the prediction mode itself is not free: research on HEVC measures that intra-mode signalling consumes about 8–12% of the bits in an intra-coded frame. 9 Encoders therefore use a "Most Probable Modes" trick: the three modes most likely to occur (based on the modes of neighbouring blocks) get short codes (in HEVC, a one-bit flag plus a one- or two-bit index, before arithmetic coding), and only the unusual modes pay the longer code (the flag plus five bits).
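A rough sketch of why the Most Probable Modes list pays off, modelled on HEVC's three-entry list. The counts are symbols before arithmetic coding, and the MPM list here is supplied by hand rather than derived from neighbouring blocks, so treat both the function and the sample mode sequence as illustrative assumptions.

```python
def mode_signalling_bins(mode, mpm_list):
    """Symbols needed to signal one intra mode with a 3-entry MPM list."""
    if mode in mpm_list:
        index = mpm_list.index(mode)
        # 1 flag ("mode is in the list") + index coded as "0", "10" or "11".
        return 1 + (1 if index == 0 else 2)
    # 1 flag + a 5-bit code covering the remaining 32 of the 35 modes.
    return 1 + 5

# HEVC mode numbers: 0 = planar, 1 = DC, 26 = vertical.
mpm_list = [0, 1, 26]
# Hypothetical run of blocks in a smooth region: mostly planar and DC.
modes_used = [0, 0, 1, 26, 10, 0]

with_mpm = sum(mode_signalling_bins(m, mpm_list) for m in modes_used)
fixed_length = 6 * len(modes_used)  # 6 bits are needed to index 35 modes directly
print(with_mpm, fixed_length)  # 18 36
```

The one unusual mode (10) costs more than it would with a fixed-length code, but the common modes more than pay that back.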
A common mistake when reading a codec spec is to treat the long list of intra modes as the source of compression. The modes themselves do not compress anything; they only re-shape the problem so the transform can finish the job cheaply. A codec with one hundred prediction modes and a bad transform would still compress badly. The two stages work together, and changing either one alone has limited effect.
How each codec generation evolved the toolbox
The progression is easiest to read as a table. Every row is one codec generation; every column is one knob the standard turned.
| Codec | Year | Max block | Intra modes | Transform | Chroma trick | Notable extras |
|---|---|---|---|---|---|---|
| MPEG-2 | 1995 | 16×16 MB | DC only (no directional intra) | 8×8 DCT, fixed | — | Intra blocks coded almost like JPEG |
| H.264 / AVC | 2003 | 16×16 MB | 9 (for 4×4) + 4 (16×16) | 4×4 integer DCT | — | I-PCM fallback for pathological blocks |
| H.265 / HEVC | 2013 | 64×64 CTU | 33 angular + planar + DC = 35 | 4×4 to 32×32 integer DCT, 4×4 DST | — | Constrained intra prediction, MPM list |
| VP9 | 2013 | 64×64 SB | 10 (8 dir + DC + TM) | 4×4 to 32×32 ADST/DCT | — | Frame-level adaptive transform |
| AV1 | 2018 | 128×128 SB | 56 directional + DC + 3 Smooth + Paeth + CfL = 62 | 4×4 to 64×64, ADST + DCT + IDTX + FLIPADST | Chroma-from-Luma | Recursive partitioning, Wedge intra (combined with inter) |
| H.266 / VVC | 2020 | 128×128 CTU | 65 angular + planar + DC = 67 plus MIP | 4×4 to 64×64, DCT-II + DST-VII + DCT-VIII | Cross-Component Linear Model (CCLM) | MIP (matrix-based intra), MRL, ISP |
Two patterns are visible in the table. First, blocks get bigger and prediction grids get finer at the same time. A 128×128 superblock could not carry a useful single prediction if there were only 9 angles to choose from; it needs 56 or 67 to describe the variety of edge orientations a block that large is likely to contain. Second, chroma prediction gets cleverer. Early codecs treated the chroma channels almost as an afterthought. AV1's Chroma-from-Luma and VVC's CCLM both exploit the fact that the chroma channels (the colour) and the luma channel (the brightness) of the same block are not independent — sharp edges in luma usually mean sharp edges in chroma, and the encoder can save bits by sending the chroma block as a linear function of the already-decoded luma block. 10 11
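The "chroma as a linear function of luma" idea can be sketched with an ordinary least-squares fit. This is a generic illustration, not the exact procedure of AV1's Chroma-from-Luma or VVC's CCLM (both work on subsampled luma and signal or derive their parameters in codec-specific ways); the sample values and function name are hypothetical.

```python
def fit_linear_chroma(luma, chroma):
    """Least-squares fit of chroma ~ alpha * luma + beta over one block."""
    n = len(luma)
    mean_l = sum(luma) / n
    mean_c = sum(chroma) / n
    num = sum((l - mean_l) * (c - mean_c) for l, c in zip(luma, chroma))
    den = sum((l - mean_l) ** 2 for l in luma)
    alpha = num / den if den else 0.0  # flat luma: fall back to a constant
    beta = mean_c - alpha * mean_l
    return alpha, beta

# Hypothetical block with a sharp edge in luma that chroma follows exactly.
luma = [60, 62, 64, 120, 122, 124, 126, 128]
chroma = [70 + 0.5 * l for l in luma]

alpha, beta = fit_linear_chroma(luma, chroma)
predicted = [alpha * l + beta for l in luma]
residual = [c - p for c, p in zip(chroma, predicted)]
print(alpha, beta, max(abs(r) for r in residual))
```

When chroma tracks luma this closely, two parameters replace a whole block of chroma samples and the residual is essentially zero. Real blocks are noisier, so the residual shrinks rather than vanishes.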
A useful way to look at the table: every column is one knob that a future codec can still turn. AV2 and the early neural-codec literature both turn most of these knobs further — more prediction modes, learned predictors, smarter chroma — without changing the underlying problem statement. The problem statement has been the same since H.261 in 1988: remove the spatial redundancy from each frame before anything else.
A common pitfall: "more modes mean better quality"
When teams compare codecs at the spec level, they often conclude the codec with more intra modes will always produce better quality. That instinct is half right and half wrong.
The half that is right: at the same bitrate, a codec with more prediction shapes can usually find a closer match to any given block, so the residual is smaller, so the transform output is smaller, so the file is smaller. AV1 and VVC do beat H.264 and H.265 on like-for-like content.
The half that is wrong: the extra modes are not free. The encoder has to try them all (or use heuristics to skip most of them), which costs encoder CPU. The decoder has to dispatch on the mode for every block, which costs decoder CPU and memory bandwidth. And every mode that gets used has to be signalled to the decoder, which costs bits. Beyond some point, adding more modes stops earning compression and starts spending it. Codec designers measure this carefully, and most of the modes added past about 35 have very small individual contribution (often <0.2% BD-rate improvement per mode); they survive because the combination matters even if no single mode does.
The practical version of this pitfall is the engineer who picks a "slower" encoder preset, sees compute go up 4×, and finds the quality has gone up by 0.4 dB PSNR — and is disappointed because the slide deck said modern codecs are "much better" than older ones. The slide deck was right at the codec level; the disappointment is about the diminishing returns of mode-search at the preset level. (For the full mechanism, see mode decision and rate-distortion optimisation.)
Where the savings come from on a real frame
Take a sports broadcast. The pitch is one of the easiest things in video to encode: a uniform green that an intra encoder describes with a single DC mode and a residual of nearly zero. The crowd is the hardest: thousands of tiny textures with no useful directional structure. The scoreboard sits in between: sharp horizontal edges that a horizontal directional mode, which copies the left-hand reference pixels across each row, predicts almost exactly.
On a 1080p I-frame from the broadcast, the encoder will spend perhaps 8% of the available bits on the pitch (which is 50% of the picture area), 60% on the crowd (which is 30% of the picture area), and 32% on the scoreboard, faces, and rapidly moving objects (the remaining 20%). If the whole frame averages roughly 100:1 on a well-tuned AV1 encode, those shares put the pitch at around 625:1 (100 × 50 ÷ 8) and the crowd at around 50:1 (100 × 30 ÷ 60). The difference between "good" and "bad" intra coding shows up almost entirely in the hardest 30% of the picture; the easy 70% is solved by every codec back to MPEG-2.
This is also where a senior engineer earns their salt: tuning the encoder so that the rare hard blocks get more bits than the average and the easy blocks get less. Rate-control algorithms (see rate control: CBR, VBR, CRF) and adaptive quantisation are how those tradeoffs are made.
Where Fora Soft fits in
Fora Soft has shipped 239+ video projects since 2005 across video streaming, video conferencing, OTT, surveillance, e-learning, telemedicine, and AR/VR. The piece our team touches most often is the encoder configuration that controls intra coding: keyframe interval, intra refresh patterns, profile and level negotiation in WebRTC sessions, and the I-frame placement that determines how quickly a stream can recover from a lost packet on a flaky network. The right intra-coding configuration is not the same for a sports stream, a telemedicine consultation, and a 1,000-participant webinar — and getting it wrong shows up as either bloated bills (too many I-frames) or visible decode errors (too few). When we hand off a streaming pipeline, the intra-coding profile is one of the parts we tune by content type rather than by guess.
A short walk through one block
To make all of the above concrete, walk through what happens to a single 16×16 block of a fairly flat patch of sky in an HEVC encode.
The encoder partitions a 32×32 region of mostly-blue sky into four 16×16 blocks, then notices each one is nearly uniform and keeps them as 16×16 rather than splitting further. It tries all 35 intra prediction modes on the first 16×16 block. The planar mode produces a residual whose maximum absolute value is 3 (on a 0–255 scale); the next-best angular mode produces a max of 8. Planar wins. The encoder stores: prediction-mode index = 0 (planar in HEVC's numbering; DC is 1), and the 16×16 residual.
The 16×16 residual goes into a 16×16 integer DCT. Of the 256 output coefficients, the DC is 2.4 (the small mean error of the prediction), three AC coefficients near the top-left are between 0.5 and 1.0, and the remaining 252 coefficients round to zero after quantisation at a typical quality setting. The block writes to the bitstream as: 5 bits of mode signalling (a "most probable mode" coding), 6 bits of DC coefficient, 12 bits of three AC coefficients, plus a few flag bits. Total: roughly 30 bits for a 16×16×8 = 2,048-bit raw block.
The compression ratio on this block is 2,048 ÷ 30 ≈ 68:1, and it was all bought by spatial redundancy. Multiply that by every block in a sky-heavy frame and you understand why an I-frame of a beach scene costs a fraction of an I-frame of a confetti shot.
What to read next
- Temporal pixel correlation: redundancy between frames
- Hybrid video codec architecture
- Intra-frame coding: how a single frame is compressed
References
1. ScienceDirect Topics, Spatial Redundancy. https://www.sciencedirect.com/topics/computer-science/spatial-redundancy (accessed 2026-05-16).
2. Petrov, Y. and Zhaoping, L. (2003), Local correlations, information redundancy, and sufficient pixel depth in natural images, Journal of the Optical Society of America A, 20(1), 56–66. https://opg.optica.org/josaa/abstract.cfm?uri=josaa-20-1-56
3. Sullivan, G. J., Ohm, J.-R., Han, W.-J. and Wiegand, T. (2012), Overview of the High Efficiency Video Coding (HEVC) Standard, IEEE Trans. Circuits Syst. Video Technol., 22(12), 1649–1668; Chen, Y. et al. (2018), An Overview of Core Coding Tools in the AV1 Video Codec, AOMedia. https://www.jmvalin.ca/papers/AV1_tools.pdf
4. ITU-T H.264 (2003 and later editions), Advanced video coding for generic audiovisual services, Section 8.3 Intra prediction process. https://www.itu.int/rec/T-REC-H.264
5. ITU-T H.265 (2013 and later editions), High Efficiency Video Coding, Section 8.4 Intra prediction process; Elecard, Spatial (Intra) prediction in HEVC. https://www.elecard.com/page/spatial_intra_prediction_in_hevc
6. AOMedia (2020), AV1 Bitstream & Decoding Process Specification, Section 7.11 Intra prediction; Chen, Y. et al., A Technical Overview of AV1, arXiv:2008.06091. https://arxiv.org/abs/2008.06091
7. Bross, B. et al. (2021), Overview of the Versatile Video Coding (VVC) Standard and its Applications, IEEE Trans. Circuits Syst. Video Technol., 31(10), 3736–3764; ITU-T H.266 (2020). https://ieeexplore.ieee.org/document/9402788
8. ScienceDirect Topics, Energy Compaction; Khayam, S. A. (2003), The Discrete Cosine Transform (DCT): Theory and Application, Michigan State University Technical Report 802. https://www.cse.iitd.ac.in/~pkalra/col783-2017/DCT-TR802.pdf
9. Lainema, J. and Ugur, K. (2012), Improved intra mode signaling for HEVC, MERL Technical Report TR2012-035. https://www.merl.com/publications/docs/TR2012-035.pdf
10. Trudeau, L. N., Egge, N. E. and Barr, D. (2018), Predicting Chroma from Luma in AV1, in Proc. Data Compression Conference (DCC). https://arxiv.org/abs/1711.03951
11. Bross, B. et al. (2021), Overview of the Versatile Video Coding (VVC) Standard and its Applications, IEEE Trans. Circuits Syst. Video Technol., Section IV.C Cross-component prediction. https://ieeexplore.ieee.org/document/9402788


