Key Scientific Breakthroughs Behind Video Codecs: Informatio

Published 2026-05-16 · 21 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If you fund, run, or sell anything that touches video — a streaming service, a video-conferencing product, a surveillance platform, an OTT app — you spend money on bitrate every day. Every megabit you save per stream pays you back in CDN bills, storage, mobile-data charges, and battery life on the user's device. The compression you get depends on three scientific ideas the entire industry has been refining since the 1940s, and engineers will reference them in every codec meeting you sit in. After reading this article you'll know what entropy actually measures, why "prediction" is a single word that hides decades of research, why the DCT is in almost every photo and video you've ever seen, and why neural networks are now the next chapter of the same story. You don't need any prior background. We define every term in plain language before using it, and we walk the math out loud the first time it appears.

The Three Ideas in One Picture

Before the history, the map. Every modern video codec stacks the same three big ideas on top of each other in the same order.

[Figure: stacked diagram of the three pillars under any modern video codec, with information theory at the foundation, prediction in the middle, transform on top, and a thin label band for entropy coding and quantization above them.]

Figure 1. The three scientific breakthroughs that every video codec since 1988 sits on. Information theory defines the prize. Prediction and transform are the two engines that chase it. Quantization and entropy coding are the gears that connect the engines to the bitstream.

The first pillar, information theory, is the rule book. It tells the encoder how small a file is allowed to get for a given amount of acceptable damage to the picture. Without information theory, an engineer cannot tell whether a new codec is genuinely clever or whether the gain came from somewhere else. With it, every benchmark — bits per pixel, BD-rate, VMAF at fixed bitrate — has a meaning.

The second pillar, prediction, is the trick of describing one piece of video as a small correction to something the decoder already knows. "Already knows" is the load-bearing phrase. The decoder might know the pixel immediately to the left (intra-frame prediction), or the same patch as it looked in the previous frame (inter-frame prediction), or a weighted average of the patch in two surrounding frames (B-frame prediction). Every successful video codec since H.261 in 1988 puts prediction first, because frames of real video are almost never independent.

The third pillar, transform, is the trick of rewriting a block of pixels (or, more usefully, a block of prediction errors) as a list of frequency numbers. Most of those numbers turn out to be small or zero, and the human eye is forgiving of small errors in the rest. Throw the small ones away with quantization, then squeeze the survivors with entropy coding, and you have the back half of a working video codec.

The rest of this article walks through how each pillar was discovered, what it actually does, and where the field is going next.

Pillar One — Information Theory: The Mathematical Floor (1948–1959)

In 1948 a Bell Labs engineer named Claude Elwood Shannon published a paper in two parts (July and October) of the Bell System Technical Journal called A Mathematical Theory of Communication. ¹ The paper invented a whole field, information theory, and gave everyone after him the language we still use to talk about compression. Two of its results matter for video:

The source coding theorem. Every source of messages — text, audio, pixels — has a hard lower bound on how many bits you need to encode it without loss. That bound is called the source's entropy, written H, and it's measured in bits per symbol. Shannon proved that you can get arbitrarily close to H, and that you cannot do better.
Rate–distortion theory. If you're willing to accept D units of damage to the original signal, you can do better than the lossless floor. Specifically, there is a function R(D) — the rate–distortion function — that gives the smallest possible bit rate for a chosen distortion budget. Shannon sketched this in 1948 and formalised it in 1959. ²

We'll unpack each in plain language because the rest of the article rests on both.

Entropy in one paragraph (and the math, out loud)

Entropy is a measure of surprise. If a symbol always carries the same value — a coin that always lands heads — it carries no information, and its entropy is 0 bits. If it has many equally likely values — a fair 256-sided die — it carries a lot, and its entropy is high. Shannon's formula for a source that emits symbol i with probability pᵢ is:

H = − Σᵢ pᵢ × log₂(pᵢ) bits per symbol.

A fair coin: H = −(0.5 × log₂ 0.5 + 0.5 × log₂ 0.5) = −(−0.5 − 0.5) = 1 bit per flip. A coin that lands heads 99% of the time: H = −(0.99 × log₂ 0.99 + 0.01 × log₂ 0.01) ≈ −(−0.0144 − 0.0664) ≈ 0.081 bits per flip. The biased coin is twelve times more compressible than the fair one, because it's twelve times less surprising.
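The arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `entropy_bits` is ours, chosen for illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol."""
    # Symbols with p == 0 contribute nothing (p * log2(p) tends to 0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.99, 0.01])
fair_die_256 = entropy_bits([1 / 256] * 256)

print(f"fair coin:      {fair_coin:.3f} bits per flip")
print(f"biased coin:    {biased_coin:.3f} bits per flip")
print(f"256-sided die:  {fair_die_256:.3f} bits per roll")
print(f"fair / biased:  {fair_coin / biased_coin:.1f}x")
```

The 256-sided die comes out at 8 bits per roll, which is why an uncompressed 8-bit pixel is the worst case: it is what you pay when every value is equally likely.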

Pixel values in real video are not fair coins. The pixel next to this one is almost certainly very close in value. The block of pixels you're looking at right now is almost certainly very close to the same block one frame ago. Both facts make the entropy of video far lower than the raw pixel count suggests, and that gap is the room every compressor lives in.

Rate–distortion in one paragraph

For lossy compression — the kind every video codec does — Shannon's other gift was rate–distortion theory. Given a chosen distortion measure (mean squared error is the textbook example, perceptual scores like SSIM and VMAF are the modern ones) and a tolerated distortion D, the rate–distortion function R(D) is the smallest possible bit rate that can still keep the average distortion at or below D. The curve R(D) is monotonically decreasing: accept more damage, send fewer bits. Every real codec lives somewhere above its theoretical rate–distortion curve; the gap between "where we are" and "where Shannon says we could be" is the headroom for the next codec generation.
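One case where R(D) has a closed form is the textbook memoryless Gaussian source under mean squared error: R(D) = ½ log₂(σ² / D) for D below the variance σ², and zero otherwise. Real video is not a memoryless Gaussian source, so the sketch below only illustrates the shape of the curve; the variance value is a made-up number.

```python
import math

def gaussian_rate_bits(variance, distortion):
    """R(D) in bits per sample for a memoryless Gaussian source under MSE."""
    if distortion <= 0:
        raise ValueError("distortion must be positive")
    if distortion >= variance:
        return 0.0  # sending nothing already meets the distortion budget
    return 0.5 * math.log2(variance / distortion)

variance = 100.0  # hypothetical source variance
for d in (100.0, 25.0, 6.25, 1.5625):
    rate = gaussian_rate_bits(variance, d)
    print(f"D = {d:8.4f}  ->  R(D) = {rate:.2f} bits per sample")
```

Each time the tolerated mean squared error is cut to a quarter, the floor rises by exactly one bit per sample. That is the "accept more damage, send fewer bits" trade in its simplest form.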

[Figure: schematic rate-distortion curve showing the Shannon lower bound, a 2000-era codec operating well above it, and modern AV1 closing most of the gap. X-axis distortion, Y-axis bitrate, with three labelled points marking JPEG, H.264, AV1.]

Figure 2. The rate–distortion view of forty years of codec progress. The black curve is the theoretical Shannon floor for a generic video source. The coloured dots are real codec operating points at matched perceptual quality; each new generation moves the dot down and to the left, closer to the floor.

Why this changed video coding

Information theory gave the industry three things it lacked before 1948:

A way to define the prize: the smallest possible file. Without entropy, every "we compressed it more" claim was an opinion.
A way to score work: bits per pixel and BD-rate (the Bjøntegaard delta-rate, which reports the average bitrate difference between two codecs' rate–distortion curves at equal quality) both come straight from rate–distortion theory.
A way to decide what's worth doing next: the further the current codec sits above its theoretical floor, the more room there is for the next generation. As we discuss later, that gap has narrowed a lot, and the question of how close we now are to the floor is the central question for codecs in 2026.

The connection from Shannon to the bitstream goes through a piece of plumbing called entropy coding — Huffman coding (1952), arithmetic coding (1976–1981), and Context-Adaptive Binary Arithmetic Coding or CABAC (2003). We treat that plumbing in depth in Entropy coding: a short introduction (Huffman, arithmetic, CABAC). For this article, hold the headline: entropy coding is the last step that actually turns a stream of numbers into the smallest possible string of bits. Everything else in a codec is preparation for that step.
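To make the link between entropy and entropy coding concrete, here is a small sketch that builds Huffman code lengths and compares the average code length with the entropy. The symbol alphabets and probabilities are invented for illustration, and the helper only tracks code lengths, not the actual bit patterns.

```python
import heapq
import math

def huffman_code_lengths(probabilities):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols under this node).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probabilities}
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

def average_length(probabilities):
    lengths = huffman_code_lengths(probabilities)
    return sum(p * lengths[s] for s, p in probabilities.items())

def entropy_bits(probabilities):
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)

dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"heads": 0.99, "tails": 0.01}

for name, probs in (("dyadic", dyadic), ("skewed", skewed)):
    print(f"{name}: Huffman {average_length(probs):.3f} bits/symbol, "
          f"entropy {entropy_bits(probs):.3f} bits/symbol")
```

When every probability is a power of two, Huffman coding meets the entropy exactly. On the skewed coin it cannot spend less than one whole bit per symbol, which is far above the 0.081-bit floor. Closing that gap is what arithmetic coding, and later CABAC, were built for.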

Pillar Two — Prediction: Don't Send What the Decoder Can Guess (1969–1981)

The second breakthrough is the simplest to say and the hardest to do well. Prediction is the trick of describing a new piece of video as a small correction to something the decoder already has. The bigger the correction, the more bits it costs. The whole game is to make the prediction as good as possible so that the correction (called the residual) is as close to zero as possible.

From sending pixels to sending differences (1952–1969)

Before prediction, every pixel was sent as a number from 0 to 255. The 1952 textbook trick was differential pulse-code modulation (DPCM) — send the difference between each pixel and a prediction made from the pixels just to its left and above. In a smooth region of an image, those differences are small and clustered near zero, so they compress better than raw pixel values. DPCM was the first commercially successful predictive coder and it powered early digital still-image compression for decades.
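A minimal lossless DPCM sketch, assuming the simplest possible predictor (the pixel immediately to the left) and a synthetic smooth row of pixels. The function names are ours; real DPCM systems also used the pixel above and quantized the differences.

```python
import math
from collections import Counter

def dpcm_encode(row):
    """Replace each pixel with its difference from the pixel to its left."""
    residuals = [row[0]]  # the first pixel has no left neighbour: send it as-is
    for i in range(1, len(row)):
        residuals.append(row[i] - row[i - 1])
    return residuals

def dpcm_decode(residuals):
    """Rebuild the row by accumulating the differences."""
    row = [residuals[0]]
    for r in residuals[1:]:
        row.append(row[-1] + r)
    return row

def empirical_entropy(values):
    """Entropy in bits per value, estimated from the value histogram."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

row = list(range(100, 116))  # a smooth brightness ramp, 16 pixels
residuals = dpcm_encode(row)

print("residuals:", residuals)
print(f"raw entropy:      {empirical_entropy(row):.3f} bits per pixel")
print(f"residual entropy: {empirical_entropy(residuals):.3f} bits per pixel")
assert dpcm_decode(residuals) == row  # lossless round trip
```

The raw ramp has sixteen distinct values, so its histogram looks maximally surprising. After prediction almost every residual is the same small number, and the entropy drops accordingly.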

The first really interesting prediction in video came in 1969 when F. W. Mounts at Bell Labs proposed conditional replenishment. ³ His idea: most of the pixels in a typical frame haven't changed since the last frame, so don't send them at all. For each picture element (or, in later block-based systems, each block), the encoder makes a binary decision: "changed" or "unchanged". Unchanged regions are skipped entirely; the decoder just leaves them at the value they already have. Changed regions are sent. This is the first time anyone explicitly used temporal redundancy in video, and the bit savings on a typical talking-head shot were dramatic.

The weakness of conditional replenishment is also obvious: if the camera or the subject moves at all, almost every block becomes "changed", and the compression collapses. To handle motion you need something cleverer.
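Both the strength and the weakness show up in a toy block-based version. The sketch below works on one-dimensional "frames" with 4-sample blocks to keep it short; the sample values, block size, and function names are all illustrative assumptions rather than anything from Mounts' system.

```python
def replenish(previous, current, block_size=4, threshold=0):
    """Encoder side: emit (block_index, samples) only for changed blocks."""
    update = []
    for start in range(0, len(current), block_size):
        prev_block = previous[start:start + block_size]
        cur_block = current[start:start + block_size]
        difference = sum(abs(a - b) for a, b in zip(prev_block, cur_block))
        if difference > threshold:
            update.append((start // block_size, cur_block))
    return update

def apply_update(previous, update, block_size=4):
    """Decoder side: keep old blocks, overwrite the ones that were sent."""
    frame = list(previous)
    for index, samples in update:
        start = index * block_size
        frame[start:start + len(samples)] = samples
    return frame

previous = [10, 12, 15, 11, 40, 42, 45, 41, 70, 72, 75, 71, 90, 92, 95, 91]

# Talking head: one sample changes, everything else is static.
talking_head = list(previous)
talking_head[5] = 50

# Camera pan: the whole scene shifts right by one sample.
pan = [previous[0]] + previous[:-1]

print("blocks sent, talking head:", len(replenish(previous, talking_head)))
print("blocks sent, camera pan:  ", len(replenish(previous, pan)))
assert apply_update(previous, replenish(previous, talking_head)) == talking_head
```

A one-sample pan is enough to mark every block as changed, even though the content is identical apart from its position.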

Motion compensation (1969–1981)

The cleverer thing arrived in three waves.

Wave one — pel-recursive motion estimation. Through the 1970s, A. N. Netravali and J. D. Robbins at Bell Labs worked out how to estimate the displacement of each pixel from one frame to the next, then use the displaced pixel from the previous frame as a prediction for the current one. ⁴ Their 1979 paper "Motion-Compensated Television Coding" showed that motion compensation could roughly halve the bit rate against pure conditional replenishment, and it gave the field the name motion-compensated prediction that we still use today.

Wave two — block matching. In 1981 J. R. Jain and A. K. Jain published "Displacement Measurement and Its Application in Interframe Image Coding" in IEEE Transactions on Communications. ⁵ Instead of estimating a separate displacement for every pixel, they cut the frame into small rectangular blocks (typically 16 × 16 pixels) and found a single best-matching displacement for each whole block by searching a window of the previous frame. That displacement is called a motion vector. The "best match" is whichever candidate in the search window has the smallest sum of absolute differences from the current block. The same year, T. Koga and colleagues independently described essentially the same algorithm. Block-matching motion estimation is the architecture every standard codec from H.261 (1988) onwards uses today.

The reason block matching won is engineering, not math. A pel-recursive estimator can in principle give a more accurate displacement field than block matching, but it has to be re-run at the decoder, which is expensive. With block matching, the encoder does all the searching, sends the few small motion vectors it found, and the decoder does the cheap part — copy a 16×16 region from the previous frame at the displacement the encoder told it. The encoder–decoder asymmetry that the industry calls "make the encoder work hard so the decoder doesn't have to" was set in this period and has not moved since.
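The asymmetry is easy to see in code. Below is a sketch of full-search block matching with the sum of absolute differences (SAD) on the encoder side and a plain block copy on the decoder side. The synthetic texture, frame size, and search range are assumptions picked so the example has one unambiguous answer; production encoders use much faster search strategies.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))

def full_search(current, reference, x, y, size, search_range):
    """Encoder side: try every displacement in the window, keep the best."""
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the reference frame
            cost = sad(current, x, y, reference, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best  # (sad, dx, dy)

def motion_compensate(reference, x, y, dx, dy, size):
    """Decoder side: copy the block the motion vector points at."""
    return [row[x + dx:x + dx + size]
            for row in reference[y + dy:y + dy + size]]

width, height = 12, 12
reference = [[(7 * x + 13 * y) % 31 for x in range(width)] for y in range(height)]
# Current frame: the same texture moved 2 pixels right and 1 pixel down.
current = [[reference[y - 1][x - 2] if x >= 2 and y >= 1 else 0
            for x in range(width)] for y in range(height)]

best_sad, dx, dy = full_search(current, reference, x=4, y=4, size=4, search_range=3)
predicted = motion_compensate(reference, 4, 4, dx, dy, 4)
actual = [row[4:8] for row in current[4:8]]

print("motion vector:", (dx, dy), "SAD:", best_sad)
print("prediction matches block:", predicted == actual)
```

The encoder evaluates 49 candidate positions for this one block; the decoder performs a single copy. The motion vector points back to where the content came from in the reference frame, which is why it is the negative of the scene's movement.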

Wave three — sub-pixel precision and B-frames. Two refinements followed. Sub-pixel motion vectors (half-pel in MPEG-1 and MPEG-2, quarter-pel in MPEG-4 ASP and H.264, 1/8-pel in AV1, 1/16-pel in VVC) let the encoder describe motion that doesn't land on an integer pixel grid by interpolating between neighbouring pixels in the reference frame. H.261 itself used integer-pel vectors only. Bidirectional prediction (the B in B-frames) lets the encoder predict a block from a weighted blend of one past and one future reference frame, which is especially powerful when something is briefly occluded or smoothly fading. We explain B-frames in depth in GOP structure: I, P, B-frames, open vs closed GOP; for here, hold the headline: a well-placed B-frame costs roughly 30–40% fewer bits than the equivalent P-frame.

Intra prediction: predict from the same frame, not the previous one

A late but important addition is intra-frame prediction. The first frame of a video, or the first frame of a new scene, has no previous frame to predict from. The first generation of codecs simply transformed each block raw, with no prediction at all. H.264 in 2003 added an important new idea: even inside a single frame, the block you're about to encode is almost certainly similar to the pixels along its top edge and along its left edge, because those pixels were already decoded and are available. So you can predict the current 4 × 4 or 8 × 8 block as a directional copy of those neighbouring pixels — straight down, diagonally, horizontally, and so on — and only encode the residual.

H.264 offered 9 intra prediction modes per 4×4 block. HEVC raised it to 35 modes per block (33 directional plus planar and DC), AV1 has 56 directional modes plus several non-directional ones, and AV2's developing draft has roughly 80. Each new generation has spent silicon and encoder time chasing better predictions, because a better-matched mode leaves fewer residual bits to spend.
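As a sketch of the idea, the code below builds three of H.264's nine 4×4 intra modes (vertical, horizontal, and DC) from the already-decoded neighbours and picks the one with the smallest sum of absolute differences. The pixel values are invented, and a real encoder would choose by rate-distortion cost rather than SAD alone.

```python
def intra_candidates(top, left):
    """Predictions built only from already-decoded neighbouring pixels."""
    n = len(top)
    dc = (sum(top) + sum(left) + n) // (2 * n)  # rounded mean of the neighbours
    return {
        "vertical": [list(top) for _ in range(n)],        # copy the top row downwards
        "horizontal": [[left[j]] * n for j in range(n)],  # copy the left column across
        "dc": [[dc] * n for _ in range(n)],               # flat block at the mean
    }

def sad(block, prediction):
    return sum(abs(b - p)
               for block_row, pred_row in zip(block, prediction)
               for b, p in zip(block_row, pred_row))

def best_intra_mode(block, top, left):
    candidates = intra_candidates(top, left)
    mode = min(candidates, key=lambda name: sad(block, candidates[name]))
    prediction = candidates[mode]
    residual = [[b - p for b, p in zip(block_row, pred_row)]
                for block_row, pred_row in zip(block, prediction)]
    return mode, residual

# Hypothetical 4x4 block with vertical stripes, plus its decoded neighbours.
block = [[12, 49, 11, 50],
         [11, 50, 10, 51],
         [12, 50, 9, 50],
         [10, 51, 10, 49]]
top = [10, 50, 10, 50]
left = [11, 12, 10, 11]

mode, residual = best_intra_mode(block, top, left)
print("chosen mode:", mode)
print("residual:", residual)
```

The striped block is badly served by the horizontal and DC modes, but the vertical mode leaves a residual of single-digit values, which is what the transform and entropy coder then have to deal with.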

[Figure: side-by-side comparison. Left panel shows pure conditional replenishment with a moving figure causing most blocks to be re-sent; right panel shows motion-compensated prediction where only small motion vectors and small residuals are sent for the same scene.]

Figure 3. Two ways to handle a moving figure walking across a static background. Conditional replenishment (left) has to re-send almost the whole frame because almost every block has changed by at least one pixel. Motion-compensated block matching (right) sends only a small set of motion vectors and the tiny residual where the prediction doesn't quite match. The bit-count difference is roughly 5–10× in favour of motion compensation on a typical sequence.

Why prediction matters so much

For the average video sequence, the residual — the difference between a frame and the encoder's best guess of it — has an entropy that is roughly an order of magnitude lower than the entropy of the raw pixel values. That means the rest of the codec (transform, quantization, entropy coding) is working on a signal that is already 10× smaller in information content than the raw frame. Everything else in the encoder is amplifying the savings prediction already made.

This is why every generation of every codec spends most of its new transistors and standardisation pages on prediction. H.264's intra prediction was a big jump over MPEG-4 ASP. HEVC's expanded directional intra modes and larger inter blocks gave it most of its 50% bit-rate saving against H.264. AV1's compound prediction (blending two motion-compensated references with masks and wedges) is one of the biggest sources of its gain. Neural-network video codecs in 2026, like the DCVC family that originated at Microsoft Research Asia and the NVRC implicit-representation codec, explicitly model their architecture as "learned prediction first, learned compression of the residual second", because the same logic still wins.

Pillar Three — Transform: Repackage the Pixels So You Can Throw Most of Them Away (1947–1974)

The third pillar is the transform. The idea: take a small block of numbers (after prediction, those numbers are the residual — the prediction error) and rewrite them as a different list of numbers, with the property that most of the new numbers are close to zero and the few that aren't carry almost all of the original's "energy". Then quantize away the small ones and you've shed bits without shedding much visible content.

The KLT: theoretically optimal, practically unusable (1947)

The mathematics for this idea came out of statistics in the 1940s, with the Karhunen–Loève transform (KLT, sometimes called the Hotelling transform or just principal component analysis). For a given source — a 4 × 4 block of pixels, say — the KLT is the unique transform that decorrelates the block completely and packs the most energy into the fewest coefficients. In information-theoretic terms, the KLT is the optimal transform for any Gaussian source. ⁷

That sounds like the end of the story. It is not. The KLT has three problems that have kept it out of every shipping codec for eighty years:

It is data-dependent. The KLT basis vectors are eigenvectors of the source covariance matrix. If the source statistics change — different content, different prediction residual, different scene — you need a different KLT. So either the basis is sent with every block (large overhead) or the codec must agree on a fixed basis in advance, which makes the transform sub-optimal for any specific block.
There is no fast algorithm. Applied as dense matrix multiplies, a separable KLT on an N×N block costs O(N³) operations, and its basis has no exploitable structure. Compare the DCT, which on an N×N block costs O(N² log N) thanks to a fast algorithm structurally similar to the FFT.
Even on its home turf — first-order Markov sources — the asymptotic gain over the DCT is tiny. The DCT approaches the KLT to within a fraction of a decibel for image-like sources, and the gap closes further when you add quantization noise. ⁸

For the entire history of standard video coding, the KLT has been a benchmark to compare against, not a transform to ship. The transform that did ship was a much friendlier near-relative.

The DCT (1974): the workhorse that ate the world

In January 1974, IEEE Transactions on Computers published a four-page paper by Nasir Ahmed, T. Natarajan, and K. Ramamohan Rao called Discrete Cosine Transform. ⁶ Ahmed had been thinking about the problem since 1972; the practical algorithm was worked out with Natarajan, then his PhD student at Kansas State University, and Rao at the University of Texas at Arlington. The paper introduced what is now called the DCT-II, its inverse (the DCT-III), and an FFT-based fast algorithm that made the whole family computable in real time on the hardware of the day.

The DCT does for image blocks what the Fourier transform does for time signals: it rewrites the block as a weighted sum of cosine waves at different spatial frequencies. The first coefficient (the DC coefficient, in a deliberate borrowing from electrical engineering) is the average brightness of the block. The remaining 63 coefficients in an 8 × 8 block describe the higher-frequency wiggles — first-order spatial gradients, then finer texture, then very fine texture in the bottom-right corner.

For natural images, three useful things happen at once:

Energy compaction. Real photo-realistic blocks are smooth on average, so almost all the energy lands in the top-left handful of coefficients. The bottom-right coefficients are usually tiny.
Decorrelation. The DCT coefficients of a natural-image block are nearly uncorrelated with one another. That means each coefficient can be quantized and coded almost independently, and quantization noise spreads out cleanly instead of correlating with the original signal in unpleasant ways.
Perceptual alignment. The human visual system is much more sensitive to low frequencies (slow brightness changes across a wide area) than to high frequencies (texture). That means you can quantize the high-frequency DCT coefficients much more aggressively than the low-frequency ones and your eye won't see most of the damage. JPEG's famous quantization tables encode exactly this perceptual asymmetry.
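Energy compaction can be checked numerically. The sketch below implements an orthonormal 8×8 DCT-II as two plain matrix multiplies (no fast algorithm) and applies it to a synthetic smooth block; the block contents and the quantization step of 16 are assumptions chosen for illustration, not values from any standard.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k is a cosine at spatial frequency k."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct2(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))  # transform columns, then rows

def idct2(coeffs):
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)

# A smooth synthetic block: brightness rises gently to the right and downwards.
block = [[100 + 4 * x + 2 * y for x in range(8)] for y in range(8)]
coeffs = dct2(block)

total_energy = sum(v * v for row in coeffs for v in row)
corner_energy = sum(coeffs[v][u] ** 2 for v in range(2) for u in range(2))
nonzero_after_quant = sum(1 for row in coeffs for v in row if round(v / 16) != 0)

restored = idct2(coeffs)
max_error = max(abs(restored[y][x] - block[y][x])
                for y in range(8) for x in range(8))

print(f"DC coefficient: {coeffs[0][0]:.1f}")
print(f"energy in top-left 2x2: {100 * corner_energy / total_energy:.2f}%")
print(f"non-zero coefficients after quantizing with step 16: {nonzero_after_quant} of 64")
print(f"round-trip error without quantization: {max_error:.2e}")
```

With this orthonormal scaling the DC coefficient is eight times the block's mean brightness. A block this smooth is an easy case; textured blocks spread their energy further from the top-left corner, which is exactly where the quantizer bites hardest.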

The DCT shows up in H.261 (1988) and in every block-based video standard that followed, and it became the basis of still-image coding with JPEG (finalised in 1992 ⁹). It is, by a wide margin, the most widely deployed signal-processing transform ever built.

The DCT's children: integer transforms, ADST, the AV2 family

Modern codecs no longer use the exact DCT of the 1974 paper. They use integer approximations of it, sized to match the codec's coding unit grid. Three reasons:

Integer transforms are exactly invertible. The original DCT uses irrational cosine values; round-trip floating-point arithmetic can introduce tiny drifts between encoder and decoder. Integer approximations eliminate that drift, which is essential when many decoders must produce bit-identical output from the same bitstream.
They run faster on every CPU. Integer multiplies are cheap, especially on the SIMD and matrix units that codec hardware lives on.
They can be co-designed with the codec's other tools — for example, by aligning the transform's basis vectors with the directions of the codec's intra prediction modes.

H.264 introduced the first integer DCT, a 4 × 4 design. HEVC added integer DCTs at 4, 8, 16, and 32 sizes, plus a 4 × 4 DST (Discrete Sine Transform) for intra-predicted luma residuals. AV1 introduced a whole family of transforms: in addition to the DCT, it can use the ADST (Asymmetric DST), the FLIPADST (flipped variant for residuals whose energy clusters at the bottom of a block), and IDTX (identity transform, useful when the residual is already sparse). The encoder picks the best transform per block via rate–distortion optimisation. VVC and the developing AV2 add even more transform options and larger block sizes.
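The exact-invertibility argument can be demonstrated with the 4 × 4 forward core transform matrix from H.264. One assumption to flag: real H.264 folds the per-coefficient scaling into its quantization tables and uses a slightly different inverse matrix, whereas this sketch undoes the scaling with exact fractions so the round trip can be checked directly. The residual values are invented.

```python
from fractions import Fraction

# H.264's 4x4 forward core transform: small integers, orthogonal rows.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def forward(block):
    """Integer-only forward transform: W = CF * X * CF^T."""
    return matmul(matmul(CF, block), transpose(CF))

def inverse(coeffs):
    """Exact inverse: divide out the squared row norms, then apply CF^T ... CF."""
    norms = [sum(v * v for v in row) for row in CF]  # 4, 10, 4, 10
    scaled = [[Fraction(coeffs[i][j], norms[i] * norms[j]) for j in range(4)]
              for i in range(4)]
    return matmul(matmul(transpose(CF), scaled), CF)

# Hypothetical 4x4 prediction residual.
residual = [[5, -3, 0, 1],
            [2, 2, -1, 0],
            [0, 1, 1, -2],
            [-4, 0, 3, 1]]

coeffs = forward(residual)
restored = inverse(coeffs)

print("all coefficients are integers:",
      all(isinstance(v, int) for row in coeffs for v in row))
print("round trip is bit-exact:", restored == residual)
```

Because the forward pass uses only integer adds and small multiplies, every conforming implementation produces the same coefficients, with no floating-point rounding for encoder and decoder to disagree about.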

The Karhunen–Loève transform is still the asymptotic ceiling against which all of this is compared. Fifty-two years after Ahmed, Natarajan and Rao's paper, the DCT and its near-relatives still sit within a fraction of a decibel of the ceiling, at a small fraction of the computational cost. That is why the DCT is, in coding folklore, the longest-running engineering deal in signal processing.

[Figure: side-by-side panels. Left, an 8x8 pixel block of a face showing smooth variation; middle, the same block after DCT showing energy clustered in the top-left; right, the same block after quantization with only 7 non-zero coefficients out of 64, demonstrating the energy compaction property of the DCT.]

Figure 4. The DCT's superpower in one picture. An 8×8 pixel block of a typical face region (left, 64 numbers) is transformed into 64 DCT coefficients (middle). Almost all the energy lands in the top-left corner. After quantizing the remaining coefficients to a typical JPEG-quality table, only 7 of the 64 numbers remain non-zero (right). Those 7 numbers compress to roughly 30 bits with arithmetic coding, versus 512 bits for the raw block.

How the Pieces Fit: The Hybrid Block Pipeline

Stack the three breakthroughs on top of each other in the right order and you have the hybrid block-based architecture that every standard codec from H.261 onwards has used. The encoder runs the steps in one order; the decoder runs them in reverse to put the picture back together.

[Figure: pipeline diagram showing the encoder loop: input block → prediction → residual → transform → quantization → entropy coding → bitstream, with a return arrow showing inverse quantization, inverse transform, reconstruction, and in-loop filtering feeding the reference frame buffer for the next prediction.]

Figure 5. The hybrid block-based encoder, drawn with the three scientific breakthroughs colour-coded. Blue blocks are prediction. Green blocks are transform. Purple blocks are information-theoretic: quantization (the only stage that loses information) and entropy coding (the squeeze at the end). The return path that turns the encoder's quantized residual back into a reference frame is the reason every hybrid codec is more than a one-shot pipeline.

In one paragraph: the encoder takes a block of pixels, predicts it from neighbouring pixels or the previous frame, computes the residual (the prediction error), transforms the residual into frequency coefficients, quantizes those coefficients (the only step that loses information), and entropy-codes the result into the smallest possible string of bits. The encoder also runs the inverse of those last three steps locally, so it can reconstruct what the decoder will see and use it as a reference for the next block's prediction. An in-loop filter (deblocking, SAO, ALF, CDEF — names that vary by codec) cleans up the reconstruction before it goes into the reference frame buffer. We go through the architecture in much more depth in Hybrid video codec architecture.
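Why the encoder must run its own local decode is easiest to see in a stripped-down, one-dimensional sketch. The code below keeps only prediction and quantization (no transform, entropy coding, or in-loop filter) and compares a closed-loop encoder, which predicts from its own reconstruction, with an open-loop one that predicts from the original samples. The sample values and step size are illustrative assumptions.

```python
def encode_closed_loop(samples, step):
    """Predict each sample from the previous *reconstructed* sample."""
    levels = []
    reconstructed = 0  # encoder and decoder start from the same known state
    for s in samples:
        residual = s - reconstructed           # prediction error
        level = round(residual / step)         # quantization: the only lossy step
        levels.append(level)                   # this is what gets entropy-coded
        reconstructed += level * step          # local decode, mirrors the decoder
    return levels

def encode_open_loop(samples, step):
    """Predict from the previous *original* sample (what not to do)."""
    levels = []
    previous = 0
    for s in samples:
        levels.append(round((s - previous) / step))
        previous = s                           # the decoder never sees this value
    return levels

def decode(levels, step):
    """Decoder: accumulate dequantized residuals onto its own reconstruction."""
    out = []
    reconstructed = 0
    for level in levels:
        reconstructed += level * step
        out.append(reconstructed)
    return out

def max_abs_error(original, decoded):
    return max(abs(a - b) for a, b in zip(original, decoded))

samples = [3, 6, 9, 12, 15, 18, 21, 24]
step = 4

closed = decode(encode_closed_loop(samples, step), step)
opened = decode(encode_open_loop(samples, step), step)

print("closed loop:", closed, "max error", max_abs_error(samples, closed))
print("open loop:  ", opened, "max error", max_abs_error(samples, opened))
```

In the closed loop the error never exceeds half a quantization step, because each new residual absorbs the previous sample's error. In the open loop the encoder and decoder disagree about the reference, and the mismatch accumulates sample after sample.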

The hybrid architecture has not changed since 1988. The arrows have not moved. What has changed in every codec generation is the content of the boxes: bigger and more adaptive blocks, more prediction modes, more transform variants, more sophisticated quantizers, smarter entropy coding. Each change has bought a few percent of bit-rate saving. Forty years of those few-percent gains compound to the order-of-magnitude difference between MPEG-1 and AV1.

The fourth pillar that holds it all together: rate-distortion optimisation

There is a fourth idea that is not exactly a "breakthrough" of its own but that deserves a paragraph here because it is what lets the others actually deliver their gains in a real encoder.

A typical block in a 4K frame has hundreds of possible coding-mode combinations in a 2026 codec: which prediction mode, which transform, which quantization step, which partition, which reference frame. The encoder cannot pick one randomly and hope. It picks the combination that minimises a Lagrangian cost of the form:

J = D + λ × R

where D is the distortion (typically sum of squared differences between the original and the reconstructed block), R is the number of bits the candidate combination will spend, and λ (lambda) is a tuning knob that sets the exchange rate between quality and bits. This rate–distortion optimisation machinery was formalised for video by Gary Sullivan and Thomas Wiegand in the late 1990s and was the central design pattern of H.264's encoder. ¹⁰ It is still the engine inside every modern reference and production encoder (the libaom AV1 encoder, the VVC test model VTM, the x264 / x265 / SVT-AV1 family). We explain it from scratch in Mode decision and rate-distortion optimization (RDO).

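The connection back to Shannon is direct. J = D + λR is the Lagrangian form of the constrained problem behind Shannon's R(D) curve: minimise distortion subject to a bit budget, with λ setting the slope at which the encoder operates on the curve. RDO is the practical algorithm that lets a real encoder operate as close to the rate–distortion bound as its mode menu allows.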
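A toy mode decision shows how λ steers the choice. The candidate modes and their distortion and bit counts below are invented numbers; in a real encoder each pair comes from actually encoding the block in that mode.

```python
def choose_mode(candidates, lam):
    """Pick the candidate that minimises J = D + lambda * R."""
    return min(candidates, key=lambda c: c["distortion"] + lam * c["bits"])

# Hypothetical candidates for one block: distortion as SSD, rate in bits.
candidates = [
    {"mode": "skip",        "distortion": 900.0, "bits": 1},
    {"mode": "inter_16x16", "distortion": 220.0, "bits": 38},
    {"mode": "inter_8x8",   "distortion": 150.0, "bits": 96},
    {"mode": "intra_4x4",   "distortion": 90.0,  "bits": 210},
]

for lam in (0.1, 1.0, 5.0, 50.0):
    best = choose_mode(candidates, lam)
    cost = best["distortion"] + lam * best["bits"]
    print(f"lambda = {lam:5.1f}  ->  {best['mode']:<12} J = {cost:.1f}")
```

A small λ says bits are cheap, so the encoder buys the lowest distortion. A large λ says bits are expensive, so it settles for skipping the block. Sweeping λ traces out the encoder's own operating curve, one point per value.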

A Common Mistake: Confusing the Three Pillars

A pattern we see often in product reviews and vendor pitches is to credit one pillar for a gain that actually came from another. A few examples worth flagging:

"AV1 is more efficient because of its new transforms." AV1's transform menu is wider than H.264's, but the biggest part of its roughly 50% bit-rate saving against H.264 comes from richer prediction (more intra modes, compound inter prediction, larger blocks) and from rate-distortion-driven partitioning. The transform improvements alone account for a single-digit-percent gain.
"HEVC saves bits because of better entropy coding." HEVC's CABAC is mildly improved over H.264's CABAC, but most of HEVC's 50% bit-rate saving against H.264 came from larger Coding Tree Units (up to 64 × 64), more directional intra modes, and improved motion-vector signalling. Entropy coding contributed only about 5–10% of the headline number.
"Neural codecs win because they use ML for entropy coding." Some do (the DCVC variants that originated at Microsoft Research Asia use a learned hyperprior model to predict the distribution of latents), but the much bigger gain in 2025–2026 neural codecs comes from learned prediction that effectively replaces both motion estimation and the transform with a single learned encoder–decoder network. The information-theoretic backbone is unchanged.

When you read a codec announcement, mentally check which of the three pillars the gain is being attributed to, and ask whether the attribution is plausible. A vendor that doesn't separate the pillars is a vendor that probably hasn't measured them separately.

The Modern Frontier: Are Neural Codecs a Fourth Pillar?

The three pillars we've described are all hand-engineered. The basis vectors of the DCT are written down by humans. The set of intra prediction modes is enumerated by a committee. The shape of the entropy model is fixed in the standard. The encoder has freedom to choose among the menu the standard provides, but the menu itself is fixed.

End-to-end neural video codecs published since 2018 try to remove that constraint. The encoder is a convolutional or transformer-based neural network that takes raw frames as input and emits a small set of latents — learned, compressed representations. The decoder is another neural network that inverts the encoder to produce a reconstructed frame. The whole pair is trained to minimise a rate–distortion loss of exactly the Lagrangian form J = D + λR — Shannon's curve, with all three pillars implicitly learned from data rather than designed by hand. ¹¹

By late 2024, the DCVC family (Deep Contextual Video Compression, originally from Microsoft Research Asia) became the first learned codec to surpass H.266 / VVC on standard test sequences in PSNR. ¹² By 2025, generative neural codecs based on diffusion models such as Bytedance's GNVC-VD were showing strong perceptual-quality wins at extremely low bitrates (below 0.01 bits per pixel), at the cost of decode complexity that is still several orders of magnitude higher than VVC. NVRC (Neural Video Representation Compression), an implicit neural representation approach published in 2024, reported a 23% BD-rate gain over VVC's reference encoder on the UVG dataset. ¹³

Whether end-to-end neural codecs become a true fourth pillar depends on the answer to one engineering question: can the decoder run inside the power and memory budget of a mobile phone or a smart TV? Today the answer is no: even the smallest neural decoders need several billion FLOPs per frame, which is far more than the few hundred million a hybrid AV1 decoder needs. Classical codecs with one neural stage (for example, AOMedia's experiments with a CNN-based in-loop filter for AV2) are the more likely first commercial path. We track the state of that horse race in The future: AV2, neural codecs, end-to-end learned compression.

For the next three years, the safe operating assumption for anyone planning infrastructure is: the three classical pillars are not going anywhere, AV1 and AV2 with hardware decoding will be your delivery codecs, and neural compression will start to appear as a quality enhancement layer (super-resolution, denoising, frame interpolation) on top of a hybrid bitstream, not yet as a replacement for it. The cheat sheet at the bottom of this article lays out the milestones to watch.

Where Fora Soft Fits In

Fora Soft has shipped 239 production video systems since 2005 across video conferencing, OTT and IPTV, video streaming, video surveillance, e-learning, telemedicine, and AR/VR. In every one of those projects, the choice of codec — and therefore the choice of which scientific breakthrough we're leaning on — shapes the architecture. On low-latency WebRTC services we trade transform efficiency for low decoder complexity and short GOPs. On high-volume OTT we lean into HEVC and AV1's better prediction at the cost of larger encoder farms. On surveillance, the same prediction that compresses sports footage perfectly is also what lets us run motion-triggered storage at a fraction of the bitrate of continuous recording. The three pillars are not abstractions; they are the levers we tune every time we spec a streaming or video product for a client.


If you're picking codecs for a new product, scoping a streaming architecture, or auditing a video pipeline that's leaking money, our engineers can help you map the three pillars onto the real trade-offs your service has to make.

Talk to a video engineer — book a 30-minute scoping call.
See our case studies — 239+ shipped video projects across OTT, WebRTC, surveillance, telemedicine, and e-learning.
Download the Video Coding Scientific Breakthroughs Cheat Sheet — one-page printable summary of the three pillars, the people who built them, and the milestone papers.

References

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423 and 27(4), 623–656. Foundational paper of information theory; introduces entropy, source coding, channel capacity.
Shannon, C. E. (1959). Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, Part 4, 142–163. Formal introduction of rate–distortion theory.
Mounts, F. W. (1969). A Video Encoding System Using Conditional Picture-Element Replenishment. Bell System Technical Journal, 48(7), 2545–2554. First explicit use of temporal redundancy in video coding.
Netravali, A. N., & Robbins, J. D. (1979). Motion-Compensated Television Coding: Part I. Bell System Technical Journal, 58(3), 631–670. Defines motion-compensated prediction; coins the term.
Jain, J. R., & Jain, A. K. (1981). Displacement Measurement and Its Application in Interframe Image Coding. IEEE Transactions on Communications, COM-29(12), 1799–1808. Block-matching motion estimation, full search and 2-D logarithmic search.
Ahmed, N., Natarajan, T., & Rao, K. R. (1974). Discrete Cosine Transform. IEEE Transactions on Computers, C-23(1), 90–93. Introduces DCT-II and its fast algorithm; foundation of JPEG, MPEG, H.26x.
Akansu, A. N., & Torun, M. U. (2015). A Primer for Financial Engineering: Financial Signal Processing and Electronic Trading. Academic Press. Chapter on KLT optimality and the formal proof that the KLT maximises coding gain for Gaussian sources.
Effros, M., Feng, H., & Zeger, K. (2004). Suboptimality of the Karhunen–Loève Transform for Transform Coding. IEEE Transactions on Information Theory, 50(8), 1605–1619. Proof that the KLT is not optimal when followed by uniform scalar quantization.
Wallace, G. K. (1992). The JPEG Still Picture Compression Standard. IEEE Transactions on Consumer Electronics, 38(1), xviii–xxxiv. The reference paper that defined how DCT is used in JPEG.
Sullivan, G. J., & Wiegand, T. (1998). Rate-Distortion Optimization for Video Compression. IEEE Signal Processing Magazine, 15(6), 74–90. The canonical practitioner reference for Lagrangian RDO in video encoders.
Lu, G., Ouyang, W., Xu, D., Zhang, X., Cai, C., & Gao, Z. (2019). DVC: An End-to-end Deep Video Compression Framework. CVPR 2019. First fully end-to-end learned video codec to outperform H.264.
Li, J., Li, B., & Lu, Y. (2023). Neural Video Compression with Diverse Contexts. CVPR 2023. The DCVC-DC paper; first learned codec to outperform VVC reference encoder in PSNR on multiple test sets.
Kwan, H. M., Gao, G., Zhang, F., Gower, A., & Bull, D. (2024). NVRC: Neural Video Representation Compression. NeurIPS 2024. INR-based learned codec reporting 23% BD-rate gain over VTM-RA on UVG.
Bitmovin. (2025). 9th Annual Video Developer Report. Industry survey covering codec adoption across 400+ video services. Available at bitmovin.com/video-developer-report.

Related glossary terms

Arithmetic coding

Bitrate

Block

CABAC

Codec

Deblocking filter