Entropy Coding: A Short Introduction (Huffman, Arithmetic, CABAC)

Published 2026-05-16 · 16 min read · By Nikolay Sapunov, CEO at Fora Soft

Why this matters

Entropy coding is the final stage of every video encoder, and it is responsible for roughly the last 10–20% of the bitrate reduction in every stream you have ever watched. It does not change what the codec decided to keep or throw away; it only changes how cheaply the decision is recorded. That makes it an unglamorous stage, with no visual change you can point to, but it is responsible for a measurable chunk of the bitrate, sets a hard ceiling on decoder speed in software and hardware, and is the reason H.264 has two flavours (CAVLC and CABAC) that sit in different product profiles. If you sell, buy, or build streaming, knowing what entropy coding does will save you from buying the wrong encoder preset, the wrong H.264 profile, or the wrong hardware path for live workflows.

What "entropy coding" actually means

The word entropy here comes from a single 1948 paper by Claude Shannon, and once you strip the jargon it is a count: the average number of bits per symbol you would need if you encoded the data optimally, given how often each symbol appears.¹ If the source has only one possible symbol, the entropy is zero — you do not need to transmit anything, because the receiver already knows what is coming. If the source has 256 equally likely symbols, the entropy is 8 bits per symbol — every choice is fully informative and you cannot do better than spelling each one out. Most real data sits between those extremes, and that gap is the room entropy coding lives in.

A short example. Imagine a stream of four possible symbols — call them A, B, C, D — appearing with these probabilities:

A: 50% of the time
B: 25% of the time
C: 12.5% of the time
D: 12.5% of the time

A naive coder would use 2 bits per symbol because there are four options. Shannon's formula tells you the real minimum: the entropy is 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits per symbol. An entropy coder that hits that limit will produce a file 12.5% smaller than the naive one for free — same content, same quality, just smarter bookkeeping.² That 12.5% is real money in storage and CDN bills for anyone moving large volumes of video.
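The arithmetic above can be checked in a few lines. This is a minimal sketch, assuming the four-symbol source from the example; the function and variable names are illustrative only.

```python
import math

def shannon_entropy(probabilities):
    """Average bits per symbol for an ideal code: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Probabilities of A, B, C, D from the example above.
source = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

entropy_bits = shannon_entropy(source.values())
naive_bits = math.log2(len(source))  # fixed-length code: 2 bits for 4 symbols
saving = 1 - entropy_bits / naive_bits

print(f"entropy = {entropy_bits} bits/symbol")    # entropy = 1.75 bits/symbol
print(f"saving  = {saving:.1%} vs fixed-length")  # saving  = 12.5% vs fixed-length
```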

A modern video codec produces an output stream that is heavily biased like the example above. After prediction and quantisation, most numbers are zero, motion vectors cluster around small values, and the choice of "which prediction mode did I just use?" follows a strong pattern. The job of entropy coding is to compress those biases out of the bitstream — losslessly, so the decoder can reconstruct the same numbers the encoder sent.

A useful analogy. Think of Morse code, designed in the 1830s long before information theory existed. The letter E appears most often in English, so it gets a single dot. The letter Q appears rarely, so it gets dash-dash-dot-dash. A telegraph operator who used four dots for every letter would still get the message through, but would spend three times as long on the wire. Entropy coding is the same idea, generalised: short codes for common things, long codes for rare things, and the savings compound across millions of symbols.

A simple stack labelled "what the video codec produces, in order": prediction, transform, quantisation, entropy coding. An arrow on the right says "from billions of pixels to a few megabits per second". The entropy coding box is highlighted, with a callout saying "lossless: last 10–20% of the bitrate".

Figure 1. Where entropy coding sits in the encoder pipeline. The earlier stages decide what to keep; entropy coding decides how to write down what was kept.

Idea 1 — Huffman coding (1952)

The first practical entropy coder was published in 1952 by David Huffman, then a graduate student at MIT.³ His algorithm produces an optimal prefix code: a table of variable-length bit sequences, one per source symbol, with two guarantees. First, the more common a symbol is, the shorter its code. Second, no code is a prefix of any other code — so the decoder can read bits left to right and always know where one code stops and the next begins, without a separator.

The construction is short enough to walk through. Each symbol starts as a leaf with its probability attached. The two leaves with the lowest probability are merged into a parent node whose probability is the sum; the new node now stands in for them. Repeat — always merging the two lowest-probability nodes — until one root remains. Read off the path from root to each leaf, calling each left branch "0" and each right branch "1", and the resulting bit string is that symbol's code.

For the A=50%, B=25%, C=12.5%, D=12.5% example, the Huffman code comes out to:

A = 0
B = 10
C = 110
D = 111

The average code length is 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits per symbol — exactly the Shannon entropy, because the probabilities all happen to be powers of one-half. When the probabilities are not nice fractions, Huffman cannot match the entropy exactly, but it never wastes more than one bit per symbol on average.⁴
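The merge-the-two-smallest construction can be sketched with a priority queue. This is an illustrative version, not a production coder: the tie-breaking counter and the convention that the first node popped takes the "0" branch are assumptions made here so that the result is deterministic.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a prefix code by repeatedly merging the two least likely nodes."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Each heap entry: (probability, tiebreak, {symbol: code-so-far})
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # lowest probability
        p1, _, right = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

def decode(bits, code):
    """Read bits left to right; the prefix property marks where each code ends."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
average = sum(probs[s] * len(code[s]) for s in probs)

print(code)                    # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
print(average)                 # 1.75
print(decode("100111", code))  # BAD
```

A different tie-breaking rule can swap which of C and D receives 110 and which 111, or flip the 0/1 labels, without changing any code length.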

A binary tree built from four nodes labelled A(0.5), B(0.25), C(0.125), D(0.125). Branches are labelled 0 and 1. The resulting codes are listed in a small table on the right: A=0, B=10, C=110, D=111, average length 1.75 bits/symbol.

Figure 2. A Huffman tree for four symbols. Leaves with the lowest probability sit deepest in the tree; the most common symbol gets the shortest code.

Where Huffman wins and where it loses

Huffman is fast — the decoder is essentially a table lookup, and it cost almost nothing to implement on the 1990s silicon that shipped MPEG-2 to set-top boxes around the world. It is also self-synchronising: once a decoder has the table, it can pick the bitstream up at any whole code boundary and resume.

The weakness is the one-bit-per-symbol penalty. Huffman assigns a whole number of bits to each symbol, but the Shannon-optimal code is often a fractional number of bits. If a symbol has probability 0.9, its information content is only 0.15 bits — yet Huffman is forced to spend a full bit on it. Across a video frame where one symbol (say, "no change here, skip block") is dominant, this rounding can leave 10–20% of the available compression on the table.

The fix is to widen the symbol — code pairs or triples of source symbols together, not one at a time. This shrinks the rounding waste, but the code table grows exponentially: 256 symbols become 65 536 pairs, then 16 million triples. At some point the table itself costs more than the savings, and you reach for a different tool.
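To see the trade-off in numbers, the sketch below builds Huffman codes for blocks of one, two, and three symbols from a two-symbol source in which the dominant symbol has probability 0.9. It assumes the symbols are independent, which real video data is not, and it computes only the average code length rather than the codes themselves.

```python
import heapq
import math
from itertools import product

def huffman_average_length(probabilities):
    """Expected Huffman code length in bits for the given distribution.

    Relies on the fact that the expected length equals the sum of the
    probabilities of all merged (internal) nodes of the Huffman tree.
    """
    heap = list(probabilities)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

p_skip = 0.9  # probability of the dominant symbol, as in the text above
single = [p_skip, 1 - p_skip]
entropy = -sum(p * math.log2(p) for p in single)

results = {}
for block in (1, 2, 3):
    # Probability of every possible block of `block` independent symbols.
    block_probs = [math.prod(combo) for combo in product(single, repeat=block)]
    results[block] = huffman_average_length(block_probs) / block
    print(f"block={block}: table size {len(block_probs)}, "
          f"{results[block]:.3f} bits/symbol (entropy {entropy:.3f})")
```

On this source the per-symbol cost falls from 1.000 to 0.645 to 0.533 bits while the table grows from 2 to 4 to 8 entries; the entropy is about 0.469 bits, so even triples leave a gap.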

Idea 2 — Arithmetic coding (1970s)

Arithmetic coding sidesteps the whole-bit limit by refusing to assign each symbol its own code at all. Instead, the whole message is encoded as a single number — a fraction between 0 and 1 — whose precision grows symbol by symbol.

The classic intuition. Start with the interval [0, 1). Slice it into segments proportional to symbol probabilities — A gets [0, 0.5), B gets [0.5, 0.75), C gets [0.75, 0.875), D gets [0.875, 1). When the first symbol arrives, pick the segment that corresponds to it and treat that segment as the new working interval. Slice it again in the same proportions for the second symbol. Pick the matching sub-segment, repeat. After every symbol the interval is narrower; after the whole message it is so narrow that any number inside it identifies the message uniquely. Write that number out in binary and you have your bitstream.

A short numeric run. Take A=50%, B=25%, C=12.5%, D=12.5% and encode the message BAD:

Start:    [0.0,    1.0)
After B:  [0.5,    0.75)         B occupies [0.5, 0.75) of the initial interval
After A:  [0.5,    0.625)        A occupies the first half of [0.5, 0.75)
After D:  [0.609375, 0.625)      D occupies the top 12.5% of [0.5, 0.625)

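Any number in [0.609375, 0.625), say 0.61, encodes BAD. The three symbols carry log2(1/(0.25 × 0.5 × 0.125)) = log2(64) = 6 bits of information, and the lower end of the interval, 0.609375, is 0.100111 in binary: six bits after the point. In this example Huffman does equally well (B=10, A=0, D=111 is the same six bits), because every probability is a power of one-half. The two methods part ways when the probabilities are not powers of one-half: a symbol with probability 0.9 narrows the interval by only 10%, which costs about 0.15 bits, whereas Huffman must still spend a whole bit on it.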
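The interval narrowing can be reproduced with exact fractions. This is a teaching sketch only: practical coders use fixed-precision integers with renormalisation, and the decoder here is told the message length as a simplifying assumption.

```python
from fractions import Fraction

# Slices of [0, 1) assigned to each symbol, as in the walkthrough above.
SLICES = {
    "A": (Fraction(0), Fraction(1, 2)),
    "B": (Fraction(1, 2), Fraction(3, 4)),
    "C": (Fraction(3, 4), Fraction(7, 8)),
    "D": (Fraction(7, 8), Fraction(1)),
}

def encode_interval(message):
    """Narrow [low, high) once per symbol and return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        start, end = SLICES[symbol]
        low, high = low + width * start, low + width * end
    return low, high

def decode(value, length):
    """Recover `length` symbols from any number inside the final interval."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        width = high - low
        position = (value - low) / width  # where value falls in the current interval
        for symbol, (start, end) in SLICES.items():
            if start <= position < end:
                out.append(symbol)
                low, high = low + width * start, low + width * end
                break
    return "".join(out)

low, high = encode_interval("BAD")
print(float(low), float(high))       # 0.609375 0.625
print(decode(Fraction(61, 100), 3))  # BAD
```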

That is the structural advantage: arithmetic coding can spend a fractional number of bits per symbol, so it tracks the Shannon entropy almost exactly, no matter how skewed the probabilities are.⁴ The cost is computation. Each symbol updates a high-precision interval; in software this is a multiply, an addition, and some renormalisation per symbol, every symbol. In hardware the data dependency between symbols — the interval after symbol N is needed before symbol N+1 can be processed — limits parallelism, and that limit is the central engineering story of the rest of this article.

Why it took 20 years to ship

The math of arithmetic coding was understood by the late 1970s but did not appear in consumer codecs for two decades. There were two reasons. The first was a thicket of IBM patents covering the most efficient implementations, which expired only in the late 1990s. The second was that whole-bit Huffman tables were fast enough on the silicon of the day, and the extra 10–15% from arithmetic coding did not justify the complexity. Once HD video arrived, those 10–15% started to matter, the patents started to expire, and the path was open.

Figure 3. Arithmetic coding narrows the interval one symbol at a time. The final bitstream is any number inside the last interval.

Idea 3 — CABAC (2003): arithmetic coding with context

By the time H.264 / AVC was being finalised, arithmetic coding had been around for decades and the patent landscape was clearing. The H.264 committee took two further steps that turned arithmetic coding from a textbook idea into the most efficient practical entropy coder in video.

Step one — binarisation. Every syntax element the encoder produces, no matter how many possible values it has, is first converted into a sequence of binary decisions. A non-binary symbol becomes a chain of yes/no bits, called bins to distinguish them from output bits.⁵ A motion vector with hundreds of possible values, a transform coefficient, a prediction mode — all get binarised. The arithmetic coder then only ever sees binary input, which simplifies the hardware enormously: every step is just "did I get a 0 or a 1?" with a probability between 0 and 1.
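As an illustration of what a binarisation looks like, here is a sketch of three schemes of the kind described by Marpe et al.: unary, truncated unary, and k-th order Exp-Golomb with a prefix of ones. Which scheme applies to which syntax element is fixed by the standard and is not modelled here; the function names are made up for this example.

```python
def unary(value):
    """`value` ones followed by a terminating zero."""
    return "1" * value + "0"

def truncated_unary(value, max_value):
    """Unary, but the terminating zero is dropped when value == max_value."""
    return "1" * value if value == max_value else unary(value)

def exp_golomb(value, k=0):
    """k-th order Exp-Golomb bins: a prefix of ones and a zero, then a binary suffix."""
    bins = []
    while value >= (1 << k):
        bins.append("1")
        value -= 1 << k
        k += 1
    bins.append("0")
    # Suffix: the remaining value written in k bits, most significant first.
    bins.extend(str((value >> shift) & 1) for shift in range(k - 1, -1, -1))
    return "".join(bins)

print(unary(3))               # 1110
print(truncated_unary(3, 3))  # 111
print([exp_golomb(v) for v in range(4)])  # ['0', '100', '101', '11000']
```

Small values produce few bins and large values produce more, so the binarisation already does part of the compression before the arithmetic coder sees anything.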

Step two — context modelling. Each bin is coded with its own probability estimate, and the estimate is chosen from a pool of contexts based on what was already encoded nearby. If you are coding the "this block has a non-zero coefficient" flag for a block, the probability the flag is 1 depends on whether the block to the left and the block above also had non-zero coefficients. CABAC keeps a table of probabilities for each context (399 contexts in H.264, 153 in HEVC after redesign for throughput).⁶ Every time a bin is coded, the table entry for its context is updated. The probabilities adapt to the local statistics of the stream as it flows.
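The idea of per-context adaptive probabilities can be sketched as follows. This is not the estimator the standard specifies; it swaps in a simple exponential-decay update, and the context rule, adaptation rate, and toy data are all assumptions chosen for illustration. The cost reported is the ideal arithmetic-coding cost, -log2(p), rather than the output of a real coder.

```python
import math

class AdaptiveBinModel:
    """Running estimate of P(bin == 1) for one context."""

    def __init__(self, p_one=0.5, rate=1 / 16):
        self.p_one = p_one
        self.rate = rate  # assumed adaptation speed, chosen for illustration

    def cost_bits(self, bin_value):
        """Ideal coding cost of this bin under the current estimate."""
        p = self.p_one if bin_value else 1 - self.p_one
        return -math.log2(p)

    def update(self, bin_value):
        """Nudge the estimate toward the bin that was just coded."""
        self.p_one += (bin_value - self.p_one) * self.rate

def select_context(left_flag, above_flag):
    """Hypothetical rule: context index = number of non-zero neighbours."""
    return left_flag + above_flag

contexts = [AdaptiveBinModel() for _ in range(3)]

# Toy data as (left_flag, above_flag, bin) triples: the flag is usually 0 when
# both neighbours are 0 and usually 1 when both neighbours are 1.
samples = ([(0, 0, 0)] * 9 + [(0, 0, 1)] + [(1, 1, 1)] * 9 + [(1, 1, 0)]) * 10

total_bits = 0.0
for left, above, bin_value in samples:
    model = contexts[select_context(left, above)]
    total_bits += model.cost_bits(bin_value)
    model.update(bin_value)  # the decoder performs the same update, so both stay in sync

print(f"{total_bits / len(samples):.2f} bits per bin, versus 1.00 with no modelling")
```

Because each context sees a strongly skewed stream, the average cost settles well below one bit per bin even though the overall mix of zeros and ones is even.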

This three-stage process — binarisation, context modelling, binary arithmetic coding — is what the letters C-A-B-A-C stand for: Context-Adaptive Binary Arithmetic Coding.

How much does CABAC actually save?

The headline number is 10–15% better compression than the alternative H.264 entropy coder, CAVLC (Context-Adaptive Variable-Length Coding, a Huffman-derivative), at the same picture quality.⁷ In some studies CABAC has shown up to 32% bitrate savings against pure Huffman for the same content.⁸ Most of that gain comes from context modelling — the fact that a transform coefficient flag in a sparse block has a very different probability than the same flag in a dense block, and CABAC tracks both.⁵

The cost is computation. CABAC decoding is famously a throughput bottleneck: the context model needed for bin N+1 depends on the result of bin N, which prevents pipelining and limits parallelism. CABAC was already a well-known bottleneck in H.264, and HEVC inherited the problem.⁹ HEVC's response was to redesign for throughput — fewer contexts, fewer context-coded bins per coefficient (the bulk of the bins now use a faster "bypass" mode that skips context modelling), smaller line buffers, and explicit support for high-level parallelism via tiles and wavefront parallel processing.

A real-world consequence — H.264 profiles

H.264 ships with two entropy coders, not one. CAVLC is required in all profiles. CABAC is only present in Main, High, and above — it is not available in Baseline or Extended.¹⁰ That distinction matters at deployment time. iOS and Android devices have supported Main/High since the early 2010s, so most consumer streaming uses CABAC; very old set-top boxes and low-cost surveillance cameras sometimes still ship Baseline, which means about 10–15% larger files for the same picture. The reason a security camera sometimes "looks worse than a phone at the same bitrate" is often that the camera uses Baseline H.264 and the phone uses High.

Aspect	Huffman / CAVLC	Arithmetic / CABAC
Code length	Whole bits per symbol	Fractional bits per symbol
Distance from Shannon limit	Up to 1 bit per symbol	A few hundredths of a bit
Compression vs CAVLC	Reference point	10–15% smaller in H.264
Decode throughput	High (table lookup)	Limited (serial data dependency)
Adaptation	Fixed tables (CAVLC picks among them by context)	Per-bin context update
Codecs that use it	JPEG, MPEG-1/2, H.264 Baseline	H.264 Main/High, HEVC, VVC
Hardware difficulty	Low	High

Table 1. Huffman/CAVLC versus arithmetic/CABAC at a glance.

Figure 4. CABAC's three stages and the feedback loop that updates context probabilities after every bin.

What AV1 did differently — the multi-symbol arithmetic coder

When the Alliance for Open Media (AOMedia) was assembling AV1 in 2017–2018, the team faced two problems with the CABAC approach. The first was patents — CABAC sits inside a thick H.264/HEVC patent pool that AOMedia explicitly wanted to avoid. The second was hardware. CABAC's binary, serial nature made it hard to scale to 4K and 8K, where decoders need to process millions of bins per second.

AV1 adopted daala_ec, a non-binary arithmetic coder from Mozilla's Daala research project, as its replacement.¹¹ Where CABAC handles one binary decision per step, daala_ec handles a multi-symbol alphabet — up to 16 symbols per syntax element — in a single step. Probabilities are stored as 15-bit Cumulative Distribution Functions (CDFs) per context, and they adapt after every coded symbol rather than once per frame.¹²
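A rough sketch of how a per-context CDF might adapt after each symbol is shown below. It is a simplified illustration and not bit-exact with AV1: the fixed `rate` value and the way the CDF is stored are assumptions, and a real coder also has to keep every symbol's probability above zero.

```python
CDF_ONE = 1 << 15  # 15-bit probability scale, as described above

def make_uniform_cdf(num_symbols):
    """cdf[i] = scaled probability that the coded symbol is <= i."""
    return [CDF_ONE * (i + 1) // num_symbols for i in range(num_symbols)]

def update_cdf(cdf, symbol, rate=4):
    """Shift probability mass toward `symbol` after it has been coded."""
    for i in range(len(cdf) - 1):  # the last entry always stays at CDF_ONE
        if i >= symbol:
            cdf[i] += (CDF_ONE - cdf[i]) >> rate  # raise entries at or above the symbol
        else:
            cdf[i] -= cdf[i] >> rate              # lower entries below it

def probability(cdf, symbol):
    """Probability of one symbol, recovered from adjacent CDF entries."""
    lower = cdf[symbol - 1] if symbol > 0 else 0
    return (cdf[symbol] - lower) / CDF_ONE

cdf = make_uniform_cdf(4)
before = probability(cdf, 2)
for _ in range(8):
    update_cdf(cdf, 2)  # symbol 2 is coded eight times in a row
after = probability(cdf, 2)

print(before, after)  # the estimate for symbol 2 rises above its initial 0.25
```

One update touches the whole alphabet at once, which is the sense in which a multi-symbol step does the work of several binary decisions.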

That single change has two consequences. Bit-level parallelism improves because each step now does the work of several CABAC bins simultaneously, so a hardware decoder can hit the same throughput at a lower clock rate, drawing less power.¹³ Patent avoidance improves because the multi-symbol formulation sits outside the binary-arithmetic-coding patent thicket. Compression efficiency is roughly comparable to CABAC at the syntax-element level; AV1's headline efficiency over HEVC comes mostly from other tools (longer transforms, better prediction, more reference frames), with entropy coding contributing a measurable but smaller slice.

H.266 / VVC, finalised in 2020, took the opposite road: it stayed with binary CABAC but added a multi-hypothesis probability estimator that runs two independent probability updates in parallel and averages them, plus larger context tables. Roughly 3–5% of VVC's total bitrate savings are traceable to the entropy coder alone.¹⁴
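The two-estimator idea can be sketched in a few lines. The shift values and the 15-bit scale below are assumptions for illustration, not the constants VVC specifies; the point is only that one estimate reacts quickly, the other slowly, and the coder uses their average.

```python
class TwoRateEstimator:
    """Average of a fast-adapting and a slow-adapting estimate of P(bin == 1)."""

    def __init__(self, fast_shift=4, slow_shift=7, scale_bits=15):
        self.one = 1 << scale_bits
        self.fast = self.slow = self.one >> 1  # both start at P = 0.5
        self.fast_shift = fast_shift  # assumed values, for illustration only
        self.slow_shift = slow_shift

    def probability_of_one(self):
        return (self.fast + self.slow) / (2 * self.one)

    def update(self, bin_value):
        """Move both estimates toward the coded bin, each at its own speed."""
        target = self.one if bin_value else 0
        self.fast += (target - self.fast) >> self.fast_shift
        self.slow += (target - self.slow) >> self.slow_shift

estimator = TwoRateEstimator()
for _ in range(20):
    estimator.update(1)  # a burst of ones

print(estimator.fast / estimator.one,
      estimator.slow / estimator.one,
      estimator.probability_of_one())
```

After the burst the fast estimate has moved most of the way toward 1 while the slow one has barely left 0.5, and the averaged probability sits between them.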

Codec	Entropy coder	Alphabet	Notes
MPEG-2	Huffman (static tables)	Symbol-by-symbol	Set-top box era
H.264 Baseline	CAVLC	Symbol-by-symbol	All-profile fallback
H.264 Main/High	CABAC	Binary	First widespread binary AC
HEVC / H.265	CABAC (redesigned)	Binary	Throughput improvements
VP9	Boolean binary AC	Binary	Predecessor of AV1
AV1	Multi-symbol AC (daala_ec)	Up to 16 symbols	Patent-aware, more parallel
H.266 / VVC	CABAC with multi-hypothesis update	Binary	+3–5% over HEVC entropy

Table 2. Entropy coders by codec generation.

A common pitfall — entropy coding is not the place to save quality

Common mistake: thinking that switching entropy coder from CAVLC to CABAC will visibly improve the image. It will not. Entropy coding is lossless: its only job is to compress what the rest of the pipeline already decided to keep. Picking CABAC instead of CAVLC with the same quantisation settings gives you the same picture; the gain shows up only as a smaller file, or as spare bits that rate control can spend on finer quantisation at the same target bitrate. If the picture changed, your encoder also changed something else under the hood (different quantisation, different rate control feedback). Conflating these two stages is one of the most common misreadings of encoder benchmarks. The picture-quality decisions live in prediction and quantisation; the bitrate-budget decisions live partly in entropy coding.

A second pitfall: assuming "CABAC is always on" in H.264. It is not — Baseline profile uses CAVLC only, and a surprising number of live-streaming and surveillance setups still emit Baseline by default. Always check the profile of the bitstream you are shipping, not the codec name.
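One way to check the profile, sketched under assumptions: the code below scans a raw Annex B byte stream (start-code delimited, as found in .h264 elementary streams) for the first sequence parameter set and reads its profile_idc byte. It does not handle MP4 or other containers, and it only reports whether the profile permits CABAC; whether a given stream actually uses it is signalled separately in the picture parameter set. The helper names and sample bytes are made up for this example.

```python
PROFILE_NAMES = {66: "Baseline", 77: "Main", 88: "Extended", 100: "High"}
CAVLC_ONLY = {66, 88}        # Baseline and Extended cannot use CABAC
CABAC_PERMITTED = {77, 100}  # Main and High may use either coder

def find_sps_profile_idc(stream):
    """Return profile_idc from the first SPS NAL unit, or None if absent."""
    pos = stream.find(b"\x00\x00\x01")
    while pos != -1 and pos + 4 < len(stream):
        nal_header = stream[pos + 3]
        if nal_header & 0x1F == 7:  # nal_unit_type 7 = sequence parameter set
            return stream[pos + 4]  # the first SPS payload byte is profile_idc
        pos = stream.find(b"\x00\x00\x01", pos + 3)
    return None

def describe(stream):
    idc = find_sps_profile_idc(stream)
    if idc is None:
        return "no SPS found"
    name = PROFILE_NAMES.get(idc, f"profile_idc {idc}")
    if idc in CAVLC_ONLY:
        return f"{name}: CAVLC only"
    if idc in CABAC_PERMITTED:
        return f"{name}: CABAC permitted"
    return f"{name}: not classified in this sketch"

# Hand-made SPS prefixes for illustration (start code, NAL header 0x67, profile_idc, ...).
baseline_sample = b"\x00\x00\x00\x01\x67\x42\xc0\x1e"
high_sample = b"\x00\x00\x01\x67\x64\x00\x28"

print(describe(baseline_sample))  # Baseline: CAVLC only
print(describe(high_sample))      # High: CABAC permitted
```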

Where Fora Soft fits in

We build video pipelines where the trade-off between encoder profile, entropy coder, and hardware target is a daily question — for video conferencing, video streaming, OTT and Internet TV, video surveillance, e-learning, telemedicine, and AR/VR. We have shipped systems where dropping H.264 Baseline in favour of Main + CABAC cut bandwidth costs by double-digit percent without touching the picture, and others where the throughput cost of CABAC on a constrained ARM SoC forced us back to CAVLC. The right answer is always content-specific and target-specific; the wrong answer is almost always "use the encoder defaults and hope for the best".


References

1. Shannon, C. E. (1948), A Mathematical Theory of Communication. Bell System Technical Journal. Foundational paper on entropy. https://en.wikipedia.org/wiki/Entropy_(information_theory)
2. Shannon's Source Coding Theorem: the lower bound on lossless compression equals the source's entropy. https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem
3. Huffman, D. A. (1952), A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE. https://en.wikipedia.org/wiki/Huffman_coding
4. Cover, T. M.; Thomas, J. A., Elements of Information Theory (2nd ed., Wiley, 2006). Standard reference on entropy, Huffman, and arithmetic coding bounds.
5. Marpe, D., Schwarz, H., Wiegand, T. (2003), Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–636. https://iphome.hhi.de/marpe/download/cabac_ieee03.pdf
6. Sze, V., Budagavi, M. (2012), High Throughput CABAC Entropy Coding in HEVC. IEEE Trans. CSVT. MIT version: https://dspace.mit.edu/bitstream/handle/1721.1/100315/hevc_cabac_chapter.pdf
7. Wikipedia, Context-adaptive binary arithmetic coding: bit-rate savings of CABAC over CAVLC in the range of 10–20% for SD/HD signals. https://en.wikipedia.org/wiki/Context-adaptive_binary_arithmetic_coding
8. NumberAnalytics, Entropy Coding: The Key to Efficient Data Compression. https://www.numberanalytics.com/blog/entropy-coding-efficient-data-compression
9. Sze, V., A Comparison of CABAC Throughput for HEVC/H.265 vs. AVC/H.264. MIT EEMS. https://eems.mit.edu/wp-content/uploads/2014/10/sze_sips_2013.pdf
10. Wikipedia, Context-adaptive variable-length coding: CAVLC is supported in all H.264 profiles; CABAC is restricted to Main and higher. https://en.wikipedia.org/wiki/Context-adaptive_variable-length_coding
11. AV1 (Wikipedia): Daala's entropy coder (daala_ec), a non-binary arithmetic coder, was selected for replacing VP9's binary entropy coder. https://en.wikipedia.org/wiki/AV1
12. Technical Overview of AV1, arXiv:2008.06091: AV1 uses a context-based multi-symbol arithmetic coder (MS-AC) with up to 16 symbols per syntax element and 15-bit CDFs. https://arxiv.org/pdf/2008.06091
13. Valin, J.-M. et al., An Overview of Core Coding Tools in the AV1 Video Codec. https://www.jmvalin.ca/papers/AV1_tools.pdf
14. Overview of Versatile Video Coding (H.266/VVC) and Its Coding Performance Analysis: VVC's CABAC retains the binary algorithm but uses a multi-hypothesis probability estimator and expanded context tables, contributing roughly 3–5% of total VVC bitrate savings. https://www.researchgate.net/publication/370714251_Overview_of_Versatile_Video_Coding_H266VVC_and_Its_Coding_Performance_Analysis
