Published: 2026-06-05 · Reading time: 22 min · Author: Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you build, buy, or operate a video product, your engineers will talk about codecs in a vocabulary — "masking", "MDCT", "LPC", "entropy stage" — that sounds like four unrelated mysteries. It is actually one mystery told four ways, and a product manager who holds the four ideas can follow a codec discussion, ask sharper questions, and understand why one codec sounds better than another at the same bitrate. This article is the conceptual key to the entire codec block of Learn: read it once, and the deep dives on AAC, Opus, and the rest stop being lists of acronyms and become variations on a theme you already understand. We wrote it for a smart reader with zero audio background; every technical claim is checked against the primary specification, not a secondhand summary.
First, what "compression" is actually doing
Before the four ideas, fix the problem they solve. Raw digital audio — the kind described in What is digital audio — is stored as pulse-code modulation, or PCM: the height of the sound wave measured tens of thousands of times per second and written down as plain numbers. PCM is honest and enormous. Do the arithmetic for one CD-quality stereo track:
44,100 samples/s × 16 bits × 2 channels = 1,411,200 bits/s ≈ 1,411 kbps
That is roughly ten megabytes per minute, for one stereo track. A codec's job is to get that 1,411 kbps down to something like 128 kbps — an eleven-fold shrink — while the listener hears almost no difference. The word codec is just "coder-decoder": the coder compresses, the decoder rebuilds.
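The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function name `pcm_bitrate_bps` is our own, and the 128 kbps target is the illustrative figure from the text, not a property of any particular codec.

```python
def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# CD-quality stereo, as in the text.
cd_bps = pcm_bitrate_bps(44_100, 16, 2)

# Bits per second -> bytes per second -> bytes per minute -> megabytes.
megabytes_per_minute = cd_bps / 8 * 60 / 1e6

# How many times smaller a 128 kbps stream is (illustrative target).
ratio_to_128k = cd_bps / 128_000

print(cd_bps, round(megabytes_per_minute, 2), round(ratio_to_128k, 1))
```

Changing the three arguments shows how quickly raw PCM grows: doubling the sample rate or the channel count doubles the bitrate.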
There are two honest ways to shrink, and the difference matters for the rest of the article. Lossless compression packs the audio without discarding a single sample, the way a ZIP file packs a document — FLAC does this, and it gets about a 2:1 shrink, no more, because that is all the redundancy a faithful recording contains. Lossy compression permanently deletes detail the ear is unlikely to notice, which buys a far bigger shrink — 11:1 and beyond — at the cost of throwing real data away forever. MP3, AAC, and Opus are all lossy. Three of our four ideas exist to decide what a lossy codec can safely delete; the fourth, entropy coding, is the lossless cleanup that both kinds share.
Figure 1. The four ideas at a glance. Masking and the MDCT power the music tribe, linear prediction powers the speech tribe, and entropy coding is the lossless final pass every codec runs.
Idea 1 — Psychoacoustic masking: delete what the ear cannot hear
The first and most important idea is not about mathematics. It is about the limits of human hearing. Psychoacoustic masking — "psychoacoustic" simply means how the ear and brain actually perceive sound — is the fact that a loud sound hides nearby quiet sounds, so the quiet ones can be deleted and nobody notices.
You already know the effect from daily life. A whispered conversation is perfectly audible in a quiet room and completely lost next to a passing motorcycle. The motorcycle did not make the whisper quieter; it made your ear unable to hear it. That is masking, and a lossy music codec spends most of its cleverness measuring exactly which sounds are masked so it can throw them away for free.
Masking comes in two kinds. Simultaneous masking happens in frequency: a loud tone hides quieter tones close to it in pitch, played at the same moment. Temporal masking happens in time: a loud sound hides quiet sounds just before and just after it — forward masking, where a loud burst raises the hearing threshold for the next sounds, lasts roughly 50 to 200 milliseconds after the burst ends. A codec exploits both. It looks at each slice of audio, builds a masking threshold — a line below which sound is inaudible because something louder is hiding it — and stores nothing below that line.
The reason this works at all is the physical design of the ear. The inner ear, the cochlea, does not hear every frequency with equal sharpness. It groups frequencies into roughly two dozen critical bands, each acting like a single blurry filter. The standard model, the Bark scale proposed by Eberhard Zwicker in 1961, divides the audible range into 24 critical bands. The MP3 standard (ISO/IEC 11172-3, 1993) used 25 such bands to estimate masking and decide where to spend bits. Within one critical band the ear cannot separate a quiet sound from a loud one — so the codec doesn't bother to store the quiet one.
Here is the practical consequence, with numbers. Suppose a loud 1,000 Hz tone is playing at 80 decibels. The masking threshold it creates might sit at, say, 40 decibels for nearby frequencies. Any sound in that neighbourhood quieter than 40 decibels is inaudible — the codec rounds it to nothing and spends those bits elsewhere. Multiply that decision across every critical band, twenty-some times per slice, dozens of slices per second, and you have removed most of the data while removing almost none of the perceived sound.
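The keep-or-drop decision can be sketched as a simple comparison. This is a deliberately crude illustration: it assumes one flat 40-decibel threshold around the masker, whereas a real psychoacoustic model computes a threshold that changes with distance from the masker. The component list and the name `split_by_threshold` are made up for the example.

```python
# Assumed flat masking threshold near the 80 dB, 1,000 Hz masker (from the text).
THRESHOLD_DB = 40.0

# Hypothetical components in the masker's neighbourhood: (frequency in Hz, level in dB).
components = [(950, 35.0), (1000, 80.0), (1050, 42.0), (1100, 20.0)]

def split_by_threshold(components, threshold_db):
    """Separate components the listener can hear from those hidden by the masker."""
    kept, dropped = [], []
    for freq_hz, level_db in components:
        if level_db >= threshold_db:
            kept.append((freq_hz, level_db))      # audible: spend bits here
        else:
            dropped.append((freq_hz, level_db))   # masked: store nothing
    return kept, dropped

kept, dropped = split_by_threshold(components, THRESHOLD_DB)
print("kept:", kept)
print("dropped:", dropped)
```

In this toy case the 35 dB and 20 dB components fall under the line and cost nothing, while the masker itself and the 42 dB neighbour survive.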
Pitfall — masking is content-dependent, so "transparent" is not a fixed bitrate. A dense rock mix with sound everywhere masks a lot, so a codec can compress it hard and still sound clean. A solo harpsichord or a held flute note has wide silent gaps and few maskers, so the same codec at the same bitrate exposes its artifacts — a "warbling" or "ringing" around the note. This is why codec quality is tested on hard material, not easy material, and why "128 kbps sounds transparent" is true for pop and false for a sparse classical recording.
Idea 2 — The frequency transform (MDCT): the tool that makes masking usable
Masking is the why; the frequency transform is the how. To delete the sounds hidden behind a masker, a codec first has to see the audio as a set of frequencies — because masking is a frequency-by-frequency phenomenon. The tool that does this is a mathematical operation called the modified discrete cosine transform, or MDCT.
Start with the everyday analogy. A microphone records sound as a single wiggling line — pressure over time. But the ear hears pitches: the low thump of a kick drum, the mid-range of a voice, the sparkle of a cymbal, all at once. A frequency transform is the translation between those two views. It takes a short slice of the wiggling line and reports back: "this slice is mostly a strong low frequency, a medium voice band, and a faint high sparkle." Once the audio is described as a list of frequency strengths, the masking model can look at each frequency and decide, band by band, what to keep and what to drop.
The MDCT is the specific transform nearly every music codec uses — MP3, AAC, AC-3, Vorbis, and the music half of Opus all run on it. It was introduced by John Princen and Alan Bradley in 1986, with the oddly-stacked refinement by Princen, Johnson, and Bradley in 1987. Two properties make it the right tool, and both are worth understanding in plain terms.
First, the MDCT processes overlapping slices. Adjacent slices share 50% of their samples — each block of audio is covered by two windows that fade into each other. If a codec chopped audio into hard, non-overlapping blocks, the seam between blocks would click audibly on playback. The overlap smooths the seams.
Second — and this is the elegant part — even though the slices overlap by half, the MDCT does not store twice as much data. It is critically sampled: a slice of 2N input samples produces only N output numbers. Half the information looks like it should be missing. It is recovered on playback by a trick called time-domain aliasing cancellation: the overlap from one slice carries exactly the error needed to cancel the overlap error in the next, as long as the fade-in and fade-out windows satisfy the Princen–Bradley condition, w(n)² + w(n+N)² = 1. The reader does not need the equation; the takeaway is that the MDCT gives you smooth, seamless slices and no storage penalty for the overlap. That combination is why it became the universal engine of music coding.
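For readers who want to see the cancellation happen, here is a small, slow, pure-Python sketch of an MDCT round trip. It assumes a sine window (one common choice that satisfies the Princen–Bradley condition), a tiny block size of N = 8, and zero padding at both ends; the function names are ours, and production codecs use fast algorithms and other window shapes.

```python
import math

def sine_window(n_half):
    """Sine window of length 2N; it satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def mdct(frame, n_half):
    """2N windowed samples in, only N coefficients out."""
    return [
        sum(frame[n] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half))
        for k in range(n_half)
    ]

def imdct(coeffs, n_half):
    """N coefficients in, 2N aliased samples out (2/N scaling for the windowed case)."""
    return [
        (2.0 / n_half) * sum(
            coeffs[k] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half))
        for n in range(2 * n_half)
    ]

def roundtrip(signal, n_half):
    """Window, transform, inverse-transform, window again, and overlap-add."""
    w = sine_window(n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    out = [0.0] * len(padded)
    coeff_count = 0
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):  # hop = N (50% overlap)
        frame = [padded[start + n] * w[n] for n in range(2 * n_half)]
        coeffs = mdct(frame, n_half)
        coeff_count += len(coeffs)
        rebuilt = imdct(coeffs, n_half)
        for n in range(2 * n_half):
            out[start + n] += rebuilt[n] * w[n]   # aliasing cancels in the overlap
    return out[n_half:-n_half], coeff_count

N = 8
signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(4 * N)]
rebuilt, coeff_count = roundtrip(signal, N)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
window = sine_window(N)
print(coeff_count, max_error)
```

Each hop of N new samples produces exactly N coefficients, yet the overlap-added output matches the input to within floating-point rounding. Nothing is quantized here, so the sketch shows only the transform, not the lossy step that follows it.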
Figure 2. The MDCT turns a slice of waveform into a frequency spectrum; the masking model draws a threshold curve; everything below the curve is inaudible and discarded.
There is one more job the transform does for free: it concentrates the audio's energy into a few large numbers and many tiny ones. Most frequency slots end up near zero. That sets up the fourth idea — entropy coding — which loves long runs of near-zero numbers. But first, the other tribe.
Idea 3 — Linear prediction: model the voice instead of storing it
The first two ideas come from the music tribe of codecs, which never tries to understand what made the sound — it just transforms and masks any sound at all. The third idea comes from the rival speech tribe, and it works in the opposite direction: instead of storing the sound, it builds a small working model of the human voice and stores only the corrections.
The idea is called linear prediction, and the insight behind it is that the human voice is not arbitrary noise — it is made by one throat with known physics. The vocal cords buzz, producing a raw tone; the throat, mouth, and tongue shape that tone into vowels and consonants. Engineers call this the source-filter model: a source (the buzz, or for consonants a hiss) passed through a filter (the shape of the mouth). Manfred Schroeder and Bishnu Atal turned this model into a practical codec, code-excited linear prediction (CELP), in 1985, and almost every speech codec since — including the SILK half of Opus — is a descendant.
Here is how it saves bits. A predictor looks at the last handful of audio samples and guesses the next one: a linear predictor of order ten guesses the next sample as a weighted sum of the previous ten. For voice, this guess is remarkably good, because voice changes slowly and smoothly. The codec then stores only two cheap things: the small set of predictor weights (which describe the shape of the mouth right now) and the residual — the difference between the guess and the truth. Because the guess is good, the residual is tiny, and tiny numbers cost few bits.
Walk the numbers. If raw voice at a given moment needs, say, twelve bits per sample to store honestly, but the predictor's guess is wrong by only a small amount most of the time, the residual might need just two or three bits per sample, plus a handful of bits per slice for the predictor weights. That is the speech tribe's whole trick: don't transmit the voice, transmit the error in predicting the voice, plus a recipe for the mouth that made it. It is why a phone call fits in a few kilobits per second when the same voice as raw 16-bit PCM at telephone bandwidth (8,000 samples/s × 16 bits) would need 128 kbps.
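The predict-and-store-the-error idea can be tried on a toy signal. The sketch below is built on assumptions of our own: the "voice" is an impulse train (the buzz) passed through a two-pole resonator (the mouth) with made-up coefficients 1.6 and -0.8, and the predictor is order two rather than ten so the result is easy to check. The weights are fitted with the textbook autocorrelation method and Levinson-Durbin recursion; this is not the procedure of any specific codec.

```python
def synth_voice(length, period, a1=1.6, a2=-0.8):
    """Toy source-filter signal: periodic impulses through a 2-pole resonator."""
    x = []
    for n in range(length):
        excitation = 1.0 if n % period == 0 else 0.0
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(a1 * prev1 + a2 * prev2 + excitation)
    return x

def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[i] * x[i - lag] for i in range(lag, len(x))) for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor weights a[1..order] such that x[n] is approximated by sum(a[j] * x[n-j])."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                       # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error energy
    return a[1:]

def residual(x, weights):
    """Difference between each sample and its prediction from earlier samples."""
    order = len(weights)
    return [x[n] - sum(weights[j] * x[n - 1 - j] for j in range(order))
            for n in range(order, len(x))]

signal = synth_voice(800, 80)
weights = levinson_durbin(autocorr(signal, 2), 2)
res = residual(signal, weights)
energy_ratio = sum(e * e for e in res) / sum(s * s for s in signal)
print(weights, energy_ratio)
```

The fitted weights land close to the resonator's own coefficients, and the residual carries only a small fraction of the signal's energy: what is left is essentially the impulse train, which is far cheaper to describe than the waveform itself.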
This is also why speech codecs and music codecs were incompatible for forty years. Linear prediction assumes the sound came from one voice-like source; feed it an orchestra and the prediction collapses, because forty instruments are not one buzzing throat. Masking-and-transform codecs assume nothing about the source, so they handle music but waste bits on a lone voice they could have predicted. Each tribe was excellent at its own job and poor at the other's. The full story of that split, and the history of audio codecs that merged them, is its own article.
Pitfall — the wrong tribe for the content sounds bad in a specific way. A pure speech codec asked to carry on-hold music produces a watery, robotic version of the melody, because its voice model has no idea what to predict. A pure music codec asked to carry a single voice at a very low bitrate produces a "swirly" artifact around the speech. The fix is not a better bitrate; it is the right idea for the content — which is exactly the problem Opus solved by carrying both.
Idea 4 — Entropy coding: pack the leftovers without losing a bit
The first three ideas decide what to keep. The fourth idea, entropy coding, packs whatever survived as tightly as mathematics allows — and unlike the others, it is perfectly lossless. It throws nothing away; it just removes the wastefulness in how the surviving numbers are written down.
The plain-language version: after masking and the transform, most of the frequency numbers are zero or near-zero, and a few are large. Storing every number with the same fixed width — say, sixteen bits each — is wasteful, because the common small values do not need sixteen bits and the rare large ones are rare. Entropy coding assigns short codes to common values and long codes to rare ones, so the total shrinks. It is the same principle as Morse code giving the common letter "E" a single dot and the rare "Q" a long dash-dash-dot-dash.
Two methods dominate. Huffman coding builds a tree of variable-length codes and is what MP3 and AAC use — fast and simple, with a small inefficiency because each code must be a whole number of bits. Arithmetic coding, and its close cousin the range coder that Opus uses, encodes an entire slice as one very long fractional number and can get closer to the theoretical limit, at slightly higher computational cost. That theoretical limit is not a marketing claim; it is Shannon's source coding theorem, which sets the hard floor on how few bits a stream of symbols can take. Entropy coders are judged by how close to Shannon's floor they get.
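Shannon's floor is easy to compute for a simple case. The sketch below assumes a made-up two-symbol source in which 90 percent of the symbols are "zero" and 10 percent are "non-zero", and compares the entropy with the one whole bit per symbol that is the shortest code a scheme limited to whole-bit codes can assign.

```python
import math

def shannon_entropy_bits(probabilities):
    """Average information per symbol, in bits, for a memoryless source."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

p_zero = 0.9                                   # assumed share of zero symbols
entropy = shannon_entropy_bits([p_zero, 1 - p_zero])

whole_bit_code = 1.0                           # shortest possible whole-bit code per symbol
print(round(entropy, 3), whole_bit_code)
```

The entropy comes out at a little under half a bit per symbol. A coder that must spend at least one whole bit on every symbol cannot reach that floor on such a skewed source, which is the gap arithmetic and range coders are designed to close.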
A short worked example makes the saving concrete. Suppose a slice has 100 frequency numbers and 90 of them are zero. A fixed 8-bit-per-number scheme costs 100 × 8 = 800 bits. An entropy coder that gives the value "zero" a 1-bit code and spends an average of 9 bits on each of the 10 non-zero values costs about 90 × 1 + 10 × 9 = 180 bits — a better-than-4:1 saving, with zero loss of accuracy. The transform set this up by pushing energy into a few numbers; the entropy coder cashes it in.
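The 180-bit figure can be checked with a toy variable-length code. This is an assumed scheme chosen to match the arithmetic above, not Huffman coding as any standard defines it: a zero is written as the single bit `0`, and a non-zero value as the bit `1` followed by the value in 8 bits. The non-zero values and their positions are invented, and every value is assumed to fit in 0 to 255.

```python
def encode(values):
    """Zero -> '0'; non-zero -> '1' plus the value as 8 bits (9 bits total)."""
    parts = []
    for v in values:
        parts.append("0" if v == 0 else "1" + format(v, "08b"))
    return "".join(parts)

def decode(bitstring):
    """Inverse of encode: read a flag bit, then 8 value bits if the flag is 1."""
    values, i = [], 0
    while i < len(bitstring):
        if bitstring[i] == "0":
            values.append(0)
            i += 1
        else:
            values.append(int(bitstring[i + 1:i + 9], 2))
            i += 9
    return values

# 100 frequency numbers, 90 of them zero (hypothetical data).
slice_values = [0] * 100
for position, value in zip(range(5, 100, 10), [200, 17, 3, 90, 45, 8, 130, 61, 2, 255]):
    slice_values[position] = value

fixed_bits = len(slice_values) * 8
encoded = encode(slice_values)
print(fixed_bits, len(encoded), decode(encoded) == slice_values)
```

The decoder recovers every number exactly, which is what "lossless" means here: the saving comes purely from writing common values with fewer bits.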
Crucially, entropy coding is the one idea shared by every codec, lossy and lossless. FLAC and ALAC, the lossless formats, are essentially "linear prediction plus entropy coding with the lossy step removed" — they predict each sample and entropy-code the residual, never discarding anything. That is why lossless tops out near 2:1: with nothing thrown away, the entropy coder can only remove genuine redundancy, not perceptual slack.
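The lossless recipe can be sketched with integer arithmetic. The example assumes a fixed second-order predictor (guess the next sample by extending the straight line through the previous two) and a smooth test tone of our own invention; it leaves out the entropy-coding stage and is not a description of how any particular encoder chooses its predictor.

```python
import math

def encode_lossless(samples):
    """Keep the first two samples verbatim, then store prediction errors only."""
    warmup = samples[:2]
    residuals = [samples[n] - (2 * samples[n - 1] - samples[n - 2])
                 for n in range(2, len(samples))]
    return warmup, residuals

def decode_lossless(warmup, residuals):
    """Rebuild each sample from the same prediction plus the stored error."""
    samples = list(warmup)
    for r in residuals:
        samples.append(r + 2 * samples[-1] - samples[-2])
    return samples

# Hypothetical input: a smooth integer-valued tone, 64 samples per cycle.
samples = [round(1000 * math.sin(2 * math.pi * i / 64)) for i in range(256)]

warmup, residuals = encode_lossless(samples)
decoded = decode_lossless(warmup, residuals)

largest_sample = max(abs(s) for s in samples)
largest_residual = max(abs(r) for r in residuals)
print(largest_sample, largest_residual, decoded == samples)
```

The samples swing across a range of about a thousand while the residuals stay within a few units, so they need far fewer bits each, and the decoder still reproduces the input exactly because nothing was rounded away.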
Putting the four ideas together
Now assemble them. A modern lossy music codec runs three of the four in sequence: the MDCT turns each slice into frequencies, the masking model decides how coarsely each frequency can be stored (or whether to drop it), and entropy coding packs the result. A modern speech codec runs a different three: linear prediction models the voice and produces a small residual, a lighter masking-style weighting decides how to spend bits on that residual, and entropy coding packs it. The two recipes share only the last step — which is exactly why the two tribes stayed apart so long.
The table below shows which ideas each well-known codec leans on. "Primary engine" is the idea doing most of the work.
| Codec | Primary engine | Masking | Linear prediction | Entropy stage | Tribe |
|---|---|---|---|---|---|
| MP3 | MDCT + masking | yes | no | Huffman | music |
| AAC-LC | MDCT + masking | yes | no | Huffman | music |
| AC-3 (Dolby Digital) | MDCT + masking | yes | no | custom | music |
| Vorbis | MDCT + masking | yes | no | codebook (Huffman-like) | music |
| G.711 | none (companded PCM) | no | no | none | speech |
| CELP-family speech | linear prediction | light | yes | varies | speech |
| Opus (SILK mode) | linear prediction | light | yes | range coder | speech |
| Opus (CELT mode) | MDCT + masking | yes | no | range coder | music |
| xHE-AAC / USAC | MDCT and prediction | yes | yes | arithmetic | both |
Two rows in that table are the punchline of the whole codec story. Opus carries both a linear-prediction speech mode (SILK) and an MDCT music mode (CELT) and switches between them — or blends them in a hybrid mode — inside one stream, all packed by a single range coder. xHE-AAC (the USAC standard, ISO/IEC 23003-3) does the same trick from the MPEG side, combining transform coding with prediction. Both arrived in 2012, and both work because they finally hold all four ideas at once. The deep mechanics of that switch live in Opus: the open codec that ate WebRTC.
Figure 3. The music pipeline and the speech pipeline share only their final entropy stage. A unifier like Opus carries both and switches per slice.
How much does each idea buy you?
It helps to see the rough contribution of each idea to the final shrink, using CD-quality stereo (1,411 kbps) as the starting point and a typical 128 kbps music codec as the destination. These are illustrative magnitudes, not exact figures — the split shifts with content — but they show where the bits actually go.
The transform itself saves almost nothing on its own; its job is to reorganise the audio so the next two ideas can work. Masking does the heavy lifting on music: discarding inaudible frequencies is where most of the 11:1 shrink comes from. Entropy coding then recovers another chunk by packing the survivors, often a further 20 to 40 percent on top of what masking achieved. For speech, linear prediction replaces masking as the heavy lifter, and the shrink is even more dramatic: measured against the same 1,411 kbps starting point, a voice stays intelligible at 6 to 16 kbps, a reduction of roughly 90-fold to 235-fold, because the voice model is so much more powerful than a generic transform.
Figure 4. Where the bits go. Masking does most of the work for music; linear prediction does it for speech; entropy coding cleans up both.
A worked example: one slice, start to finish
To make the four ideas concrete, follow a single 20-millisecond slice of stereo music through a music codec.
The slice arrives as PCM: at 48 kHz, 20 ms is 960 samples per channel, each 16 bits, so 960 × 16 × 2 = 30,720 bits of raw data for this one slice. Step one, the MDCT: each channel's 960 samples (windowed with its neighbour) become roughly 960 frequency numbers. Same count, new view, with the energy now concentrated in a handful of them. Step two, masking: the model finds the loud frequencies, draws the masking threshold beneath them, and discards or coarsely rounds every frequency below the line; perhaps 700 of each channel's 960 numbers drop to zero. Step three, entropy coding: the surviving ~260 numbers per channel plus the long runs of zeros are Huffman- or range-coded, the common zeros taking a fraction of a bit each. The slice that entered as 30,720 bits leaves as roughly 2,700 bits, and at 50 slices per second that lands the stream near 135 kbps for the two channels combined. Nothing in the chain understood "music"; it only understood the ear.
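The bookkeeping for this slice is simple enough to check by hand or in code. In the sketch below, the 2,700-bit coded size is the illustrative figure from the text, not something the code derives; everything else follows from the sample rate, slice length, bit depth, and channel count.

```python
SAMPLE_RATE_HZ = 48_000
SLICE_MS = 20
BITS_PER_SAMPLE = 16
CHANNELS = 2

samples_per_channel = SAMPLE_RATE_HZ * SLICE_MS // 1000
raw_bits = samples_per_channel * BITS_PER_SAMPLE * CHANNELS
slices_per_second = 1000 // SLICE_MS

coded_bits = 2_700                               # illustrative value from the text

raw_kbps = raw_bits * slices_per_second / 1000
coded_kbps = coded_bits * slices_per_second / 1000
shrink = raw_bits / coded_bits

print(samples_per_channel, raw_bits, raw_kbps, coded_kbps, round(shrink, 1))
```

Note that the raw rate here is 1,536 kbps rather than 1,411 kbps, because this example runs at 48 kHz instead of the CD rate of 44.1 kHz.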
A speech codec would have taken the same slice down a different road: predict each sample from the last ten, store the small residual and the predictor weights, entropy-code those — and reached an even smaller number, because a 20 ms slice of one voice is far more predictable than a 20 ms slice of a full mix.
Where Fora Soft fits in
These four ideas are not academic to us; they decide how products we build actually sound. In real-time work — video conferencing, telemedicine, e-learning, live shopping — we lean on the speech-and-music unifier Opus precisely because it carries linear prediction for voice and the MDCT for the occasional shared music or screen audio, switching automatically as the call content changes. In streaming and OTT work we deal daily with the masking-and-transform family, AAC and its relatives, because that is what ships inside the world's video. Knowing which idea a codec leans on tells us, before we write a line of code, where its quality will hold and where it will crack — a sparse classical stream, a noisy clinic, a 50-person classroom each stress a different one of the four. Fora Soft has shipped audio across conferencing, OTT/Internet TV, surveillance, e-learning, telemedicine, and AR/VR since 2005.
What to read next
- A short history of audio codecs: from MP2 (1991) to LC3 (2020)
- Opus: the open codec that ate WebRTC
- The AAC family: AAC-LC, HE-AAC v1, HE-AAC v2, xHE-AAC
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your audio compression plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the "How audio compression works: four ideas" cheat sheet, a one-page reference covering the four ideas (psychoacoustic masking, MDCT, linear prediction, entropy coding) and a table mapping each major codec to the ideas it uses.
References
- IETF RFC 6716, "Definition of the Opus Audio Codec", September 2012 (updated by RFC 8251, October 2017). Primary source for the SILK linear-prediction mode, the CELT MDCT mode, the hybrid switch, and the range coder used as Opus's entropy stage. https://datatracker.ietf.org/doc/html/rfc6716
- IETF RFC 8251, "Updates to the Opus Audio Codec", October 2017. Confirms RFC 6716 is the current Opus definition as amended. https://datatracker.ietf.org/doc/html/rfc8251
- ISO/IEC 11172-3:1993, "Information technology — Coding of moving pictures and associated audio … Part 3: Audio" (MPEG-1 Audio). Defines MP3 (Layer III), its psychoacoustic model, and the 25-band critical-band partition used for masking. (ISO catalogue; full normative text paywalled — used for the critical-band count and Huffman entropy stage.) https://www.iso.org/standard/22411.html
- ISO/IEC 14496-3 (MPEG-4 Audio), current edition. Home standard of the AAC family; defines the MDCT-based filter bank and the Huffman entropy stage of AAC-LC. (ISO catalogue; normative text paywalled.) https://www.iso.org/standard/76383.html
- ISO/IEC 23003-3 (MPEG-D Unified Speech and Audio Coding, USAC / xHE-AAC). The standard combining transform coding and linear prediction in one codec; cited for the "both tribes in one stream" claim. https://www.iso.org/standard/82983.html
- J. P. Princen and A. B. Bradley, "Analysis/synthesis filter bank design based on time domain aliasing cancellation", IEEE Transactions on ASSP, 1986, with the oddly-stacked refinement in Princen, Johnson & Bradley, ICASSP 1987. Primary source for the MDCT, time-domain aliasing cancellation, and the Princen–Bradley window condition. https://ieeexplore.ieee.org/document/1164954
- M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates", ICASSP 1985. Primary source for the linear-prediction / source-filter speech model that underlies SILK and the speech tribe. https://ieeexplore.ieee.org/document/1168147
- E. Zwicker, "Subdivision of the audible frequency range into critical bands (Frequenzgruppen)", Journal of the Acoustical Society of America, 1961. Primary source for the Bark scale and the 24 critical bands of human hearing. https://pubs.aip.org/asa/jasa/article/33/2/248/734163
- C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, 1948. The source-coding theorem that sets the lower bound entropy coders approach. https://ieeexplore.ieee.org/document/6773024
- Xiph.Org Foundation — Opus and Vorbis documentation. Reference-implementation context for the range coder and MDCT in open codecs (used as tier-3 corroboration, not as the source of any spec fact). https://opus-codec.org/
- "The Bark Frequency Scale", Julius O. Smith III, CCRMA, Stanford. Educational corroboration for the Bark/critical-band model (orientation only; the spec claim rests on Zwicker 1961). https://ccrma.stanford.edu/~jos/sasp/Bark_Frequency_Scale.html
Source-conflict note: popular articles often state "24 critical bands" as a universal fact; the MP3 standard (ISO/IEC 11172-3) actually partitions into 25 bands for its masking model. The article follows the standard for the MP3-specific claim and Zwicker 1961 for the general human-hearing figure of 24, distinguishing the two rather than collapsing them.


