Published 2026-05-16 · 15 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you build, buy, or operate any product that delivers video, psycho-visual redundancy is the lever that decides how much your CDN bills, your encode farm, and your viewer churn really cost. A team that understands which encoder flags drive perceptual quality will ship 1080p streams at a bitrate the next team needs for 720p, and they will do it without their viewers being able to tell. The same understanding tells you when 4:4:4 chroma is a real requirement (post-production, screen sharing, telemedicine) and when it is an expensive vanity feature (mainstream streaming). Skip it, and you over-spend on bits the eye will never see — or you under-spend on the few places it does see, and viewers complain about a "soft" or "blocky" picture they cannot describe. This article walks a non-technical reader from "the eye is not a camera" all the way to a working mental model of psy-rd, AQ modes, and per-title encoding.
What "psycho-visual redundancy" actually means
The phrase has three parts, and the order matters. "Psycho-" refers to perception — what the brain actually does with a signal once the eye has captured it. "Visual" is just a reminder that we are talking about images and not, say, audio masking inside MP3. "Redundancy" is the engineering word for information you can drop without consequence.
Put them together: psycho-visual redundancy is the part of a video signal that the encoder can throw away because the viewer's perceptual system would have thrown it away anyway. It is not the same as the redundancy a codec removes between neighbouring pixels (that's spatial redundancy) or between neighbouring frames (that's temporal redundancy). Those two are statistical — the data is repetitive. Psycho-visual redundancy is perceptual — the data is real, but the viewer would never have noticed it.
A useful analogy. Think of a postal worker delivering a 100-page contract. Spatial and temporal redundancy are the parts of the contract that are duplicates of pages already delivered — the worker can compress those and the recipient still gets the full document. Psycho-visual redundancy is the legalese on page 73 that nobody is going to read; the worker can replace it with a one-line summary and the recipient's behaviour does not change. The eye, like the contract reader, has a budget of attention. A codec that knows where that budget is being spent can stop wasting bits on the bits the eye skips.
This is the third pillar of every modern codec, alongside spatial and temporal compression. It was Claude Shannon's information-theory framing ("irrelevant" information versus "redundant" information) that made the distinction explicit, and JPEG was the first widely deployed standard to take psycho-visual modelling seriously, in 1992 [1]. Every codec since has built on the same idea, and every codec generation has refined the model: MPEG-2 added scene-adaptive quantisation, H.264 added in-loop psychovisual-aware deblocking, HEVC added more precise quantisation matrices, and AV1 added saliency-driven adaptive quantisation segments.
Figure 1. The three forms of redundancy a codec exploits. The first two are statistical patterns in the signal; the third is a property of the viewer.
How the eye actually sees
Before walking through the tricks, you need a rough mental model of what the eye does well and what it does poorly. The human retina has two main photoreceptor populations. Rod cells dominate peripheral and low-light vision and are colour-blind; they outnumber cone cells about 20 to 1 [2]. Cone cells handle daylight vision and come in three subtypes that respond to long, medium, and short wavelengths, which we summarise as red, green, and blue. Crucially, the three subtypes are not present in equal numbers, and they are concentrated in a small central patch of retina called the fovea. Outside the fovea, your colour resolution collapses fast.
The brain integrates the rod and cone signals into two roughly independent channels, usually described with names borrowed from broadcast engineering. One channel, luma, carries brightness, with high spatial resolution and high contrast sensitivity. The other channel, chroma, carries colour, with about half the spatial resolution of luma and a noticeably lower sensitivity to fine detail [3]. The split is not a codec invention; it is a property of the visual pathway, measurable with psychophysical experiments going back to the 1950s.
Three concrete consequences fall out of that split. First, you can throw away most of the chroma resolution and the viewer will not see the change. This is the foundation of chroma subsampling and the reason 4:2:0 ships in essentially every consumer video format. Second, the eye is much more sensitive to luma at low and middle spatial frequencies than to luma at high spatial frequencies. The number that captures this, sensitivity as a function of cycles per degree, is called the contrast sensitivity function, or CSF, and it peaks around 4–6 cycles per degree, then falls off sharply [4]. Third, the eye's sensitivity to error in any one region drops sharply when that region is busy, dark, or moving fast, because higher-priority signal in the same neighbourhood crowds out the perception of the noise. This phenomenon is called masking.
A practical thought experiment. Hold a sheet of paper at arm's length and read these letters — easy. Stand up, walk past the paper at full speed, and try to read the same letters — impossible, even though the eye received exactly the same photons. The brain stopped resolving fine detail in a moving stimulus, freeing perceptual capacity for the bigger picture. A codec that knows this can spend fewer bits on a fast-pan frame and the viewer will not notice.
Figure 2. The eye's contrast sensitivity is not flat. It peaks in the middle of the frequency range and falls off at both ends — the foundation of frequency-weighted quantisation.
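To make the shape of the curve concrete, here is a minimal sketch of one commonly quoted analytic CSF fit, the one attributed to Mannos and Sakrison. Treat it as an illustration rather than the model any particular codec uses: the function names are ours, and different CSF models place the peak at somewhat different frequencies (this fit peaks a little above the 4–6 cycles-per-degree range quoted earlier).

```python
import math


def csf_mannos_sakrison(f_cpd: float) -> float:
    """Relative contrast sensitivity at spatial frequency f_cpd (cycles per degree).

    Commonly quoted form of the Mannos-Sakrison fit; the output is a
    relative weight, not an absolute threshold.
    """
    return 2.6 * (0.0192 + 0.114 * f_cpd) * math.exp(-((0.114 * f_cpd) ** 1.1))


def peak_frequency(max_cpd: int = 60) -> int:
    """Integer frequency in 1..max_cpd where the modelled sensitivity is highest."""
    return max(range(1, max_cpd + 1), key=csf_mannos_sakrison)


if __name__ == "__main__":
    # Print a few points along the curve: low, middle, and high frequencies.
    for f in (1, 4, 8, 16, 32):
        print(f"{f:>2} cycles/degree -> relative sensitivity {csf_mannos_sakrison(f):.3f}")
    print("modelled peak near", peak_frequency(), "cycles/degree")
```

The useful property is the band-pass shape: sensitivity is low for very coarse patterns, highest in the middle, and falls away quickly for fine detail, which is the region an encoder can quantise most aggressively.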
Trick 1 — Chroma subsampling: half the colour data, no visible loss
The first and biggest psycho-visual trick a codec plays is to throw away three-quarters of the chroma data before any actual compression has happened. The mechanism is straightforward. The signal arrives in some RGB form from the sensor; the encoder converts it to a luma-plus-two-chroma representation (Y′CbCr, with conversion coefficients defined by standards such as BT.601, BT.709, and BT.2020); then the chroma planes are downsampled. The dominant pattern is 4:2:0: full-resolution luma, chroma at half resolution horizontally and vertically. A 1920×1080 luma plane keeps every pixel; the two chroma planes carry only 960×540 samples each.
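The two steps can be sketched in a few lines. This is a simplified illustration, not production colour handling: it assumes BT.709 luma coefficients, gamma-encoded R′G′B′ values normalised to 0..1, full-range output, even plane dimensions, and a plain 2×2 average as the downsampling filter (real encoders use better filters and defined chroma siting). The function names are ours.

```python
def rgb_to_ycbcr_bt709(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert R'G'B' in [0, 1] to Y' in [0, 1] and Cb/Cr in [-0.5, 0.5] (BT.709)."""
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    cb = (b - y) / 1.8556  # scaled blue-difference signal
    cr = (r - y) / 1.5748  # scaled red-difference signal
    return y, cb, cr


def subsample_420(plane: list[list[float]]) -> list[list[float]]:
    """Halve a chroma plane on both axes by averaging each 2x2 neighbourhood.

    Assumes the plane has an even number of rows and columns.
    """
    rows, cols = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, cols, 2)
        ]
        for y in range(0, rows, 2)
    ]


if __name__ == "__main__":
    print(rgb_to_ycbcr_bt709(1.0, 0.0, 0.0))  # pure red: low luma, Cr at its maximum
    cb_plane = [[0.0, 0.0, 1.0, 1.0]] * 2 + [[2.0, 2.0, 3.0, 3.0]] * 2
    print(subsample_420(cb_plane))  # 4x4 chroma plane becomes 2x2
```

The luma plane is left untouched; only the two colour-difference planes go through `subsample_420`, which is why the sharpness of the picture survives while a quarter of the chroma samples remain.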
Run the maths once. A raw 4:4:4 1080p frame at 8 bits per sample costs 1920 × 1080 × 3 × 8 = 49.8 Mbit per frame. The same frame at 4:2:0 is (1920 × 1080 + 2 × 960 × 540) × 8 = 24.9 Mbit, exactly half [5]. The codec has not done a single lossy compression step yet and has already shrunk the picture by 50%, and yet on a normal display at a normal viewing distance a non-expert viewer cannot reliably tell 4:4:4 from 4:2:0. That is psycho-visual redundancy in its purest form: the saved data was real, but the viewer's perceptual system was never going to use it.
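The same arithmetic generalises to the other sampling patterns. The sketch below is a small calculator, with names of our own choosing, that reproduces the two figures above and lets you plug in other resolutions and bit depths.

```python
from fractions import Fraction

# Chroma samples per plane, expressed as a fraction of the luma sample count.
CHROMA_FRACTION = {
    "4:4:4": Fraction(1, 1),
    "4:2:2": Fraction(1, 2),
    "4:2:0": Fraction(1, 4),
    "4:1:1": Fraction(1, 4),
}


def raw_frame_bits(width: int, height: int, sampling: str, bit_depth: int = 8) -> int:
    """Uncompressed size of one frame in bits: one luma plane plus two chroma planes."""
    luma_samples = width * height
    chroma_samples = 2 * luma_samples * CHROMA_FRACTION[sampling]
    return int((luma_samples + chroma_samples) * bit_depth)


def relative_cost(sampling: str) -> Fraction:
    """Raw size relative to 4:4:4 at the same resolution and bit depth."""
    return (1 + 2 * CHROMA_FRACTION[sampling]) / 3


if __name__ == "__main__":
    print(raw_frame_bits(1920, 1080, "4:4:4"))  # 49766400 bits, about 49.8 Mbit
    print(raw_frame_bits(1920, 1080, "4:2:0"))  # 24883200 bits, about 24.9 Mbit
    for name in CHROMA_FRACTION:
        print(name, f"{float(relative_cost(name)):.0%}")
```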
A short reality check. Chroma subsampling is not free everywhere. On a screen-share with sharp red text on a blue background, the chroma-edge resolution matters and 4:2:0 produces visible coloured fringing along the letter edges. The same effect appears in fine, saturated patterns — woven cloth, computer-generated graphics, animation linework, telemedicine images of skin lesions where the diagnostic value is in the colour boundary. Those workflows ship 4:2:2 (half chroma horizontally, full vertically) or 4:4:4 (no chroma reduction at all). The trade-off looks like this:
| Sampling | Luma | Chroma per plane | Bits-per-pixel cost | Visible cost on natural images | Where it ships |
|---|---|---|---|---|---|
| 4:4:4 | full | full | 100% | none | medical imaging, post-production, screen capture |
| 4:2:2 | full | 1/2 horizontal | 67% | imperceptible on most content | broadcast contribution, professional cameras |
| 4:2:0 | full | 1/4 (both axes) | 50% | hard to see on natural video at normal viewing distance | every streaming codec; YouTube, Netflix, broadcast delivery |
| 4:1:1 | full | 1/4 horizontal | 50% | visible on red diagonal edges | legacy DV / NTSC tape |
The lesson for product teams: 4:2:0 is the right default for any delivery pipeline, and chasing 4:4:4 in a consumer streaming context is almost always a bit-cost mistake. The exception is verticals where the value of the image lives in the chroma channel, and then 4:2:2 is usually enough.
Figure 3. The three chroma layouts that matter in practice. The lost chroma resolution sits below the eye's threshold on natural content; the saved bytes are real money.
Trick 2 — Quantisation matrices: spending bits where the eye is sensitive
After the chroma planes have been thinned, the codec splits each frame into small blocks and runs a frequency transform on each one. The transform (a Discrete Cosine Transform in classic codecs, integer approximations in modern ones) converts the block of pixels into a block of coefficients, one per spatial-frequency component. The top-left coefficient is the "DC" term, proportional to the block's average brightness; the bottom-right coefficient is the highest-frequency detail, the finest texture in the block.
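For readers who want to see the transform rather than take it on trust, here is a deliberately slow, textbook 8×8 DCT-II in pure Python. It is a sketch for intuition only: real codecs use fast, often integer, implementations, and the function name is ours.

```python
import math

N = 8  # block size


def dct_2d(block: list[list[float]]) -> list[list[float]]:
    """Textbook 8x8 DCT-II; out[v][u] is vertical frequency v, horizontal frequency u."""

    def alpha(k: int) -> float:
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (
                        block[y][x]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    )
            out[v][u] = 0.25 * alpha(u) * alpha(v) * total
    return out


if __name__ == "__main__":
    flat = [[100.0] * N for _ in range(N)]
    coeffs = dct_2d(flat)
    # A perfectly flat block has all of its energy in the DC term.
    print(round(coeffs[0][0], 6), round(coeffs[7][7], 6))

    stripes = [[255.0 if x % 2 else 0.0 for x in range(N)] for _ in range(N)]
    # Vertical stripes vary only horizontally, so only the top row is non-zero.
    print([round(c) for c in dct_2d(stripes)[0]])
```

With this normalisation the DC coefficient of a flat block equals eight times the pixel value, and every other coefficient is zero up to floating-point noise. That concentration of energy in a few coefficients is what makes the next step, quantisation, so effective.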
The encoder then quantises every coefficient — divides it by a step size and rounds. Quantisation is where lossy compression actually happens, and it is the lever the psycho-visual model pulls hardest. The trick: instead of using the same step size for every coefficient, the codec uses a different step size per frequency — a small step (preserve precision) for the frequencies the eye is sensitive to, and a large step (throw away precision) for the frequencies it is not. The table of step sizes is the quantisation matrix.
The original JPEG quantisation matrix from 1992 was the first widely deployed psycho-visual table [6]. It was built from psychophysical experiments: show viewers two patterns, ask at what amplitude they can tell them apart, and use that threshold as the step size. The matrix looks like this when normalised:
16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
Read it as a heat map. Low frequencies (top-left) get small step sizes, 10 to 16, so they survive almost intact. High frequencies (bottom-right) get step sizes of 99 to 121, six to eight times harsher, and most of those coefficients quantise straight to zero. The matrix is not symmetric: horizontal frequencies get slightly different weights from vertical frequencies, and diagonal frequencies are treated harshest of all, because the eye's contrast sensitivity is itself anisotropic. The result: about 90% of the visible information sits in the top-left third of the block, and the codec spends 90% of its bits exactly there.
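The effect is easy to reproduce. The sketch below quantises a hypothetical block in which every coefficient has the same amplitude (45, chosen purely for illustration), once with the JPEG luminance table above and once with a flat table. Real blocks have far more energy at low frequencies, so treat this as a demonstration of the policy, not of typical statistics; the helper names are ours.

```python
# JPEG example luminance quantisation table (Annex K), as listed above.
JPEG_LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

FLAT_Q = [[16] * 8 for _ in range(8)]  # same step size at every frequency


def quantise(coeffs, matrix):
    """Divide each coefficient by its step size and round to the nearest integer level."""
    return [[round(c / q) for c, q in zip(c_row, q_row)] for c_row, q_row in zip(coeffs, matrix)]


def dequantise(levels, matrix):
    """Decoder side: multiply each level back by its step size."""
    return [[lv * q for lv, q in zip(l_row, q_row)] for l_row, q_row in zip(levels, matrix)]


def count_nonzero(levels):
    return sum(1 for row in levels for lv in row if lv != 0)


if __name__ == "__main__":
    coeffs = [[45.0] * 8 for _ in range(8)]  # hypothetical flat-spectrum block
    perceptual = quantise(coeffs, JPEG_LUMA_Q)
    flat = quantise(coeffs, FLAT_Q)
    print(count_nonzero(perceptual), "of 64 coefficients survive the JPEG table")
    print(count_nonzero(flat), "of 64 coefficients survive the flat table")
    print(dequantise(perceptual, JPEG_LUMA_Q)[0][:4])  # low frequencies come back close to 45
```

With the perceptual table, the 16 coefficients that vanish all sit in the bottom-right corner, which is where the eye is least likely to miss them; the flat table keeps everything and pays for it in bits.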
Modern codecs ship their own scaling lists rather than the JPEG matrix, but the idea is identical. H.264 ships an 8×8 default scaling list derived from CSF measurements; HEVC adds support for separate matrices per transform size and per slice type; VVC and AV1 carry quantisation matrices keyed by both block size and prediction mode [7]. The matrices have become more granular, but the philosophy has not changed in 34 years: quantise the eye's blind spots first.
A common mistake is to call quantisation matrices a "compression algorithm". They are not. They are a bit-budgeting policy layered on top of quantisation; the math of dividing-and-rounding is identical. What the matrix changes is which coefficients survive at any given target bitrate. Disable the psycho-visual matrix (encode with a flat scaling list) and you quantise every frequency with the same step size, including the ones the eye cannot see: same bitrate, lower perceived quality. Defaults vary by encoder, so do not assume the feature is on: JPEG encoders always apply a table, while x264 and x265 default to flat scaling lists and rely on adaptive quantisation and psy-rd for their perceptual tuning. If you ever inherit an encoder profile that switched matrices off "for compatibility", testing them switched back on is one of the cheaper quality experiments in the catalogue.
Trick 3 — Adaptive quantisation: bigger errors where the eye won't see them
Quantisation matrices change the bit budget within a block, by frequency. Adaptive quantisation (AQ) changes the bit budget across a frame, by location. The principle: spend more bits on the regions the eye watches, and fewer on the regions it does not.
Three properties make a region eye-friendly to under-encode: high spatial variance (texture), low luma (dark areas), and high local motion. The first one — variance — is the most powerful, because it triggers contrast masking. The eye's ability to detect a defect inside a busy texture is much lower than its ability to detect the same defect on a flat surface. Show somebody a flat blue sky with a faint compression block: they spot it instantly. Show them the same block inside a frame of woven fabric or moving leaves: they cannot find it.
Open-source encoders expose this lever directly. SVT-AV1's variance-based AQ uses 8×8, 16×16, 32×32, and 64×64 block variances as a proxy for contrast and shifts the QP up or down per block accordingly [8]. x264 calls the same idea "AQ mode 1" (variance AQ) and "AQ mode 2" (auto-variance AQ), with strength controlled by the --aq-strength parameter; default values cover 60–80% of the perceptual win available without making the picture look uneven [9]. x265 extends the same family with further auto-variance modes that add a bias toward dark scenes and use edge information. The savings on real content are not theoretical: published measurements show 10–25% bitrate reduction at the same VMAF score, and up to 6% additional savings when temporal masking is layered on top of an already-tuned HEVC pipeline [10].
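The core of variance-based AQ fits in a dozen lines. The sketch below is a simplified illustration and not the formula used by x264, x265, or SVT-AV1: it takes the log of each block's pixel variance as a stand-in for texture, compares it with the frame average, and turns the difference into a QP offset. The function names and the `strength` scaling are assumptions made for the example.

```python
import math


def block_variance(block: list[list[float]]) -> float:
    """Population variance of the pixel values in one block."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def aq_qp_offsets(blocks: list[list[list[float]]], strength: float = 1.0) -> list[float]:
    """One QP offset per block: positive means coarser quantisation, negative means finer.

    Busy blocks (high variance) sit above the frame average and get a positive
    offset; flat blocks sit below it and get a negative one.
    """
    energies = [math.log2(block_variance(b) + 1.0) for b in blocks]
    average = sum(energies) / len(energies)
    return [strength * (e - average) for e in energies]


if __name__ == "__main__":
    flat_sky = [[120.0] * 8 for _ in range(8)]
    fabric = [[100.0 if (x + y) % 2 else 140.0 for x in range(8)] for y in range(8)]
    print(aq_qp_offsets([flat_sky, fabric]))  # sky gets a negative offset, fabric a positive one
```

Because the offsets are measured against the frame average, they sum to zero: the encoder is moving bits from texture to flat areas rather than changing the overall budget. Setting `strength` to 0 switches the redistribution off.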
The second axis — luma — exploits the eye's lower sensitivity to noise in dark regions. The mechanism is the same: in a shadow, the local brightness is low and the noise floor of the cone cells is high, so a quantisation error of, say, 4 levels is invisible. Push the QP up in shadows and you save bits without visible cost. AV1 even ships a dedicated --enable-dark-region-boost flag in some implementations.
The third axis — motion — is the most counter-intuitive. A fast-moving region has high temporal energy, and the eye's tracking system trades spatial acuity for stable fixation. The result is temporal masking: a few frames of quantisation error during a hard cut or a fast pan are invisible. Modern encoders detect scene cuts and increase QP for the first one or two B-frames after the cut, knowing the viewer's perceptual system is busy re-acquiring the new scene. The effect is large enough that scene-cut quantisation is one of the standard tools in every commercial codec.
Figure 4. An adaptive-quantisation map for one frame. The encoder is making four bets at once: more bits to the sky, fewer to the texture, fewer to the dark, fewer to the motion.
The fourth lever — psy-rd and "the eye prefers detail over fidelity"
There is one more psycho-visual tool worth knowing, because it changes how the encoder makes its core decision: which way to compress this block? That decision is called rate-distortion optimisation (RDO) — pick the coding mode that gives the best balance between bits spent and distortion produced. The default "distortion" measure is plain sum-of-squared-error against the source: a literal pixel-by-pixel comparison.
The problem with plain SSE is that it rewards blurry reconstructions. A flat, smoothed block has low SSE against the original even if it has lost all the texture; a sharp, slightly-wrong block has high SSE even if it looks more like the source to a viewer. Left to its own devices, an SSE-optimising encoder will blur the picture every chance it gets, because blur scores well by the math. Viewers hate it. They prefer a slightly-wrong-but-detailed reconstruction over a faithful-but-mushy one, every time [11].
The fix is to add a second term to the cost function: a penalty for blocks whose visual "energy" (high-frequency content) differs from the source's. The x264 implementation called this psy-rd and shipped with a default strength of 1.0; x265 reworked it for HEVC; AV1 encoders carry a similar mechanism under different names (psy-rdoq, psy-tune). The effect is exactly the one the math implies: the encoder spends slightly more bits to keep texture and slightly less on hitting the exact pixel values. Pixel-fidelity metrics drop a fraction (PSNR, the simplest pixel-wise measure, and often SSIM as well, which is why x264 switches psy-rd off under --tune psnr and --tune ssim); human raters and perceptual metrics such as VMAF tend to rise. It is one of the few cases in video where lower PSNR really does mean better quality [11].
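A toy example shows why the extra term changes the decision. The sketch below is not x264's psy-rd: real encoders use their own energy measures and scaling. Here "energy" is simply the sum of absolute deviations from the block mean, the block is a one-dimensional row of eight pixels, and the strength value of 50 is arbitrary. All names are ours.

```python
def sse(src: list[float], rec: list[float]) -> float:
    """Sum of squared errors between source and reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(src, rec))


def ac_energy(samples: list[float]) -> float:
    """Crude texture measure: total absolute deviation from the mean."""
    mean = sum(samples) / len(samples)
    return sum(abs(s - mean) for s in samples)


def psy_cost(src: list[float], rec: list[float], psy_strength: float) -> float:
    """SSE plus a penalty for changing the amount of texture in the block."""
    return sse(src, rec) + psy_strength * abs(ac_energy(src) - ac_energy(rec))


if __name__ == "__main__":
    source = [138.0, 118.0] * 4    # fine texture around a mean of 128
    blurred = [128.0] * 8          # texture smoothed away entirely
    detailed = [118.0, 138.0] * 4  # same texture, but shifted by one pixel

    # Plain SSE prefers the blurred block...
    print(sse(source, blurred), sse(source, detailed))
    # ...while the psy-weighted cost prefers the one that kept its texture.
    print(psy_cost(source, blurred, 50.0), psy_cost(source, detailed, 50.0))
```

The blurred candidate wins on SSE (800 against 3200) and loses once the energy penalty is added (4800 against 3200), which is the trade the paragraph above describes: worse pixel accuracy, better preserved texture.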
The lesson is bigger than one parameter. PSNR is a literalist; the human eye is not. A psycho-visual encoder is built around the second fact, and any toolchain whose only quality target is PSNR is leaving 5–15% of perceived quality on the floor. This is exactly the gap that VMAF and other perceptual metrics were invented to close.
Where Fora Soft fits in
We build video-streaming, video-conferencing, telemedicine, and OTT systems where psycho-visual encoding decisions are not academic — they are the difference between a profitable bitrate ladder and one that hurts the business case. In our streaming deployments we routinely tune chroma format, AQ strength, and psy-rd values per content category, because a music-video service and a security-surveillance archive want completely different defaults; the surveillance archive cares about high-frequency detail (a license plate), the music video cares about smooth colour (a stage light). In telemedicine and AR/VR projects the trade-offs shift again — diagnostic chroma matters in a dermatology stream, peripheral acuity does not. The pattern under all of it is the same: know what the eye will actually look at, then quantise around that knowledge.
A common mistake: turning psycho-visual features off "to look at the data"
Engineers under quality pressure sometimes disable adaptive quantisation, psy-rd, and quantisation matrices "to get a clean baseline" — and then publish bitrate numbers from that baseline as the project default. The numbers are not wrong, but they are not the numbers a real user would ever see, because every real-world streaming pipeline ships with these features on. A baseline that turns them off has invented a pessimistic codec that is 10–25% worse than the one your viewers actually receive, and any feature you compare against it will look better than it really is.
If you must benchmark against a "clean" reference, fine — but document which psycho-visual features were on or off, and never compare a "clean" run against a "tuned" run without saying which is which. The internal language for this is "feature parity": every run in the comparison should have the same psycho-visual toggle set, or the numbers cannot be compared.
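One lightweight way to enforce feature parity is to store the psycho-visual settings next to every benchmark result and refuse to compare runs whose settings differ. The sketch below is a hypothetical helper, not part of any encoder or benchmarking tool; the field names loosely mirror the x264 options discussed above and should be adapted to your own pipeline.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PsyConfig:
    """Psycho-visual toggles recorded alongside a benchmark run (hypothetical schema)."""

    aq_mode: int
    aq_strength: float
    psy_rd: float
    perceptual_matrices: bool


def parity_mismatches(a: PsyConfig, b: PsyConfig) -> list[str]:
    """Names of the settings that differ between two runs; empty means feature parity."""
    left, right = asdict(a), asdict(b)
    return [name for name in left if left[name] != right[name]]


def assert_comparable(a: PsyConfig, b: PsyConfig) -> None:
    """Raise if two runs were encoded with different psycho-visual settings."""
    diff = parity_mismatches(a, b)
    if diff:
        raise ValueError("runs are not comparable; settings differ: " + ", ".join(diff))


if __name__ == "__main__":
    clean = PsyConfig(aq_mode=0, aq_strength=0.0, psy_rd=0.0, perceptual_matrices=False)
    tuned = PsyConfig(aq_mode=2, aq_strength=1.0, psy_rd=1.0, perceptual_matrices=False)
    print(parity_mismatches(clean, tuned))
```

Calling `assert_comparable` before plotting two bitrate curves on the same chart turns the "which run was tuned?" question from tribal knowledge into a hard error.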
References
1. Wallace, G. K. (1992). "The JPEG Still Picture Compression Standard." IEEE Transactions on Consumer Electronics, vol. 38, no. 1. https://ieeexplore.ieee.org/document/125072. Accessed 2026-05-16.
2. Curcio, C. A. et al. (1990). "Human photoreceptor topography." Journal of Comparative Neurology. Photoreceptor counts and distribution. https://onlinelibrary.wiley.com/doi/10.1002/cne.902920402. Accessed 2026-05-16.
3. Poynton, C. (2012). Digital Video and HD: Algorithms and Interfaces (2nd ed.). Morgan Kaufmann. Luma/chroma separation in the visual system. Accessed 2026-05-16.
4. Mannos, J. L., Sakrison, D. J. (1974). "The Effects of a Visual Fidelity Criterion on the Encoding of Images." IEEE Transactions on Information Theory. Original CSF measurements used in image-coding research. Accessed 2026-05-16.
5. ITU-R BT.601 / BT.709 specifications on chroma sampling structures. https://www.itu.int/rec/R-REC-BT.709/. Accessed 2026-05-16.
6. CCIR / CCITT JPEG specification (1992), Annex K, Tables K.1 / K.2: default luminance and chrominance quantisation matrices. https://www.w3.org/Graphics/JPEG/itu-t81.pdf. Accessed 2026-05-16.
7. Sullivan, G. J., Ohm, J.-R., Han, W.-J., Wiegand, T. (2012). "Overview of the High Efficiency Video Coding (HEVC) Standard." IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12. Scaling list and quantisation matrix overview. https://ieeexplore.ieee.org/document/6316136. Accessed 2026-05-16.
8. SVT-AV1 documentation, Appendix: Variance-Based Adaptive Quantization. https://gitlab.com/AOMediaCodec/SVT-AV1/-/blob/master/Docs/Appendix-Variance-Based-Adaptive-Quantization.md. Accessed 2026-05-16.
9. x264 manual (Debian, current build): --aq-mode, --aq-strength parameters. https://manpages.debian.org/experimental/x264/x264.1.en.html. Accessed 2026-05-16.
10. Adzic, V., Kalva, H., Furht, B. (2013). "Exploring visual temporal masking for video compression." Up to 6% additional bitrate savings on top of state-of-the-art HEVC. https://www.cse.fau.edu/~hari/files/2028/Adzic%20et%20al_2013_Exploring%20visual%20temporal%20masking%20for%20video%20compression.pdf. Accessed 2026-05-16.
11. Wang, S., Zeng, K., Rehman, A., Wang, Z. (2017). "Perceptual Evaluation of Psychovisual Rate-Distortion Enhancement in Video Coding." University of Waterloo. Confirms psy-rd lowers PSNR but improves perceptual quality. https://ece.uwaterloo.ca/~z70wang/publications/HVEI17_PsyRD.pdf. Accessed 2026-05-16.


