Masking is the psychovisual phenomenon in which the human visual system is much worse at noticing detail in busy, complex, or moving parts of an image than in calm, smooth, static parts. Stare at a clear blue sky and you'll spot a single faint dust speck immediately. Stare at a snowstorm and you'd never notice if a few snowflakes were missing or distorted. The eye's sensitivity to artefacts varies enormously with local content, and modern video encoders exploit this aggressively.

Three types of masking matter for video. Spatial masking: highly detailed or textured regions (rough fabric, grass, gravel) hide compression artefacts. Temporal masking: fast-moving regions hide artefacts because the eye can't track rapid changes precisely. Luminance masking: very dark or very bright regions hide artefacts because the eye has reduced sensitivity at extreme brightness levels. Each one marks a region where the encoder can quantize harder, drop more detail, and save bits, because the viewer simply won't notice.
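
As a rough illustration of how spatial masking can steer quantization, the sketch below measures the luma variance of each block and turns it into a quantizer (QP) offset: busy blocks get a positive offset (coarser quantization), flat blocks a negative one. It is a simplified, hypothetical model in the spirit of variance-based adaptive quantization, not the algorithm of any particular encoder. The function names, the 4x4 block size, the `strength` parameter, and the clamp of 6 are all assumptions made for the example.

```python
import math

def block_variances(frame, block=4):
    """Return one luma variance per block x block tile of a 2D list of pixels."""
    height, width = len(frame), len(frame[0])
    rows = []
    for by in range(0, height - block + 1, block):
        row = []
        for bx in range(0, width - block + 1, block):
            pixels = [frame[y][x]
                      for y in range(by, by + block)
                      for x in range(bx, bx + block)]
            mean = sum(pixels) / len(pixels)
            row.append(sum((p - mean) ** 2 for p in pixels) / len(pixels))
        rows.append(row)
    return rows

def qp_offsets(frame, block=4, strength=1.0, max_offset=6):
    """Map block variance to a QP offset (positive = quantize harder).

    Assumption: offsets are proportional to how far a block's log-variance
    sits from the frame average, then clamped to +/- max_offset.
    """
    energy = [[math.log2(v + 1.0) for v in row]
              for row in block_variances(frame, block)]
    flat = [e for row in energy for e in row]
    average = sum(flat) / len(flat)
    return [[max(-max_offset, min(max_offset, round(strength * (e - average))))
             for e in row]
            for row in energy]

# Toy 4x8 frame: a flat grey block next to a high-contrast checkerboard block.
smooth = [[128] * 4 for _ in range(4)]
busy = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]
frame = [s + b for s, b in zip(smooth, busy)]

print(qp_offsets(frame))  # [[-6, 6]]: protect the flat block, squeeze the busy one
```

A real encoder would combine a signal like this with temporal and luminance terms and with its rate control; the point here is only that the masked (textured) block is the one that receives the coarser quantizer.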

For a product team, masking is the theoretical foundation for perceptual quantization and the modern "psycho-visual" encoder modes that help a 4K HDR stream at 15 Mbps look very close to its master under normal viewing conditions. x264 and x265 with their psy-rd options, AV1 encoders with perceptual tuning modes, and content-adaptive encoding offerings from vendors such as Bitmovin, Mux, and NETINT all lean heavily on masking-aware bit allocation. The takeaway: when an encoder vendor talks about "perceptual optimisation" or "psy-tuning", they're talking about exploiting masking. It's invisible to the viewer (that's the whole point) but commercially significant: typical savings are 10–25% in bitrate at the same perceived quality compared to non-perceptual encoding.
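
To put the 10–25% range in concrete terms, the short sketch below applies it to a hypothetical non-perceptual baseline of 20 Mbps. The baseline figure and the helper names are assumptions chosen for illustration, not measurements.

```python
def bitrate_after_saving(baseline_mbps, saving_fraction):
    """Bitrate needed once a fractional saving is applied to the baseline."""
    return baseline_mbps * (1.0 - saving_fraction)

def gigabytes_per_hour(mbps):
    """Convert Mbit/s to decimal GB per hour: x 3600 s, / 8 bits, / 1000 MB."""
    return mbps * 3600 / 8 / 1000

baseline = 20.0  # hypothetical non-perceptual encode, in Mbps
for saving in (0.10, 0.25):
    tuned = bitrate_after_saving(baseline, saving)
    print(f"{saving:.0%} saving: {tuned:.2f} Mbps, "
          f"{gigabytes_per_hour(tuned):.2f} GB per viewing hour")
```

Under these assumptions the baseline costs 9 GB per viewing hour, and the tuned encodes land between 8.1 GB (10% saving) and 6.75 GB (25% saving) per hour, which is where the commercial significance comes from once it is multiplied across an audience.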