MS-SSIM (Multi-Scale Structural Similarity Index) is an extension of SSIM that measures perceptual quality at several spatial scales at once. Where SSIM compares two images at a single resolution, MS-SSIM repeatedly low-pass filters and downsamples both the source and the compressed version (by a factor of two per step, five scales in the original formulation), evaluates the SSIM comparison terms at each scale, and combines them into a single score as a weighted product. The motivation is that the visibility of a distortion depends on the scale of the detail it affects and on viewing conditions such as display resolution and viewing distance: we notice a smudged face more easily than a smudged blade of grass. A multi-scale measure captures that better than a single-scale one.
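The downsample, compare, and combine procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a uniform 8x8 window and 2x2 averaging for downsampling, where production implementations typically use a Gaussian window and a proper low-pass filter, so its numbers will not match those tools exactly. The per-scale exponents are the ones published in the original MS-SSIM paper; the function names (`ms_ssim`, `_ssim_terms`, `_box_mean`, `_halve`) are hypothetical.

```python
import numpy as np

# Per-scale exponents from Wang, Simoncelli & Bovik (2003), finest scale first.
WEIGHTS = np.array([0.0448, 0.2856, 0.3001, 0.2363, 0.1333])


def _box_mean(img, k):
    """Mean of every k x k window (valid positions only), via an integral image."""
    c = np.cumsum(np.cumsum(np.pad(img, ((1, 0), (1, 0))), axis=0), axis=1)
    return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / (k * k)


def _ssim_terms(x, y, data_range, win):
    """Return (mean SSIM, mean contrast-structure term) at a single scale."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = _box_mean(x, win), _box_mean(y, win)
    var_x = _box_mean(x * x, win) - mu_x ** 2
    var_y = _box_mean(y * y, win) - mu_y ** 2
    cov = _box_mean(x * y, win) - mu_x * mu_y
    cs_map = (2 * cov + c2) / (var_x + var_y + c2)
    lum_map = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    return float(np.mean(lum_map * cs_map)), float(np.mean(cs_map))


def _halve(img):
    """Downsample by 2 using 2x2 averaging (assumption: stands in for a low-pass filter)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]  # drop an odd trailing row/column
    return (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0


def ms_ssim(ref, dist, data_range=255.0, win=8):
    """MS-SSIM sketch for two same-sized 2-D luma arrays."""
    x = np.asarray(ref, dtype=np.float64)
    y = np.asarray(dist, dtype=np.float64)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("expected two 2-D arrays of identical shape")
    levels = len(WEIGHTS)
    if min(x.shape) < win * 2 ** (levels - 1):
        raise ValueError("image too small for five scales")
    score = 1.0
    for i, weight in enumerate(WEIGHTS):
        ssim, cs = _ssim_terms(x, y, data_range, win)
        last = i == levels - 1
        # Luminance only enters at the coarsest scale; finer scales use
        # the contrast-structure term alone.
        term = ssim if last else cs
        # Assumption: clamp negative terms to 0 so the fractional power is defined.
        score *= max(term, 0.0) ** weight
        if not last:
            x, y = _halve(x), _halve(y)
    return score
```

Calling `ms_ssim(frame, frame)` on any sufficiently large frame evaluates to 1.0, and adding progressively stronger noise to one input lowers the score. The choice of a product with exponents, not a plain average, follows the original formulation: a severe failure at any one scale pulls the whole score down.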

The score range is nominally the same as SSIM: 0 to 1, where 1.0 means identical. As a rule of thumb on natural content, anything above 0.97 looks essentially flawless, 0.93–0.97 is good streaming quality, and degradation starts to show below 0.90. MS-SSIM generally correlates better with human judgments than plain SSIM, especially when comparing encodes that exhibit different artefact types (denoising losses, motion blur, fine-texture loss), because each kind of distortion shows up most strongly at a different scale and the per-scale weights reflect how much each scale matters to viewers. The trade-off is computational cost: MS-SSIM is slower to compute than SSIM, by an amount that depends on the implementation, though still fast enough to run on every frame in a production pipeline.

For a product team, MS-SSIM is the metric to choose when SSIM isn't sensitive enough but VMAF is overkill or unavailable. In practice, a typical encoding pipeline reports PSNR (fast sanity check), SSIM or MS-SSIM (day-to-day comparison), and VMAF (final authoritative quality measurement). In codec research papers, MS-SSIM is one of the most frequently reported objective metrics alongside PSNR because it is precisely defined, well understood, and predates VMAF by more than a decade, which means comparisons across older codec generations are often available in MS-SSIM. FFmpeg's built-in ssim filter computes single-scale SSIM (with per-frame logging via its stats_file option); MS-SSIM is available through the libvmaf filter, which requires an FFmpeg build with libvmaf enabled.
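One way to obtain MS-SSIM from FFmpeg is through its libvmaf filter. The sketch below only builds the command line and does not run it. It assumes an FFmpeg build with libvmaf enabled and a libvmaf version that exposes the `float_ms_ssim` feature extractor; option names have changed between releases, so check `ffmpeg -h filter=libvmaf` on your build first. The helper name `ms_ssim_command` and the file names are hypothetical.

```python
import shlex


def ms_ssim_command(distorted, reference, log_path="ms_ssim.json"):
    """Build (but do not run) an ffmpeg command that logs per-frame MS-SSIM."""
    # libvmaf treats the first input as the distorted ("main") stream
    # and the second input as the reference.
    filtergraph = (
        "[0:v][1:v]libvmaf="
        "feature=name=float_ms_ssim"
        f":log_fmt=json:log_path={log_path}"
    )
    return [
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", filtergraph,
        "-f", "null", "-",  # discard the video output; only the log is wanted
    ]


# Print a shell-quoted version for inspection or copy-paste.
print(shlex.join(ms_ssim_command("encoded.mp4", "source.mp4")))
```

Both inputs need the same resolution and aligned frames for the scores to be meaningful, and a `log_path` containing characters that are special in filtergraph syntax (such as a colon) would need escaping, which this sketch does not handle.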