Why this matters
If you compare codecs, read encoder benchmarks, or pick an objective metric for a quality gate, MS-SSIM is one of the three numbers you will meet constantly — alongside SSIM and VMAF — and it is the one most people quote without knowing what the "multi-scale" part actually buys them. It shows up in libvmaf output, in AV1-vs-HEVC comparison tables, and in academic results, usually as a bare number with no explanation of the five scales or the weights underneath. This article gives you that explanation: why one scale is not enough, how the five-scale machine works, the arithmetic shown once on real numbers, and the content where MS-SSIM still lies. It is the bridge between single-scale SSIM and the perception-trained metrics that followed.
The single-scale problem: one window assumes one screen
Start with the gap MS-SSIM was built to close, because the whole design follows from it. Single-scale SSIM — the Structural Similarity Index — slides a small window across the image (an 11×11 patch in the original paper) and at each spot asks whether the local brightness, contrast, and structure survived. Structure is the part that matters: the pattern of light and dark that the eye reads as edges and texture. SSIM measures it at exactly one window size, on the image at its native resolution.
That single window quietly assumes one viewing condition. Whether a given amount of distortion is visible depends on three things the authors of MS-SSIM named directly: the sampling density of the image (how many pixels), the distance from the screen to your eye, and your visual system itself (Wang, Simoncelli, and Bovik, 2003). Move closer and fine detail becomes visible; sit far back and the same fine detail blurs away while coarse structure dominates what you notice. A metric that measures structure at one fixed scale can only be right for one of those situations.
Here is the analogy to carry. Single-scale SSIM is a single fixed-focus camera. It is sharp at exactly one distance and slightly wrong everywhere else. MS-SSIM is the same camera mounted on a rig that also shoots the scene from several distances and combines the shots — so the verdict no longer hinges on guessing the one correct viewing condition in advance.
There is a second, sharper reason single-scale falls short, and the original paper measured it. Compression does not damage every scale equally. Codecs like JPEG, and every modern video codec after it, throw away fine detail far more aggressively than coarse structure. So a compressed frame "looks" more similar to the original the more you blur and shrink it — which means single-scale SSIM gives a systematically different verdict depending on which scale you happen to measure. The fix is not to pick the lucky scale; it is to measure all of them and weight them by how much each matters to the eye.
What MS-SSIM does: measure structure down a pyramid
The machine is simpler than it sounds. Take the reference frame and the distorted frame. At the current size, compute the contrast and structure comparisons — the same two terms SSIM uses, which we will unpack in a second. Then low-pass-filter both frames (a gentle blur that removes the finest detail before it can alias) and downsample them by a factor of 2, halving each side. Repeat. The original paper uses five scales: Scale 1 is the image at full resolution, and each step down is half the width and half the height of the one before, so Scale 5 is the image shrunk by a factor of 16 on each axis (Wang, Simoncelli, and Bovik, 2003).
Figure 1. MS-SSIM runs the structure measurement down a five-level pyramid. Contrast and structure are measured at every scale; brightness is measured once, at the coarsest. The five results combine into one number.
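The pyramid's geometry can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the 2×2 box average used as the low-pass filter is an assumption chosen for brevity (real implementations use their own filters), and the function names `downsample_2x` and `build_pyramid` are hypothetical.

```python
def downsample_2x(img):
    """Halve each side by averaging non-overlapping 2x2 blocks.

    Assumption: a 2x2 box average stands in for the low-pass filter;
    real implementations use their own filters.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def build_pyramid(img, scales=5):
    """Return [scale 1 (full size), ..., scale `scales` (coarsest)]."""
    levels = [img]
    for _ in range(scales - 1):
        levels.append(downsample_2x(levels[-1]))
    return levels


# A 512x512 synthetic gradient as a stand-in frame (not real image data).
frame = [[float((r + c) % 256) for c in range(512)] for r in range(512)]
sizes = [(len(level[0]), len(level)) for level in build_pyramid(frame)]
print(sizes)
```

Each level has half the width and height of the one before, so a 512×512 frame ends at 32×32 on the fifth scale.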
Two design choices in that machine are worth pausing on, because they are what make MS-SSIM different from "run SSIM five times and average."
Contrast and structure at every scale; brightness only at the coarsest. SSIM is built from three comparisons — a luminance term l (did brightness match?), a contrast term c (did the spread of light and dark match?), and a structure term s (did the pattern survive?). The full math and the analogies live in the SSIM article; here you only need the shape. MS-SSIM measures contrast and structure at every one of the five scales, but measures luminance just once, at the smallest scale. The reason is practical: absolute brightness is a low-frequency, large-scale property, so measuring it repeatedly at every scale would double-count it. Measuring it once, at the coarsest level where it lives, is enough.
The combination is a weighted product, not an average. The five-scale result is assembled with this formula (this is the metric's defining equation, from the 2003 paper):
$$\text{MS-SSIM}(x,y) = [l_M(x,y)]^{\alpha_M} \cdot \prod_{j=1}^{M} [c_j(x,y)]^{\beta_j} \, [s_j(x,y)]^{\gamma_j}$$
Read it in plain language: take the brightness term at the coarsest scale M, and multiply it by the contrast and structure terms from every scale j, each raised to a power that sets how much that scale counts. The powers (α, β, γ) are the weights. To keep things simple the authors set the three exponents equal at each scale (so each scale has one weight), and they normalized the five weights to sum to 1. That normalization keeps MS-SSIM on a footing comparable to single-scale SSIM. Two identical images score exactly 1 either way, because every term in the product is then 1.
A small simplification makes the worked example below cleaner. Because the SSIM constants are chosen so that C3 = C2/2, the contrast term times the structure term collapses into one expression — call it the cs term:
cs(x,y) = c(x,y) · s(x,y) = ( 2·σxy + C2 ) / ( σx² + σy² + C2 )
where σx² and σy² are the variances of the two windows and σxy is their covariance. So at scales 1 through 4, MS-SSIM only needs the cs term; at scale 5 it also folds in luminance, which makes the scale-5 contribution a full single-scale SSIM. That is exactly what the reference implementation computes.
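As a rough sketch of that expression, the function below computes the cs term for two flattened pixel windows. It assumes uniform weighting over the window (the original SSIM paper uses an 11×11 Gaussian-weighted window) and the usual 8-bit constant C2 = (0.03 · 255)²; the function name `cs_term` and the sample pixel values are made up for illustration.

```python
def cs_term(x, y, dynamic_range=255.0, k2=0.03):
    """Contrast-structure term for two equal-length pixel windows.

    Assumption: uniform weighting over the window; the original SSIM
    paper uses an 11x11 Gaussian-weighted window instead.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c2 = (k2 * dynamic_range) ** 2  # about 58.5 for 8-bit content
    return (2.0 * cov_xy + c2) / (var_x + var_y + c2)


# Made-up windows: same light/dark pattern, but the second has less contrast.
reference = [10.0, 200.0, 30.0, 180.0, 50.0, 160.0]
flattened = [60.0, 140.0, 70.0, 130.0, 80.0, 120.0]

print(cs_term(reference, reference))  # identical windows score 1
print(cs_term(reference, flattened))  # below 1: contrast was lost
```

Note that no mean-brightness comparison appears anywhere in the function, which is why the luminance term has to be added separately at the coarsest scale.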
The weights, and why the middle scales win
The five weights are not round numbers, and they are the most interesting part of the metric. They came from a dedicated perception experiment, not a guess. The authors built an image-synthesis test: for ten 64×64 source images they generated distortions confined to a single scale at a time, across twelve distortion levels, and asked eight observers to match images of equal perceived quality across scales, at a fixed viewing distance of 32 pixels per degree of visual angle. Averaging where the observers placed equal-quality images gave the relative importance of each scale (Wang, Simoncelli, and Bovik, 2003). The resulting weights are:
| Scale | Size vs original | Weight (βj = γj) |
|---|---|---|
| Scale 1 (finest) | full resolution | 0.0448 |
| Scale 2 | 1/2 | 0.2856 |
| Scale 3 | 1/4 | 0.3001 |
| Scale 4 | 1/8 | 0.2363 |
| Scale 5 (coarsest) | 1/16 | 0.1333 |
Table 1. The calibrated cross-scale weights from Wang, Simoncelli & Bovik (2003). They sum to ~1.0. The finest scale carries only ~4.5% of the weight; scales 2–4 together carry ~82%.
Figure 2. The scale weights peak in the middle (Scale 3, 0.3001) and fall off toward both the finest and coarsest scales — the same shape as the eye's contrast sensitivity function, which is most sensitive at middle spatial frequencies.
The shape of those weights is the lesson. The finest scale — the full-resolution image, where pixel-level noise and the sharpest detail live — gets only 4.5% of the vote. The middle scales (2, 3, and 4) together carry about 82%. This mirrors a known fact about human vision: the contrast sensitivity function, which describes how sensitive the eye is at different spatial frequencies, peaks at middle frequencies (around four cycles per degree) and falls off toward both very fine and very coarse detail (Wang, Simoncelli, and Bovik, 2003). MS-SSIM did not copy the contrast sensitivity function directly — the paper is careful to say it could not, because that curve is measured at the threshold of visibility with simple sinusoids, not on visibly distorted complex images — but the calibrated weights landed on the same overall shape. That is why MS-SSIM forgives a little fine-grain noise that single-scale SSIM, measuring only the finest scale, punishes hard.
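The shares quoted above are easy to check from Table 1. The short sketch below only redoes that arithmetic; the variable names are ours, and nothing here comes from the paper beyond the five weights.

```python
# Weights from Table 1, ordered scale 1 (finest) to scale 5 (coarsest).
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

total = sum(WEIGHTS)
finest_share = WEIGHTS[0] / total
middle_share = sum(WEIGHTS[1:4]) / total  # scales 2, 3 and 4
peak_scale = WEIGHTS.index(max(WEIGHTS)) + 1

print(f"sum of weights     : {total:.4f}")
print(f"finest-scale share : {finest_share:.1%}")
print(f"scales 2-4 share   : {middle_share:.1%}")
print(f"peak at scale      : {peak_scale}")
```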
A worked example: watch the scales resolve
Numbers make it concrete. We took a 512×512 reference frame with structure at several scales and compressed it to a low-quality JPEG (quality 15), then ran the from-scratch MS-SSIM script that ships with this article. Here is what each scale reported:
| Scale | Size | cs term (c·s) | Weight |
|---|---|---|---|
| Scale 1 | 512×512 | 0.851 | 0.0448 |
| Scale 2 | 256×256 | 0.953 | 0.2856 |
| Scale 3 | 128×128 | 0.988 | 0.3001 |
| Scale 4 | 64×64 | 0.997 | 0.2363 |
| Scale 5 | 32×32 | 0.9996 (full SSIM, with luminance) | 0.1333 |
Table 2. Per-scale terms for a quality-15 JPEG of a 512×512 frame, from the companion script. The damage is worst at the finest scale and fades as the image shrinks — exactly the pattern the 2003 paper predicted for block-based compression.
Read the table top to bottom and the metric's whole argument is visible. At full resolution the structure is badly hurt — the cs term is 0.851, because JPEG's blocking and ringing wreck fine detail. But as you shrink the frame, that fine damage blurs away and the coarse structure, which JPEG mostly preserved, dominates: 0.953, then 0.988, then 0.997, then near-perfect. Single-scale SSIM, which only ever looks at Scale 1, would report 0.850 and call this a poor encode. Now combine the scales with their weights:
MS-SSIM = 0.851^0.0448 · 0.953^0.2856 · 0.988^0.3001 · 0.997^0.2363 · 0.9996^0.1333
= 0.975
The multi-scale verdict is 0.975, not 0.850. The difference is not the metric being lenient for the sake of it — it is the metric refusing to let the finest scale, which carries only 4.5% of the perceptual weight, dictate the whole score. At a normal viewing distance a person would not stare at this frame pixel-peeped to 512×512; they would see it at some real size where the coarse structure matters more, and MS-SSIM's 0.975 is closer to that judgment than single-scale's 0.850. Run the same machine on a distortion that lives mostly in fine grain — high-frequency noise — and the gap is even wider: single-scale SSIM near 0.876, MS-SSIM near 0.989. The script's --demo mode reproduces exactly that case.
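The combination step for the JPEG example can be reproduced directly from the Table 2 values. This is a sketch of the final weighted product only, not the companion script; it takes the per-scale terms as given rather than computing them from images, and the function name is hypothetical.

```python
import math

WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]
# Table 2: cs terms at scales 1-4, full SSIM (with luminance) at scale 5.
PER_SCALE = [0.851, 0.953, 0.988, 0.997, 0.9996]


def ms_ssim_from_terms(terms, weights):
    """Weighted product of per-scale terms, each raised to its weight."""
    return math.prod(t ** w for t, w in zip(terms, weights))


score = ms_ssim_from_terms(PER_SCALE, WEIGHTS)
print(f"MS-SSIM = {score:.3f}")  # prints MS-SSIM = 0.975
```

Because each term is raised to its weight before multiplying, the badly damaged finest scale (0.851) barely moves the product, while the lightly damaged middle scales set most of the result.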
Reading an MS-SSIM score
MS-SSIM runs from 0 to 1, higher is better, and 1 means identical — the same scale as SSIM, which is part of why the two get confused. But the numbers are not interchangeable. For the same content and the same distortion, MS-SSIM usually reads higher than single-scale SSIM, because the coarse scales (where compression does less damage) pull the weighted product up, as the worked example just showed. A team that quietly swaps single-scale SSIM for MS-SSIM in a dashboard will see every score jump and may think quality improved when only the metric changed.
Two practical consequences follow. First, never compare a single-scale SSIM number against an MS-SSIM number — they answer different questions and live on different effective ranges. Second, the rough quality bands you may have memorized for SSIM do not transfer one-for-one; MS-SSIM tends to crowd even closer to 1.0 for good content, which is why some pipelines report it in the decibel form 10·log₁₀(1/(1−MS-SSIM)) to stretch the crowded top end, exactly as they do for SSIM. Our 0.975 example is about 16 dB in that form.
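The decibel form is a one-line transform. A minimal sketch, with a hypothetical helper name and a guard for the score of exactly 1 (where the result is infinite):

```python
import math


def to_db(score):
    """Map a 0-1 similarity score to decibels: 10*log10(1 / (1 - score))."""
    if not 0.0 <= score < 1.0:
        raise ValueError("score must be in [0, 1); identical images give infinite dB")
    return 10.0 * math.log10(1.0 / (1.0 - score))


for s in (0.975, 0.99, 0.999):
    print(f"{s:.3f} -> {to_db(s):.2f} dB")
```

Scores that differ only in the third decimal place near 1.0 end up several decibels apart, which is the whole point of the transform.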
MS-SSIM vs SSIM vs VMAF: when to reach for which
MS-SSIM sits between plain SSIM and the perception-trained metrics. The fastest way to hold the three structure-and-perception metrics in your head is side by side, with the columns that actually matter: what each measures, and where each lies.
| Metric | Scale | What it measures | Where it lies (blind spot) | Best for |
|---|---|---|---|---|
| SSIM | 0–1 | Structure at one scale: luminance, contrast, structure | One viewing condition; luma-only, per-frame; saturates near 1 | A fast structure check; per-frame quality maps |
| MS-SSIM | 0–1 | Structure across five scales, weighted by the eye | Still luma-only and per-frame; ignores temporal artifacts; gameable; reads higher than SSIM | Codec/encode comparison robust to display size and viewing distance |
| VMAF | 0–100 | Fused, perception-trained quality (model-dependent) | Meaningless without the model named; gameable by sharpening | Predicting viewer-perceived streaming quality at scale |
Table 3. The three metrics at a glance. MS-SSIM is the multi-scale upgrade to SSIM; VMAF is the learned, perception-trained step beyond both. Pick the row by the job, and always name the conditions — the tool and version for SSIM/MS-SSIM, the model for VMAF.
Figure 3. SSIM, MS-SSIM, and VMAF compared. MS-SSIM fixes SSIM's single-scale blind spot; VMAF learns directly from human scores. Each is a better proxy than the last, and each still lies somewhere.
The progression is worth stating plainly because it tells you when to use which. PSNR measures pixel error. SSIM measures structure at one scale. MS-SSIM measures structure across scales, so it is more robust when the display size or viewing distance is not fixed, which is one reason it appears so often in published codec comparisons. VMAF learns a quality model directly from human scores and usually correlates best of all, at the cost of needing its model named. For choosing a metric for a specific job, "Choosing the right metric for the job" is the decision guide; the encoder-operator's one-screen version of all of them lives in Video Encoding's quality-metrics article, which this deep dive points to rather than repeats.
How to compute MS-SSIM with FFmpeg
Here is the first trap, and it bites people immediately. FFmpeg's built-in ssim filter is single-scale only — it does not compute MS-SSIM, no matter what options you give it. MS-SSIM in the FFmpeg world comes from libvmaf, which reports a float_ms_ssim feature alongside the VMAF score. You need a current FFmpeg (5.1 or newer) built with --enable-libvmaf:
```shell
# MS-SSIM via libvmaf. Distorted first, reference second.
# libvmaf reports float_ms_ssim per frame and pooled, in the JSON log.
ffmpeg -i distorted.mp4 -i reference.mp4 \
  -lavfi libvmaf="feature=name=float_ms_ssim:log_path=out.json:log_fmt=json" \
  -f null -
```
Then read the float_ms_ssim fields out of out.json. One quirk to know: libvmaf always computes the VMAF score too, so there is no MS-SSIM-only fast path through FFmpeg — if you only want MS-SSIM you still pay for VMAF. Dedicated tools take a different route: Netflix's standalone vmaf CLI exposes the same float_ms_ssim feature, and the MSU VQMT tool computes MS-SSIM directly. The full FFmpeg-plus-libvmaf measurement workflow, with model selection and log parsing, is covered in measuring quality with FFmpeg and libvmaf.
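Pulling the numbers out of the log takes only a few lines of Python. Treat the sketch below as an assumption about the file layout, not a specification: it expects a top-level `frames` list whose items carry a `metrics` dict, plus a `pooled_metrics` dict keyed by feature name, which is what recent libvmaf JSON logs look like to us. Check your own `out.json` before relying on it. The sample values are invented for illustration.

```python
import json


def read_ms_ssim(log):
    """Extract per-frame and pooled float_ms_ssim values from a parsed log.

    Assumption: the log has a "frames" list whose items carry a "metrics"
    dict, plus a "pooled_metrics" dict keyed by feature name.
    """
    per_frame = [
        frame["metrics"]["float_ms_ssim"]
        for frame in log.get("frames", [])
        if "float_ms_ssim" in frame.get("metrics", {})
    ]
    pooled = log.get("pooled_metrics", {}).get("float_ms_ssim", {})
    return per_frame, pooled.get("mean")


# Illustrative, made-up log content in the assumed layout (not real output).
sample = json.loads("""
{
  "frames": [
    {"frameNum": 0, "metrics": {"float_ms_ssim": 0.981, "vmaf": 93.2}},
    {"frameNum": 1, "metrics": {"float_ms_ssim": 0.975, "vmaf": 91.8}}
  ],
  "pooled_metrics": {"float_ms_ssim": {"mean": 0.978}}
}
""")

per_frame, pooled_mean = read_ms_ssim(sample)
print(per_frame, pooled_mean)

# For a real run, load the file written by log_path instead:
# with open("out.json") as fh:
#     per_frame, pooled_mean = read_ms_ssim(json.load(fh))
```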
To make the five-scale machine concrete at your desk, we built a small, dependency-light script that computes MS-SSIM between two image or frame files from first principles — the per-scale cs terms, the luminance term at the coarsest scale, the calibrated weights, and the final weighted product, printing every scale so you can watch the score assemble exactly as in Table 2. Download the from-scratch MS-SSIM calculator (Python) and run it on your own frames, or use its --demo mode to see the fine-grain-noise case.
The minimum-size trap
The second trap follows directly from the pyramid. Because MS-SSIM downsamples four times — a factor of 16 — and still needs an 11×11 window at the coarsest scale, the shorter side of your frame must be at least about 176 pixels (implementations differ slightly; some bottom out around 161). Feed MS-SSIM a thumbnail, a small crop, or a heavily letterboxed region narrower than that and the coarsest scale becomes smaller than the window, and the metric is either undefined or silently falls back to fewer scales. This is a real problem when measuring tiny UI thumbnails or sprite tiles, and a non-issue for any normal video resolution. The companion script handles it explicitly: below the threshold it reduces the scale count and renormalizes the weights, and tells you it did so.
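The floor can be checked before any metric runs. The sketch below is not the companion script: it is a hypothetical pre-flight check that counts how many pyramid levels still fit an 11×11 window, and it assumes one possible fallback (keep the finest scales' weights and rescale them to sum to 1). Real tools may behave differently.

```python
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]


def usable_scales(short_side, window=11, max_scales=5):
    """Count pyramid levels whose short side still fits one window."""
    scales = 0
    side = short_side
    while scales < max_scales and side >= window:
        scales += 1
        side //= 2  # each level halves the side
    return scales


def renormalized_weights(n_scales, weights=WEIGHTS):
    """Keep the first n_scales weights and rescale them to sum to 1.

    Assumption: truncate-and-renormalize is one possible fallback;
    tools may handle undersized frames differently.
    """
    kept = weights[:n_scales]
    total = sum(kept)
    return [w / total for w in kept]


for side in (1080, 176, 175, 64):
    n = usable_scales(side)
    print(side, n, [round(w, 4) for w in renormalized_weights(n)])
```

With floor halving, 176 pixels is the smallest short side that still supports all five scales; one pixel less drops the count to four.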
Where MS-SSIM still lies
MS-SSIM is a better perceptual proxy than single-scale SSIM, not a perfect one. Naming its blind spots is the price of using it well, and several it inherits straight from SSIM.
First, it is still luma-only in practice. Most pipelines run MS-SSIM on the brightness channel and ignore color, so a chroma-only artifact — color bleeding, a chroma-subsampling problem — can leave the score almost untouched while the picture clearly looks wrong. The multi-scale machinery changes nothing about that; if color is the concern, MS-SSIM on luma will not catch it.
Second, it is spatial and per-frame. MS-SSIM scores each frame on its own, so it is blind to temporal problems — judder, flicker, a stutter in motion — that only exist across frames. A video that scores beautifully frame by frame can still play badly. Multi-scale fixes the spatial-scale blind spot, not the time axis.
Third, it still saturates near 1, more so than SSIM, because the coarse scales pull good content even closer to the ceiling. Small differences up there are hard to read without the decibel form.
Fourth, it can be gamed. An encoder tuned to maximize MS-SSIM — many support --tune ssim-style modes — can learn to please the metric in ways that do not please the eye, for instance by preserving exactly the mid-scale structure the weights reward while letting other quality slip. A metric used as an optimization target stops being an honest judge of that target. Note, too, that the popular AV1 encoders (libaom, SVT-AV1) tune for PSNR by default, not MS-SSIM, so a benchmark's tuning choice changes its results.
Fifth, the score depends on the implementation. Just as there is no single SSIM, there is no single MS-SSIM: the low-pass filter, the downsampling rule, and the window all vary between libvmaf, MATLAB reference code, and library ports. The same clip yields slightly different MS-SSIM numbers from different tools. The rule is strict — only compare MS-SSIM computed by the same tool, the same version, and the same settings, and name them when you report ("libvmaf 3.x float_ms_ssim", not just "MS-SSIM 0.98"). Where each objective metric misleads, content by content, is catalogued in where objective metrics lie.
Common mistakes with MS-SSIM
Mistake: comparing an MS-SSIM number against a single-scale SSIM number. MS-SSIM reads systematically higher because the coarse scales pull the weighted product up. They are different metrics on different effective ranges — never put them in the same column or trend line.
Mistake: expecting FFmpeg's ssim filter to give MS-SSIM. It is single-scale only. MS-SSIM comes from libvmaf as the float_ms_ssim feature (FFmpeg ≥ 5.1), or from VQMT or the standalone vmaf tool.
Mistake: running MS-SSIM on frames smaller than ~176 px on the short side. Four downsamples plus an 11×11 window run out of pixels; the metric is undefined or silently drops scales. Check the resolution before trusting the number.
Mistake: treating MS-SSIM as ground truth. Like every objective metric, it is a proxy validated against human Mean Opinion Scores using the statistics in ITU-T P.1401 (Pearson and Spearman correlation, RMSE). When MS-SSIM and a careful subjective test disagree, the eye wins and the metric failed on that content.
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and MS-SSIM is the structure metric we reach for when an encode has to hold up across the range of screens our users actually watch on, from a phone to a conference-room display. We use it where it is strong: as the display-size-robust objective check in a codec or per-title comparison, with the tool and version named on every number. We are strict about its limits — we never compare it against single-scale SSIM, we watch the minimum-resolution floor, and we move to VMAF with its model stated when the question is "how good does this look to a streaming viewer." Our benchmark methodology records which metric produced every figure and why it was the right one for that comparison.
What to read next
- SSIM explained: structural similarity and why it beats PSNR
- VMAF explained: Netflix's perceptual metric
- Choosing the right metric for the job
References
- Z. Wang, E. P. Simoncelli, A. C. Bovik, "Multiscale Structural Similarity for Image Quality Assessment," Proc. 37th IEEE Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1398–1402, 2003. Tier 1 (metric-author defining paper). The source for the five-scale system, the combination formula (Eq. 7), the image-synthesis calibration, the per-scale weights (0.0448, 0.2856, 0.3001, 0.2363, 0.1333), and the LIVE-database correlation results. https://www.cns.nyu.edu/pub/eero/wang03b.pdf
- Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004. Tier 1 (metric-author defining paper). The single-scale SSIM that MS-SSIM extends: the luminance/contrast/structure terms, the 11×11 Gaussian window, and the K1/K2 constants. https://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
- Recommendation ITU-T P.1401 (01/2020), Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunication Union. Tier 1. Defines PCC, SROCC, and RMSE — how MS-SSIM and every objective metric is graded against subjective MOS. https://www.itu.int/rec/T-REC-P.1401
- Recommendation ITU-R BT.500-15 (2023), Methodologies for the subjective assessment of the quality of television images. International Telecommunication Union. Tier 1. The subjective ground truth against which objective metrics like MS-SSIM are validated. https://www.itu.int/rec/R-REC-BT.500
- FFmpeg, libvmaf filter documentation (the float_ms_ssim feature), accessed 2026-06-23. Tier 3 (first-party tooling). MS-SSIM is exposed through libvmaf, not the single-scale ssim filter; libvmaf also computes VMAF in the same pass. https://ffmpeg.org/ffmpeg-filters.html#libvmaf
- FFmpeg, ssim filter documentation, accessed 2026-06-23. Tier 3 (first-party tooling). Confirms the built-in ssim filter is single-scale only, the basis for the "FFmpeg ssim is not MS-SSIM" warning. https://ffmpeg.org/ffmpeg-filters.html#ssim
- Netflix, vmaf: VMAF Python library and float_ms_ssim feature, GitHub, accessed 2026-06-23. Tier 3 (first-party tooling / standards-adjacent implementation). The reference implementation that computes MS-SSIM as a libvmaf feature; documents the model/feature interface. https://github.com/Netflix/vmaf
- A. K. Venkataramanan, C. Wu, A. C. Bovik, et al., "A Hitchhiker's Guide to Structural Similarity," IEEE Access, vol. 9, 2021. Tier 5 (peer-reviewed, metric author's lab). Documents how SSIM/MS-SSIM implementation choices (window, downsampling, constants) change the score, the basis for the "no single MS-SSIM" section. https://live.ece.utexas.edu/publications/2021/Hitchiker_SSIM_Access.pdf
- J. Ozer, "SVT-AV1 and Libaom Tune for PSNR by Default," Streaming Learning Center, 2024. Tier 6 (industry-educational). Supports the note that AV1 encoders default to PSNR tuning, not MS-SSIM, so a benchmark's tuning choice changes its results. https://streaminglearningcenter.com/articles/svt-av1-and-libaom-tune-for-psnr-by-default.html
- B. A. Wandell, Foundations of Vision. Sinauer Associates, 1995. Tier 5 (institutional reference). The contrast sensitivity function the MS-SSIM weights are compared against (peak sensitivity at middle spatial frequencies). Cited in the 2003 paper as ref. [7].


