Why this matters

If PSNR is the first quality number you meet, SSIM is the first one that was built to match the eye. It shows up in FFmpeg output, in encoder logs, in --tune ssim modes, in academic tables, and in every "PSNR vs SSIM" debate you will ever read — usually with no explanation of what it measures or why two tools report different SSIM values for the same clip. This article gives you that explanation: the structural insight behind the metric, the arithmetic shown once, the score range to expect, and the content where SSIM lies. It is the bridge between pixel-error metrics and the perception-trained metrics that follow it, VMAF above all.

What SSIM actually measures

Start with the idea that makes SSIM different, because everything else follows from it. PSNR measures error: how far each pixel drifted from the original, summed up. SSIM measures structure: whether the patterns in the picture — edges, textures, the relationships between neighboring pixels — are still intact.

The difference matters because of how human vision works. Your eye did not evolve to count pixel errors; it evolved to extract structure from a scene — to find the edge of the tiger against the grass. The authors of SSIM built that insight into the metric directly: under the assumption that vision is highly adapted to read structural information, they proposed measuring quality as the degradation of structure rather than the visibility of errors (Wang, Bovik, Sheikh, and Simoncelli, IEEE Transactions on Image Processing, 2004).

Here is the analogy to carry. If PSNR is a spell-checker counting wrong letters, SSIM reads the shape of the writing. A document with ten typos scattered in the footnotes and a document with ten typos that garble the headline have the same error count, but only one is still readable. SSIM is built to notice the difference: it asks whether the meaning — the structure — survived, not just how many characters changed.

Like PSNR, SSIM needs the pristine original to compare against, frame for frame. That makes it a full-reference metric: one that scores the impaired video against the complete uncompressed master. If you do not have the original on disk, you cannot compute SSIM at all. That is the same constraint that rules out full-reference metrics for live and user-generated content (the three reference setups are covered in full-reference, reduced-reference, no-reference metrics).

The three comparisons: luminance, contrast, structure

SSIM does not score a whole frame at once. It slides a small window across the image (an 11×11 patch of pixels in the original paper) and at every position it compares the original window x against the distorted window y on three separate questions. Each question becomes a number that equals 1 for a perfect match and falls as the windows diverge, and the three are multiplied together.

The three comparisons need only three kinds of simple statistics for each pair of windows: the mean (average brightness, written μ), the variance (how much the pixels spread around that average, written σ², with σ the standard deviation), and the covariance between the two windows (how much they vary together, written σ_xy). Everything below is built from those.

Luminance comparison — did the brightness match? This compares the mean brightness of the two windows:

l(x,y) = ( 2·μx·μy + C1 ) / ( μx² + μy² + C1 )

It returns 1 when the two windows have the same average brightness and falls toward 0 as they drift apart. C1 is a small constant that keeps the formula stable when the means are near zero (more on the constants below).

Contrast comparison — did the range of light and dark match? This compares the variances — the spread of pixel values — so it catches a picture that has gone flat and washed out, or harsh and clipped:

c(x,y) = ( 2·σx·σy + C2 ) / ( σx² + σy² + C2 )

It returns 1 when both windows have the same contrast and drops as one becomes flatter or harsher than the other.

Structure comparison — did the pattern survive? This is the heart of SSIM. After removing brightness and contrast, what is left is the shape of the signal — the pattern of light and dark relative to the local average. The structure term measures how well those patterns line up, using the covariance:

s(x,y) = ( σ_xy + C3 ) / ( σx·σy + C3 )

This is, in effect, the correlation between the two windows once luminance and contrast are factored out. It returns 1 when the patterns match perfectly, drops toward 0 when they decorrelate, and can even go negative when the structure inverts. This is the term that notices blocking, ringing, and blur — the distortions that rearrange a picture's structure rather than merely shifting its brightness.
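The three comparisons can be sketched in a few lines of Python. This is a minimal illustration, not any tool's implementation: it treats each window as a flat list of 8-bit pixel values, uses plain unweighted statistics, and the names `window_stats` and `ssim_terms` are invented for this example.

```python
from math import sqrt

# Constants for 8-bit video (L = 255, K1 = 0.01, K2 = 0.03).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2
C3 = C2 / 2


def window_stats(x, y):
    """Return (mean_x, mean_y, var_x, var_y, cov_xy) for two equal-length windows."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cxy


def ssim_terms(x, y):
    """Return the luminance, contrast and structure terms for two windows."""
    mx, my, vx, vy, cxy = window_stats(x, y)
    sx, sy = sqrt(vx), sqrt(vy)
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)
    c = (2 * sx * sy + C2) / (vx + vy + C2)
    s = (cxy + C3) / (sx * sy + C3)
    return l, c, s


if __name__ == "__main__":
    ref = [50, 200, 50, 200, 50, 200, 50, 200]   # strong alternating pattern
    same = list(ref)                              # identical window
    inverted = [250 - p for p in ref]             # same pattern, flipped
    print(ssim_terms(ref, same))       # all three terms equal 1.0
    print(ssim_terms(ref, inverted))   # structure term is negative
```

The inverted window keeps the same mean and spread, so luminance and contrast stay at 1 while the structure term alone goes negative.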

The combined index. The three comparisons are multiplied, each raised to a weight (α, β, γ) that sets its importance:

SSIM(x,y) = [ l(x,y) ]^α · [ c(x,y) ]^β · [ s(x,y) ]^γ

In the standard form everyone uses, all three weights equal 1, and with the convenient choice C3 = C2/2 the whole thing collapses into one compact formula:

                ( 2·μx·μy + C1 )( 2·σ_xy + C2 )
SSIM(x,y) = ----------------------------------------
              ( μx² + μy² + C1 )( σx² + σy² + C2 )

That single expression is the SSIM you will see quoted in papers and computed in tools. But the three-part version is the one worth holding in your head, because it tells you which kind of damage hurt the score: a low luminance term means the brightness shifted, a low contrast term means the picture flattened, and a low structure term means the patterns themselves broke.
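The collapse into the compact formula is one step of algebra. With C3 = C2/2, multiplying the top and bottom of the structure term by 2 makes its denominator equal to the numerator of the contrast term, so the two cancel. Written out in LaTeX notation, using the symbols defined above:

```latex
\begin{aligned}
c(x,y)\,s(x,y)
  &= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
     \cdot \frac{\sigma_{xy} + C_2/2}{\sigma_x\sigma_y + C_2/2} \\
  &= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
     \cdot \frac{2\sigma_{xy} + C_2}{2\sigma_x\sigma_y + C_2} \\
  &= \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
\end{aligned}
```

Multiplying that result by the luminance term l(x,y) gives the single expression for SSIM(x,y).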

[Figure: SSIM anatomy. A reference and a distorted window feed three comparisons (luminance, contrast, and structure), which multiply into one SSIM score between 0 and 1.]

Figure 1. SSIM splits quality into three questions (did brightness, contrast, and structure survive?), then multiplies the answers. The structure term is what makes SSIM perceptual.

The constants, and the window

Two details turn the formulas above into the numbers a tool actually prints. First, the stabilizing constants C1 and C2 exist only to stop the fractions from blowing up when a window is nearly flat (means or variances near zero). They are defined from the pixel range L — 255 for 8-bit video — and two small default factors, K1 = 0.01 and K2 = 0.03:

C1 = (K1·L)² = (0.01·255)² = 6.5025
C2 = (K2·L)² = (0.03·255)² = 58.5225
C3 = C2 / 2  = 29.26
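A small helper makes the dependence on the pixel range explicit. This is a sketch under one assumption that goes beyond the 8-bit case given here: that L is the full-range peak value 2^bits − 1, so 1023 for 10-bit video. The function name `ssim_constants` is hypothetical.

```python
def ssim_constants(bit_depth=8, k1=0.01, k2=0.03):
    """Return (C1, C2, C3) for a given pixel bit depth."""
    peak = 2 ** bit_depth - 1      # L: 255 for 8-bit; assumed 1023 for 10-bit
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2
    return c1, c2, c2 / 2


if __name__ == "__main__":
    for depth in (8, 10):
        c1, c2, c3 = ssim_constants(depth)
        print(f"{depth}-bit: C1={c1:.4f}  C2={c2:.4f}  C3={c3:.4f}")
```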

Second, the window. The original paper does not use a hard-edged square; it weights each 11×11 window with a circular Gaussian (standard deviation 1.5 pixels) so that pixels near the center count more, which avoids blocky artifacts in the resulting map. The window slides one pixel at a time across the whole image. Hold onto the fact that the window and the constants are choices, not laws of nature — different tools choose differently, and that is the source of SSIM's most common trap, covered below.
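To see what the Gaussian weighting looks like, here is a minimal sketch that builds an 11×11 window with standard deviation 1.5 and normalizes it to sum to 1. The normalization and the helper names `gaussian_window` and `weighted_mean` are choices made for this illustration; a real implementation would apply the same weights to the variance and covariance as well.

```python
from math import exp


def gaussian_window(size=11, sigma=1.5):
    """Return a size x size list of weights, normalized to sum to 1."""
    half = size // 2
    raw = [
        [exp(-((r - half) ** 2 + (c - half) ** 2) / (2 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[w / total for w in row] for row in raw]


def weighted_mean(window, weights):
    """Gaussian-weighted mean of a 2-D window (same shape as weights)."""
    return sum(
        p * w
        for prow, wrow in zip(window, weights)
        for p, w in zip(prow, wrow)
    )


if __name__ == "__main__":
    w = gaussian_window()
    print(f"center weight: {w[5][5]:.4f}")
    print(f"corner weight: {w[0][0]:.2e}")
    flat = [[128] * 11 for _ in range(11)]
    print(weighted_mean(flat, w))   # approximately 128 for a flat patch
```

The corner weights come out orders of magnitude smaller than the center weight, which is why the window behaves like a soft disc rather than a hard square.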

A worked example: see the three terms resolve

Numbers make this concrete. Take one window of the original and the matching window of a compressed version, and suppose we have measured these statistics:

  • Original window: mean μx = 120, standard deviation σx = 20 (variance 400).
  • Distorted window: mean μy = 115, standard deviation σy = 18 (variance 324).
  • Covariance between them: σ_xy = 330.

Plug into the three comparisons, using C1 = 6.5025, C2 = 58.5225, C3 = 29.26:

luminance  l = (2·120·115 + 6.5025) / (120² + 115² + 6.5025)
             = 27606.5 / 27631.5  = 0.999

contrast   c = (2·20·18 + 58.5225) / (400 + 324 + 58.5225)
             = 778.5 / 782.5      = 0.995

structure  s = (330 + 29.26) / (20·18 + 29.26)
             = 359.26 / 389.26    = 0.923

Multiply them:

SSIM = 0.999 · 0.995 · 0.923 = 0.917

Read what the arithmetic is telling you. The brightness barely moved (luminance 0.999) and the contrast held up well (0.995), but the structure term, 0.923, is doing almost all the damage: the patterns in the window shifted enough that the eye would notice. A single combined SSIM of 0.917 would not tell you why the window looks worse; the three-term breakdown says plainly that this is a structural problem, not a brightness one. That diagnostic power is a quiet advantage of SSIM over a single pixel-error number.
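If you want to check the arithmetic yourself, the worked example fits in a short script. The statistics are the ones given above; the variable names are just for this sketch.

```python
# Constants for 8-bit video (K1 = 0.01, K2 = 0.03, L = 255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2
C3 = C2 / 2

# Window statistics from the worked example.
mu_x, sigma_x = 120.0, 20.0
mu_y, sigma_y = 115.0, 18.0
sigma_xy = 330.0

l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
c = (2 * sigma_x * sigma_y + C2) / (sigma_x ** 2 + sigma_y ** 2 + C2)
s = (sigma_xy + C3) / (sigma_x * sigma_y + C3)
ssim = l * c * s

print(f"luminance {l:.3f}")     # 0.999
print(f"contrast  {c:.3f}")     # 0.995
print(f"structure {s:.3f}")     # 0.923
print(f"SSIM      {ssim:.3f}")  # 0.917
```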

From one window to one number: the SSIM map and mean SSIM

The window gives you a local SSIM at every position, which means the metric does not produce one number — it produces a whole map, one SSIM value per pixel location. That map is useful in its own right: bright where structure survived, dark where it broke, it shows you where in the frame the damage landed, which a single dB figure never can.
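A toy version of the map is easy to build. The sketch below slides a small unweighted square window one pixel at a time over two tiny synthetic frames and records the compact SSIM at each position. The 4×4 window, the synthetic checkerboard, and the names `ssim_window` and `ssim_map` are all assumptions chosen to keep the example short. This is not the paper's Gaussian window and not FFmpeg's implementation.

```python
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def ssim_window(x, y):
    """Compact SSIM for two flat lists of pixel values (unweighted)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2))


def ssim_map(ref, dist, size=4):
    """Slide a size x size window one pixel at a time; return a 2-D list of SSIM values."""
    rows, cols = len(ref), len(ref[0])
    out = []
    for r in range(rows - size + 1):
        line = []
        for c in range(cols - size + 1):
            x = [ref[r + i][c + j] for i in range(size) for j in range(size)]
            y = [dist[r + i][c + j] for i in range(size) for j in range(size)]
            line.append(ssim_window(x, y))
        out.append(line)
    return out


if __name__ == "__main__":
    # Synthetic 8 x 8 checkerboard; the distorted copy flattens the top-left corner.
    ref = [[200 if (r + c) % 2 else 50 for c in range(8)] for r in range(8)]
    dist = [row[:] for row in ref]
    for r in range(4):
        for c in range(4):
            dist[r][c] = 125
    for line in ssim_map(ref, dist):
        print(" ".join(f"{v:.2f}" for v in line))
```

The printed grid is low in the corner where the pattern was wiped out and reaches 1.00 where the frame is untouched, which is exactly the "where did it break" view the map is good for.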

To get the one number people quote, you average the map. The mean of all the local SSIM values is the Mean SSIM, often written MSSIM, and it is what a tool reports as "the SSIM" of a frame. Averaging is a pooling choice, and like every pooling choice it can hide things: a frame that is pristine everywhere except one badly broken region can still post a high mean SSIM, because the good windows outvote the bad one. When a localized artifact matters, look at the map or a low percentile, not just the mean — the same lesson that applies to every metric, covered in pooling: per-frame to one number.
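The pooling effect is easy to reproduce with made-up numbers. The sketch below uses a hypothetical 10×10 SSIM map that is pristine except for a small broken patch, then compares the mean with a low percentile and the minimum. The nearest-rank percentile helper is one simple definition among several.

```python
from math import ceil
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (pct from 0 to 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical 10 x 10 SSIM map: 0.99 everywhere except a 2 x 3 broken patch.
    ssim_map = [[0.99] * 10 for _ in range(10)]
    for r in (4, 5):
        for c in (4, 5, 6):
            ssim_map[r][c] = 0.40

    flat = [v for row in ssim_map for v in row]
    print(f"mean SSIM      : {mean(flat):.4f}")
    print(f"5th percentile : {percentile(flat, 5):.2f}")
    print(f"minimum        : {min(flat):.2f}")
```

The mean stays above 0.95 while the 5th percentile and the minimum both land on the broken patch, so the low-end statistics flag a failure the average smooths over.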

[Figure: From window to map to number. SSIM slides a window across the frame to build an SSIM map, then averages it into one Mean SSIM score.]

Figure 2. SSIM is computed per window, producing a map that shows where structure broke. The mean of that map is the single MSSIM score, and the mean can hide a local failure.

Reading an SSIM score: the 0-to-1 scale and the dB form

SSIM runs from 0 to 1 for real content (the math allows down to −1 when structure inverts, but you will rarely see negative values on ordinary compression). Higher is better; 1 means identical. The rough, content-dependent bands are worth knowing: above about 0.99 the difference is near-invisible to most viewers; 0.95 to 0.99 is the normal range for good streaming-grade compression; 0.90 to 0.95 starts to show visible degradation on attentive viewing; below about 0.90 the damage is usually obvious. Treat those as orientation, not law — what a given SSIM "looks like" depends on the content and the viewing conditions, and a properly run subjective test is the only ground truth.

There is a second wrinkle: because high-quality content crowds into a narrow band near 1.0, small SSIM differences up there are hard to read. The fix tools use is a decibel form, which stretches that crowded top end:

SSIM (dB) = 10 · log₁₀ ( 1 / (1 − SSIM) )

FFmpeg prints this alongside the linear score. The shape is easy to remember: every extra "nine" adds 10 dB. An SSIM of 0.9 is 10 dB, 0.99 is 20 dB, 0.999 is 30 dB. Our worked example of 0.917 is 10·log₁₀(1 / 0.083) = 10.8 dB. The dB form does not add information; it just makes differences near the ceiling easier to see and compare.
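The conversion is a one-liner. In the sketch below, returning infinity for a score of exactly 1 is a choice made for this example, since the formula divides by zero at that point.

```python
from math import inf, log10


def ssim_to_db(ssim):
    """Convert a linear SSIM score to the decibel form."""
    if ssim >= 1.0:
        return inf          # identical frames: the formula has no finite value
    return 10 * log10(1 / (1 - ssim))


if __name__ == "__main__":
    for score in (0.9, 0.917, 0.99, 0.999):
        print(f"SSIM {score:<5} -> {ssim_to_db(score):.1f} dB")
```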

[Figure: The SSIM scale from 0 to 1 with quality bands, the worked example at 0.917, and the decibel conversion where each nine adds 10 dB.]

Figure 3. The SSIM scale. Bands are content-dependent rules of thumb; the dB form stretches the crowded top end so 0.99 and 0.999 are easy to tell apart.

Why SSIM beats PSNR

Now the headline claim, made precise. SSIM beats PSNR at the one job that matters — predicting how good a video looks to a person — for a specific, structural reason.

Recall PSNR's blind spot: it sums pixel error and has no idea where the error landed or what it did to the picture. Two distortions with the same total error get the same PSNR even if one is an invisible haze and the other wrecks a face. SSIM closes that gap because its structure term reacts to what kind of damage occurred, not just how much. Blur, blocking, and ringing all rearrange local structure, so they drive the structure term down hard — exactly as the eye reacts. A small, even brightness shift barely touches structure, so SSIM forgives it — again, as the eye does.
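You can reproduce the blind spot with sixteen pixels. The sketch below applies two distortions with identical mean squared error to the same synthetic texture: a uniform brightness shift, and a patterned error that disturbs the texture. The data are made up, the window is unweighted, and `ssim_window` is a hypothetical helper using the compact formula.

```python
from math import log10

C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def mse(x, y):
    """Mean squared error between two equal-length pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, peak=255):
    """PSNR in dB for 8-bit data (undefined when the inputs are identical)."""
    return 10 * log10(peak ** 2 / mse(x, y))


def ssim_window(x, y):
    """Compact SSIM over one unweighted window."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2))


if __name__ == "__main__":
    ref = [95, 115] * 8                              # fine texture
    shifted = [p + 10 for p in ref]                  # uniform brightness shift
    error = [10, 10, -10, -10] * 4                   # same-size error, patterned
    scrambled = [p + e for p, e in zip(ref, error)]  # texture disturbed

    for name, dist in (("shifted", shifted), ("scrambled", scrambled)):
        print(f"{name:9s} PSNR {psnr(ref, dist):.2f} dB  "
              f"SSIM {ssim_window(ref, dist):.3f}")
```

Both rows print the same PSNR because the squared error is identical, while the SSIM for the disturbed texture comes out far lower than for the brightness shift.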

The payoff shows up when you grade each metric against human opinion. Every objective metric is a proxy whose worth is measured by how well it correlates with subjective Mean Opinion Scores, using the statistics defined in ITU-T P.1401 (Pearson correlation, Spearman rank correlation, RMSE). On that test SSIM consistently outranks PSNR: across standard image-quality databases SSIM tracks human scores more closely than PSNR does, which is the entire reason it was adopted. The honest caveat is that "more closely" is database-dependent and that newer perception-trained metrics like VMAF beat SSIM in turn — the progression is PSNR measures error, SSIM measures structure, VMAF learns directly from human scores, each a better proxy than the last. How metrics are validated against the eye is the subject of validating metrics against human scores.

[Figure: Why SSIM beats PSNR. The same two distortions that score identical PSNR get different SSIM because the structure term reacts to structural damage.]

Figure 4. PSNR sums error and cannot separate a harmless haze from structural damage; SSIM's structure term can, which is why it correlates better with the eye.

How to compute SSIM with FFmpeg

You will almost never code the formula yourself. The everyday tool is FFmpeg, whose ssim filter takes two inputs — the distorted clip first, the reference second — and prints the per-component and overall SSIM, plus the dB form:

# SSIM between a distorted encode and its reference master.
# Prints SSIM Y, U, V, All and All in dB to the console.
ffmpeg -i distorted.mp4 -i reference.mp4 \
  -lavfi ssim -f null -

To capture a value per frame — useful for finding the worst moments, not just the average — write a stats file:

# Per-frame SSIM written to ssim.log (n, Y, U, V, All, dB).
ffmpeg -i distorted.mp4 -i reference.mp4 \
  -lavfi "ssim=stats_file=ssim.log" -f null -
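Once the stats file exists, a few lines of Python can pull out the worst frames. This sketch assumes each line looks like `n:1 Y:0.937213 U:0.961733 V:0.945788 All:0.948245 (12.860441)`. Check that against the output of your FFmpeg build before relying on it. The sample lines and the helper names are invented for the example.

```python
import re

# Assumed stats-file layout: "n:<frame> Y:<v> U:<v> V:<v> All:<v> (<dB>)".
LINE_RE = re.compile(r"n:(\d+)\s.*?All:([0-9.]+)")


def parse_ssim_log(lines):
    """Return a list of (frame_number, all_ssim) tuples from stats-file lines."""
    frames = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            frames.append((int(match.group(1)), float(match.group(2))))
    return frames


def worst_frames(frames, count=3):
    """Return the `count` frames with the lowest overall SSIM."""
    return sorted(frames, key=lambda f: f[1])[:count]


if __name__ == "__main__":
    # Made-up sample lines; with a real run, pass open("ssim.log") instead.
    sample = [
        "n:1 Y:0.981200 U:0.990100 V:0.989500 All:0.984100 (17.986000)",
        "n:2 Y:0.912300 U:0.975000 V:0.971000 All:0.932500 (11.707000)",
        "n:3 Y:0.978800 U:0.989900 V:0.988700 All:0.982300 (17.520000)",
    ]
    for n, score in worst_frames(parse_ssim_log(sample), count=1):
        print(f"worst frame: n={n}  All={score:.4f}")
```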

The command encodes the full-reference contract: it will not run without both files. If you want PSNR, SSIM, and VMAF in a single pass, the libvmaf filter can emit all three — that workflow, with model selection and output parsing, is covered in measuring quality with FFmpeg and libvmaf, and the encoder-side quick reference lives in Video Encoding's FFmpeg cheat sheet.

To make the math concrete at your desk, we built a small, dependency-light script that computes SSIM between two image or frame files from first principles — the three comparisons, the windowing, the SSIM map, and the mean — printing the luminance, contrast, and structure terms so you can see exactly how the score is built. Download the from-scratch SSIM calculator (Python) and run it against your own frames.

The implementation trap: SSIM is not one number

Here is the mistake that bites teams most often, and it follows directly from the window and constants being choices. There is no single "SSIM" — there is a family of implementations that disagree. The original paper uses an 11×11 Gaussian window sliding one pixel at a time. FFmpeg's filter, built for speed on video, uses an 8×8 window with a stride of 4 pixels. Other libraries default to other windows, other constants, color spaces, and downsampling rules. Feed the same two clips to two implementations and you will get two different SSIM numbers — sometimes different in the second decimal, occasionally more.

The practical rule is strict: an SSIM comparison is only valid when every number came from the same tool, the same version, the same settings, and the same color component. Compare your encode's FFmpeg SSIM against another encode's scikit-image SSIM and the comparison is meaningless. It is the same apples-to-apples discipline PSNR demands. When you report SSIM, name the tool and version ("FFmpeg 7.x ssim filter, luma"), not just "SSIM 0.98".

Where SSIM still lies

SSIM is a better perceptual proxy than PSNR, not a perfect one. Naming its blind spots is the price of using it well.

First, it is usually computed on luma only. Most pipelines run SSIM on the brightness channel and ignore color, so a chroma-only artifact — color bleeding, a chroma-subsampling problem — can leave the SSIM you quote almost untouched while the picture clearly looks wrong. If color is the concern, SSIM on luma will not catch it.

Second, it is spatial and per-frame. Plain SSIM scores each frame on its own, so it is blind to temporal problems — judder, flicker, a stutter in motion — that only exist across frames. A video that scores beautifully frame by frame can still play badly.

Third, it saturates near 1. Once content is good, SSIM crowds into a thin band near 1.0 where small differences are hard to interpret, which is part of why the dB form and the perception-trained metrics exist.

Fourth, it is single-scale. SSIM at one window size assumes one viewing distance and display resolution. Change those and the "right" scale changes. The fix is MS-SSIM, which evaluates structure across several scales.

Fifth, mean pooling hides local failures, as the SSIM map already showed. And like any objective metric, SSIM can be gamed: an encoder tuned to maximize SSIM can learn to please the metric in ways that do not please the eye. Where each objective metric misleads, content by content, is catalogued in where objective metrics lie.

MS-SSIM in one paragraph

Because plain SSIM measures structure at a single scale, the natural extension is to measure it at several. MS-SSIM (multi-scale SSIM) runs the contrast and structure comparisons on the image, then repeatedly low-pass-filters and downsamples it — five scales in the standard design — and combines the per-scale results with fixed weights (0.0448, 0.2856, 0.3001, 0.2363, 0.1333) that were tuned against human judgments (Wang, Simoncelli, and Bovik, 2003). The result is more stable across viewing distance and display resolution than single-scale SSIM, and it is what most modern pipelines reach for when they want a structure-based number. The full treatment, including when MS-SSIM is worth the extra cost, is in MS-SSIM: multi-scale structural similarity.
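The final combination step can be sketched on its own. This minimal example assumes you already have one combined contrast-and-structure value per scale (finest first) and the luminance term from the coarsest scale, which is how the MS-SSIM paper arranges it. It also assumes every input is positive, because a negative value raised to a fractional power is not a real number. The function name `ms_ssim_combine` and the sample values are made up.

```python
# Per-scale weights from the standard five-scale design, finest scale first.
WEIGHTS = (0.0448, 0.2856, 0.3001, 0.2363, 0.1333)


def ms_ssim_combine(cs_per_scale, luminance_coarsest, weights=WEIGHTS):
    """Combine per-scale contrast*structure values with the coarsest-scale luminance."""
    if len(cs_per_scale) != len(weights):
        raise ValueError("need one contrast-structure value per scale")
    # The luminance term enters once, at the coarsest scale, with that scale's weight.
    score = luminance_coarsest ** weights[-1]
    for cs, w in zip(cs_per_scale, weights):
        score *= cs ** w
    return score


if __name__ == "__main__":
    # Illustrative values only, not measured from real footage.
    cs_values = [0.95, 0.97, 0.98, 0.99, 0.995]
    print(f"MS-SSIM: {ms_ssim_combine(cs_values, 0.999):.4f}")
```

Because the middle scales carry the largest weights, damage at those scales moves the result more than the same damage at the finest scale.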

PSNR vs SSIM vs VMAF: when to reach for which

SSIM is the middle of the three metrics you will meet constantly. The fastest way to hold them in your head is side by side, with the columns that actually matter: what each one measures, and where each one lies.

Metric | Scale | What it measures | Where it lies (blind spot) | Best for
-------|-------|------------------|----------------------------|---------
PSNR | dB (≈20–50, ∞ if identical) | Average pixel error vs the original | Ignores where errors are; same score for very different distortions | Same-codec comparison, RDO, regression gates, lossless checks
SSIM | 0–1 (higher better) | Structural similarity: luminance, contrast, structure | Luma-only, per-frame, single-scale; implementation-dependent | A better-than-PSNR perceptual proxy; per-frame structure; quality maps
VMAF | 0–100 (higher better) | Fused, perception-trained quality (model-dependent) | Meaningless without the model named; gameable by sharpening | Predicting viewer-perceived streaming quality at scale

Table 1. The three full-reference metrics at a glance. All need the original; they differ in how close they get to the eye and what they miss. Pick the row by the job, and always name the conditions — the tool and version for SSIM, the model for VMAF.

This is the deep dive on the middle row. The encoder-operator's one-screen version of all three lives in Video Encoding's quality metrics article; if you are choosing between metrics for a specific job, choosing the right metric is the decision guide.

Common mistakes with SSIM

Mistake: comparing SSIM across different tools or implementations. FFmpeg's 8×8 stride-4 SSIM and the paper's 11×11 Gaussian SSIM are different numbers for the same clip. Only compare SSIM computed by the same tool, version, settings, and component. "SSIM 0.98" with no tool named is ambiguous.

Mistake: trusting one mean SSIM. A high average can hide one badly broken region, because good windows outvote bad ones. When a localized artifact matters, read the SSIM map or a low percentile, not just the mean — pooling decides what the single number hides.

Mistake: quoting luma SSIM on a color problem. SSIM is usually computed on brightness only, so a chroma artifact can leave it almost unchanged. If color is the question, SSIM on luma is the wrong witness.

Mistake: treating SSIM as ground truth. SSIM is a proxy validated against human scores, not the truth itself. When SSIM and a careful subjective test disagree, the eye wins and the metric failed on that content.

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and SSIM is the metric we reach for when PSNR's pixel-counting is not enough but a full perceptual model is more than the question needs. We use it where it is strong: as a structure-aware check on an encode, and especially for its map, which shows where in a frame quality broke rather than just how much. We are strict about its limits — we name the tool and version on every SSIM number, never compare across implementations, and move to VMAF with its model stated when the question is "how good does this look to the viewer." Our benchmark methodology records which metric produced every figure and why it was the right one for that comparison.

References

  1. Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004. Tier 1 (metric-author defining paper). The paper that introduced SSIM, the three comparisons, the 11×11 Gaussian window, and the constants. https://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
  2. Z. Wang, E. P. Simoncelli, A. C. Bovik, "Multiscale Structural Similarity for Image Quality Assessment," Proc. 37th IEEE Asilomar Conference on Signals, Systems and Computers, 2003. Tier 1 (metric-author defining paper). Defines MS-SSIM, the five scales, and the per-scale weights. https://www.cns.nyu.edu/pub/eero/wang03b.pdf
  3. Recommendation ITU-T P.1401 (01/2020), Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunication Union. Tier 1. Defines PCC, SROCC, and RMSE — how SSIM and every objective metric is graded against subjective scores. https://www.itu.int/rec/T-REC-P.1401
  4. Recommendation ITU-R BT.500-15 (2023), Methodologies for the subjective assessment of the quality of television images. International Telecommunication Union. Tier 1. The subjective ground truth against which objective metrics like SSIM are validated. https://www.itu.int/rec/R-REC-BT.500
  5. FFmpeg, ssim filter documentation, accessed 2026-06-23. Tier 3 (first-party tooling). The ssim filter (two inputs, per-plane Y/U/V + All output, dB form, stats_file); FFmpeg uses an 8×8 window with stride 4, and ssim_db = 10·log10(1/(1−SSIM)). https://ffmpeg.org/ffmpeg-filters.html#ssim
  6. FFmpeg, libavfilter/vf_ssim.c source, accessed 2026-06-23. Tier 3 (first-party tooling). The reference implementation showing the 8×8 stride-4 window and the ssim_db conversion used in the article. https://github.com/FFmpeg/FFmpeg/blob/master/libavfilter/vf_ssim.c
  7. Q. Huynh-Thu and M. Ghanbari, "Scope of validity of PSNR in image/video quality assessment," Electronics Letters, vol. 44, no. 13, pp. 800–801, 2008. Tier 5 (peer-reviewed). The canonical critique of PSNR that motivates structure-based metrics like SSIM. https://doi.org/10.1049/el:20080522
  8. A. K. Venkataramanan, C. Wu, A. C. Bovik, et al., "A Hitchhiker's Guide to Structural Similarity," IEEE Access, vol. 9, 2021. Tier 5 (peer-reviewed, metric author's lab). Documents how SSIM implementation choices (window, downsampling, constants) change the score — the basis for the implementation-trap section. https://live.ece.utexas.edu/publications/2021/Hitchiker_SSIM_Access.pdf
  9. J. Nilsson and T. Akenine-Möller, "Understanding SSIM," arXiv:2006.13846, 2020. Tier 5. A careful study of SSIM's behavior and counter-intuitive cases; supports the "where SSIM lies" section. https://arxiv.org/abs/2006.13846
  10. Netflix Technology Blog, "Toward A Practical Perceptual Video Quality Metric" (VMAF) and the Netflix/vmaf repository, accessed 2026-06-23. Tier 4 / Tier 3 (credible deployer / metric-author tooling). Netflix's account of why PSNR and SSIM were insufficient for streaming, motivating VMAF. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
  11. Z. Wang, "SSIM Index for Image Quality Assessment" (author's reference page and MATLAB implementation), University of Waterloo, accessed 2026-06-23. Tier 1 / Tier 3 (metric author's own reference code). The canonical implementation and parameter defaults. https://ece.uwaterloo.ca/~z70wang/research/ssim/
  12. "Structural similarity index measure," Wikipedia, accessed 2026-06-23. Tier 6 (educational, orientation only). Source for the score-band rules of thumb and the dB form; all primary claims are cited to the papers and standards above. https://en.wikipedia.org/wiki/Structural_similarity_index_measure