Why this matters

The number you put in a report or a quality gate is not the metric's raw output — it is a summary of thousands of per-frame scores, and the summary method silently decides what the number means. Report a mean VMAF of 88 and the encode looks safe; report the 5th-percentile VMAF of the same encode and you might see 30, because one second of it fell apart. If you do not state how you pooled, two engineers measuring the identical file will disagree, and a regression gate can pass a release that visibly stutters. This article is for the video engineer, encoding lead, or QA engineer who reports VMAF or SSIM numbers and needs the one number to mean what they think it means. It assumes you have met the metrics already; the per-metric deep dives are PSNR explained, SSIM explained, and VMAF explained.

A metric scores every frame; you report one number

Start with what a full-reference metric actually emits. A full-reference metric — one that needs the pristine original to compare against, the setup explained in the three measurement setups — runs frame by frame. It lines up frame 1 of your compressed video against frame 1 of the original, produces a score, then does the same for frame 2, and so on to the end. A ten-second clip at 30 frames per second is 300 comparisons and 300 scores.

Nobody wants 300 numbers. A report, a dashboard, or a pass/fail gate needs one. The operation that collapses the per-frame scores into that one number is pooling — think of it as the grade a teacher gives for a whole course after marking every assignment. The catch is that a course grade can be computed many ways: average every assignment, drop the lowest, or fail anyone who bombed the final. Each rule is defensible, and each produces a different grade from the identical marks. Pooling video-quality scores is exactly this choice, and most teams never realize they made it.

That is the whole problem in one sentence: the per-frame scores are fixed, but the single number you report depends entirely on how you pool them. Change the pooling rule and the headline score moves, even though not a single pixel of the video changed.

[Image: diagram in which a metric scores each frame, spatial pooling makes a per-frame score, and temporal pooling makes one clip score.]

Figure 1. Pooling happens twice. Inside a frame, a per-pixel quality map is pooled into one frame score; across the clip, the per-frame scores are pooled into one number.

Pooling happens twice: across space, then across time

There are actually two pooling steps hiding in every metric, and naming both keeps you from confusing them. The first happens inside a single frame. SSIM — Structural Similarity, the metric that compares the structure of two images, covered in SSIM explained — does not produce one number per frame directly. It produces a quality map: a score for every small window of the picture. The familiar single-frame SSIM you have seen is the average of that map — spatial pooling by the arithmetic mean, defined in the original SSIM paper as "mean SSIM" (Wang, Bovik, Sheikh, Simoncelli, IEEE TIP, 2004).

The second step happens across frames, and it is the one this article is mostly about. Once each frame has a single score, you pool those per-frame scores along time into the one number for the clip. This is temporal pooling.

Both steps face the identical trap: an average smooths over the bad parts. A frame can be flawless across 95% of its area and badly broken in a small region — a block of smeared text, a banded patch of sky — and the mean of its quality map still reads "excellent" because the good area outvotes the bad. The clip-level number has the same weakness one level up: a few terrible frames vanish into a sea of good ones. The fix is the same at both levels, so once you understand temporal pooling, spatial pooling is the same idea applied to pixels instead of frames.
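The two stages can be sketched in a few lines. This is a minimal illustration, not how any particular metric is implemented: the quality maps are hypothetical four-window lists, and both stages use the plain arithmetic mean so the smoothing is easy to see.

```python
from statistics import fmean

def pool_clip(quality_maps):
    """Pool each frame's quality map into a frame score, then pool the frame scores."""
    # Spatial pooling: one mean per frame, taken over that frame's map.
    frame_scores = [fmean(window_scores) for window_scores in quality_maps]
    # Temporal pooling: one mean over all per-frame scores.
    return frame_scores, fmean(frame_scores)

# Three hypothetical frames, each with a four-window quality map.
maps = [
    [0.99, 0.99, 0.99, 0.99],
    [0.99, 0.99, 0.99, 0.59],  # one badly damaged window
    [0.99, 0.99, 0.99, 0.99],
]

frame_scores, clip_score = pool_clip(maps)
worst_window = min(min(m) for m in maps)
print(frame_scores, clip_score, worst_window)
```

The damaged window scores 0.59, the frame that contains it still averages about 0.89, and the clip averages above 0.95. Each averaging step moves the reported number further from the defect.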

The default is the arithmetic mean, and it averages away the truth

Here is the fact that surprises most engineers: the headline VMAF or SSIM number you have been quoting is almost certainly an arithmetic mean, because the mean is the default in every common tool. Netflix states it plainly — VMAF "reports the aggregate score as the average (i.e. arithmetic mean) of the per-frame scores mostly for its simplicity, as well as for consistency with other metrics (e.g. mean PSNR)" (Netflix VMAF FAQ, 2024). The arithmetic mean — add up every per-frame score and divide by the number of frames — weights every frame equally.

Equal weighting is precisely the problem. A viewer does not weight every frame equally. They notice the moment the picture fell apart and barely register the thousands of frames that looked fine. The mean does the opposite of human attention: it lets a long good stretch drown out a short bad one.

Picture a ten-second clip where nine seconds are clean and one second — a hard-to-encode explosion, a fast pan, a scene cut — collapses. At 30 frames per second that is 270 frames at a VMAF of 95 and 30 frames at a VMAF of 30. The arithmetic mean is:

mean = (270 × 95 + 30 × 30) / 300
     = (25,650 + 900) / 300
     = 26,550 / 300
     = 88.5

The clip scores 88.5 — comfortably in "good" territory, the kind of number that clears a quality gate without a second look. Yet one full second of it looked clearly bad, and that is the second your viewer will remember. The mean did not lie about the math; it answered the wrong question. It told you the average frame was fine when you needed to know that the worst second was not.
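The same arithmetic as a short script, for anyone who wants to try other clip shapes. The per-frame scores are the synthetic explosion clip from the text, and the cutoff of 60 for "visibly bad" is a hypothetical value chosen only for illustration.

```python
from statistics import fmean

FPS = 30
scores = [95.0] * 270 + [30.0] * 30  # nine clean seconds, one collapsed second

mean_vmaf = fmean(scores)
# Count how much of the clip sits below a hypothetical "visibly bad" cutoff.
bad_frames = sum(1 for s in scores if s < 60.0)
bad_seconds = bad_frames / FPS

print(f"mean VMAF: {mean_vmaf:.1f}")
print(f"frames below cutoff: {bad_frames} ({bad_seconds:.1f} s)")
```

The mean alone says 88.5; the count beside it says a full second of the clip is below the cutoff. The second figure is the one the mean cannot show.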

[Image: per-frame VMAF timeline with nine seconds near 95, a one-second dip to 30, and the mean line sitting at 88.5 above the dip.]

Figure 2. The mean (88.5) floats above a one-second collapse to VMAF 30. The flat average line is exactly what hides the dip a viewer would notice.

Why the worst moments decide perceived quality

This is not just intuition. The research on how humans pool quality over time points one direction: people weight the worst parts far more than the average. Netflix says so directly in the same FAQ — "there are psycho-visual evidences ... that human opinions tend to weigh more heavily towards the worst-quality frames" — and adds the honest caveat that the best pooling is still an open question that depends on the time scale, "seconds vs minutes" (Netflix VMAF FAQ, 2024).

The academic work is sharper. A study of how local quality scores distribute across space and time found that "severe video distortions that are transient in space and/or time have a large effect on overall perceived video quality," and built a pooling method that deliberately emphasizes the worst scores along both dimensions (Park, Seshadrinathan, Lee, Bovik, IEEE Transactions on Image Processing, 2013). The lesson: a brief, severe glitch hurts the human verdict out of all proportion to how few frames it touched — which is exactly the signal the arithmetic mean throws away.

There is a second human effect worth knowing, the recency effect: viewers weight what they saw most recently more heavily, so a stall or a quality drop near the end of a clip costs more than the same drop in the middle. The standardized streaming-quality model ITU-T P.1203 builds this last-impression weighting into its temporal integration step (ITU-T P.1203, 2017). Recency matters most for long-form streaming, where the streaming QoE metrics take over from the picture metrics; for a short clip you are usually safe to ignore it, but it explains why "average quality" and "remembered quality" are not the same number.
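To make the recency effect concrete, here is a toy weighted mean in which a frame's weight halves for every fixed distance back from the end of the clip. It is an illustration only: the exponential weighting and the half-life value are assumptions made for this sketch, and this is not the integration model defined in ITU-T P.1203.

```python
def recency_weighted_mean(scores, half_life):
    """Weighted mean where a frame's weight halves every `half_life` frames back from the end."""
    last = len(scores) - 1
    weights = [0.5 ** ((last - i) / half_life) for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

good, bad = [95.0] * 135, [30.0] * 30
dip_in_middle = good + bad + good  # the drop happens halfway through
dip_at_end = good + good + bad     # the same drop, in the final second

HALF_LIFE = 90  # hypothetical: three seconds at 30 fps

middle_score = recency_weighted_mean(dip_in_middle, HALF_LIFE)
end_score = recency_weighted_mean(dip_at_end, HALF_LIFE)
print(middle_score, end_score)
```

Both clips contain the same 300 scores, so their arithmetic means are identical. The recency-weighted score is lower for the clip that ends on the dip, which is the "remembered quality" gap described above.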

The pooling toolbox

If the mean buries the worst frames, the fix is to pool in a way that surfaces them. VMAF's command-line tools expose a --pool option with several methods — mean, harmonic_mean, median, min, perc5, perc10, and perc20 — and the FFmpeg libvmaf filter exposes min, harmonic_mean, and mean (Netflix VMAF FAQ, 2024; FFmpeg libvmaf filter documentation). Here is what each one does to our explosion clip, and what it is good for.

The harmonic mean averages the reciprocals of the scores, which mathematically drags the result toward the low values. For our clip:

harmonic mean = 300 / (270/95 + 30/30)
              = 300 / (2.8421 + 1.0000)
              = 300 / 3.8421
              = 78.1

The harmonic mean reports 78.1 instead of 88.5 — more than ten points lower, because the bad second now counts more. It is a gentle correction: the score still reflects the whole clip, but the low frames get a louder vote.
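A quick check of that figure, written out by hand and then cross-checked against Python's standard library. The scores are the synthetic clip from the text.

```python
from statistics import fmean, harmonic_mean

scores = [95.0] * 270 + [30.0] * 30

# Harmonic mean: number of frames divided by the sum of reciprocals.
by_hand = len(scores) / sum(1.0 / s for s in scores)
from_stdlib = harmonic_mean(scores)
gap = fmean(scores) - by_hand

print(f"harmonic mean: {by_hand:.1f}, gap to arithmetic mean: {gap:.1f}")
```

Each bad frame contributes 1/30 to the denominator while each good frame contributes only 1/95, which is why the low scores carry more than three times the weight per frame.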

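Percentile pooling reports the score at the bad end of the distribution. The 5th-percentile score (perc5) is the value that the worst 5% of frames sit at or below; the 10th-percentile score (perc10) is the same cutoff for the worst 10%. In our clip the worst 10% of frames are exactly the bad second, so perc5 reports 30, and so does perc10 under the nearest-rank definition of a percentile. Because 10% lands exactly on the boundary between the bad frames and the good ones, an implementation that interpolates between neighbouring values can return a higher perc10, which is one more reason to state the method you used. The percentile pool tells you the truth the mean hid, and it is the closest single number to "how bad does it get for a noticeable stretch."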
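A minimal nearest-rank percentile, assuming that definition is the one you want. It is a sketch, not the implementation inside any VMAF tool; libraries that interpolate between neighbouring values can give a different answer when the percentile falls exactly on the edge of the bad stretch, as perc10 does in this clip.

```python
import math

def percentile_nearest_rank(scores, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    ordered = sorted(scores)
    # 1-based rank of the smallest value with at least q% of frames at or below it.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

scores = [95.0] * 270 + [30.0] * 30

perc5 = percentile_nearest_rank(scores, 5)
perc10 = percentile_nearest_rank(scores, 10)
perc20 = percentile_nearest_rank(scores, 20)
print(perc5, perc10, perc20)
```

The 20th percentile already lands on a good frame, so a percentile only surfaces a defect that covers at least that share of the clip. Choosing the cutoff is choosing the shortest defect you care about.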

The minimum (min) reports the single worst frame — here, 30. It is the strictest possible pool and the most sensitive to a one-frame fluke, so it is best as an alarm ("did anything ever go badly wrong?") rather than a headline score.

And a warning about the median — the middle value when the scores are sorted. It sounds safe, but it is the worst choice for catching a short defect. In our clip the median is 95, because the bad second is fewer than half the frames, so the middle frame is a good one. The median does not just miss the explosion; it certifies the clip as flawless. Use the median only when you expect the bad frames to be the majority, which for a quality defect they almost never are.
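The strictest pool and the most forgiving one, side by side on the same synthetic clip, plus a second hypothetical clip in which the bad frames are the majority:

```python
from statistics import median

scores = [95.0] * 270 + [30.0] * 30          # one bad second out of ten
mostly_broken = [95.0] * 100 + [30.0] * 200  # hypothetical: bad frames are the majority

clip_min = min(scores)                 # the single worst frame
clip_median = median(scores)           # the middle of the sorted scores
broken_median = median(mostly_broken)

print(clip_min, clip_median, broken_median)
```

The median only moves once more than half the frames are bad, which is the narrow case where it is the right tool.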

[Image: bar chart comparing pooled scores of the same clip: mean 88.5, harmonic 78.1, median 95, perc10 30, min 30.]

Figure 3. The same 300 frames, five pooling methods, five different headline scores from 95 down to 30, and not a pixel changed between them.

One more method appears in the research rather than the default tools: Minkowski pooling, a tunable average with an exponent that controls how hard it leans toward the extreme scores. Exponent 1 is the ordinary mean. A higher exponent gives large values more weight, so the pool leans toward the bad frames when it is computed on a distortion measure (for example, 100 minus the VMAF score) rather than on the raw quality score; around 4 is a common choice. A study that re-pooled VMAF with a Minkowski mean found it "approximates well the subjectively measured QoE" and beat the arithmetic mean on correlation with human scores, confirming from the other direction that the plain average overestimates quality (Batsi, Kondi, IS&T Electronic Imaging, 2020).
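A sketch of Minkowski pooling on the synthetic clip. One assumption needs stating loudly: the power mean here is taken over per-frame distortion, defined as 100 minus the score, and then mapped back to a quality value, because a power mean with an exponent above 1 emphasizes large values. This is an illustrative convention, not necessarily the exact formulation used in the cited study.

```python
def minkowski_pool(scores, exponent, top=100.0):
    """Power mean of per-frame distortion (top - score), mapped back to a quality score."""
    distortions = [top - s for s in scores]
    power_mean = (sum(d ** exponent for d in distortions) / len(distortions)) ** (1.0 / exponent)
    return top - power_mean

scores = [95.0] * 270 + [30.0] * 30

pooled_p1 = minkowski_pool(scores, 1)  # reduces to the arithmetic mean
pooled_p4 = minkowski_pool(scores, 4)  # leans hard toward the worst frames
print(f"p=1: {pooled_p1:.1f}, p=4: {pooled_p4:.1f}")
```

Raising the exponent slides the result from the arithmetic mean toward the minimum, so the exponent acts as a dial for how much a short collapse should cost.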

A caution on the harmonic mean and zero scores

The harmonic mean has a sharp edge worth knowing before you trust it. Because it sums reciprocals, a single frame scoring exactly zero makes one term 1/0, which is undefined, and the whole pool blows up or collapses to zero. VMAF scores can reach zero (the model clips negative predictions), so real implementations quietly shift zero and near-zero scores to a tiny positive value before pooling. That is a reasonable engineering fix, but it means the harmonic mean of a clip with several near-zero frames is sensitive to that offset and should be read with care. When most of a clip is badly broken, prefer a percentile or the minimum, which have no division to break.
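The sensitivity is easy to demonstrate. The function below adds an offset to every score before taking reciprocals and subtracts it afterwards; both the function and the offset values are hypothetical, chosen to show the effect rather than to mirror any specific tool.

```python
def offset_harmonic_mean(scores, offset):
    """Harmonic mean computed on (score + offset), with the offset removed afterwards."""
    shifted_sum = sum(1.0 / (s + offset) for s in scores)
    return len(scores) / shifted_sum - offset

# Hypothetical clip: one second of frames scoring exactly zero.
scores = [95.0] * 270 + [0.0] * 30

tiny_offset = offset_harmonic_mean(scores, 0.001)
unit_offset = offset_harmonic_mean(scores, 1.0)
print(f"offset 0.001: {tiny_offset:.3f}, offset 1.0: {unit_offset:.3f}")
```

The same frames pool to roughly 0.01 with the small offset and to roughly 8 with the larger one. When zeros are present the offset, not the video, is deciding the headline number.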

Spatial pooling: the same trap, one level down

Everything above applies inside a single frame too, and it is where a metric can fool you even when the temporal pooling is fine. Recall that the per-frame SSIM you quote is the mean of a quality map. Suppose a frame is near-perfect across 95% of its area — SSIM 0.99 — but a small overlaid caption is badly compressed at SSIM 0.70. The mean SSIM is:

mean SSIM = 0.95 × 0.99 + 0.05 × 0.70
          = 0.9405 + 0.035
          = 0.976

That 0.976 reads as excellent, yet the text — the part a viewer's eye goes straight to — is visibly mangled. The same percentile and Minkowski ideas fix spatial pooling: weighting the worst regions of the map tracks human judgment better than averaging it, because, as the spatial-pooling literature puts it, low-quality regions are more salient to viewers and largely determine perceived quality (Moorthy, Bovik, SPIE, 2009). When you read about a metric "missing" a localized defect like banding or a text artifact, mean spatial pooling is often the reason — the catalogue of these failures is where objective metrics lie.
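The caption example as code, with the quality map reduced to 100 hypothetical windows so the proportions match the text. The nearest-rank percentile is one reasonable way to read the worst regions of a map; it stands in for the percentile idea in general rather than for a specific published method.

```python
import math
from statistics import fmean

def percentile_nearest_rank(values, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# 95 clean windows and 5 windows covering the mangled caption.
quality_map = [0.99] * 95 + [0.70] * 5

mean_ssim = fmean(quality_map)
perc5_ssim = percentile_nearest_rank(quality_map, 5)
print(f"mean SSIM: {mean_ssim:.3f}, 5th-percentile SSIM: {perc5_ssim:.2f}")
```

The mean stays near 0.976 while the 5th percentile drops to the caption's own score, which is the number that matches what the viewer sees.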

The pooling methods at a glance

Put the temporal pooling methods in one place. The two columns that matter are the last two: what the method actually tells you, and where it misleads if you read it alone.

| Pooling method | Our clip's score | What it tells you | Where it lies / blind spot |
| --- | --- | --- | --- |
| Arithmetic mean | 88.5 | Average frame quality | Buries short bad stretches; the default, and the most misleading for defects |
| Harmonic mean | 78.1 | Average tilted toward low frames | Gentle; breaks on zero scores; still a single summary |
| Median | 95.0 | The middle frame | Ignores any defect shorter than half the clip; usually the wrong tool |
| Percentile (perc5/10) | 30.0 | How bad the worst 5–10% gets | Picks a cutoff you must justify; ignores how often it happens |
| Minimum | 30.0 | The single worst frame | Over-reacts to a one-frame fluke; an alarm, not a score |

Read it as a panel, not a verdict. A trustworthy quality report shows the mean and a worst-case pool (a low percentile or the minimum) together, so a reader sees both the typical quality and the floor. One number alone — whichever you pick — throws away half the story.
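One way to make the panel the default is to have the reporting code refuse to emit a single number. A minimal sketch, assuming a nearest-rank 5th percentile as the worst-case pool; the function names and the output wording are hypothetical.

```python
import math
from statistics import fmean

def percentile_nearest_rank(scores, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def describe_vmaf(scores):
    """Return a report line that always names both pools."""
    mean_score = fmean(scores)
    floor_score = percentile_nearest_rank(scores, 5)
    return f"mean VMAF {mean_score:.1f}, 5th-percentile VMAF {floor_score:.1f}"

scores = [95.0] * 270 + [30.0] * 30
summary = describe_vmaf(scores)
print(summary)
```

Because the pooling method is part of the string, the number cannot be copied into a slide or a ticket without its label.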

Common mistake: reporting a pooled score without naming the pooling. "VMAF 88" is half a sentence. Pooled how — mean, harmonic, 5th percentile? Two engineers measuring the same file with pool=mean and pool=harmonic_mean will report different numbers and both be right, and a comparison between a mean-pooled encode and a percentile-pooled encode is meaningless. The fix is one clause: "mean VMAF 88, 5th-percentile VMAF 30." Always pool both clips the same way, and always say which way — the apples-to-apples rule from reading a quality-metric report starts here.
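Naming the pool is easier when it is an explicit argument rather than a default nobody typed. The helper below only builds an FFmpeg command line; it does not run it. The file names are hypothetical, the accepted pool values are the three this article lists for the libvmaf filter, and the input order (distorted first, reference second) is an assumption to confirm against the documentation for your FFmpeg build.

```python
ALLOWED_POOLS = ("mean", "harmonic_mean", "min")

def build_vmaf_command(distorted, reference, pool):
    """Build an ffmpeg argument list that measures VMAF with an explicit pooling method."""
    if pool not in ALLOWED_POOLS:
        raise ValueError(f"unsupported pool {pool!r}; expected one of {ALLOWED_POOLS}")
    return [
        "ffmpeg",
        "-i", distorted,   # assumed order: the encode under test first
        "-i", reference,   # then the pristine original
        "-lavfi", f"libvmaf=pool={pool}",
        "-f", "null", "-",
    ]

command = build_vmaf_command("encode.mp4", "source.mp4", pool="harmonic_mean")
print(" ".join(command))
```

Making `pool` a required argument means every measurement script records the choice, and a typo fails loudly instead of silently falling back to the mean.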

Which pooling for which job

There is no single correct pool; there is a correct pool for each job. For an encoder comparison or a rate-quality curve, the arithmetic mean is fine and conventional, because you are comparing typical quality and both sides share the bias — just keep the pooling identical across encodes. For a quality gate in CI/CD, gate on a worst-case pool: a low percentile (perc5 or perc10) or the minimum catches the regression a mean would wave through, the logic behind quality gates in CI/CD. For a delivered-quality report to a stakeholder, show the mean and a worst-case pool side by side so the summary is honest. For streaming QoE over a long session, the picture-metric pool is only part of the story — recency and stalls dominate, and that is the streaming QoE layer, not a frame pool.
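A quality gate built on this advice might look like the following sketch. The thresholds (85 for the mean, 70 for the 5th percentile) and the function names are hypothetical; pick values from your own content and viewing conditions.

```python
import math
from statistics import fmean

def percentile_nearest_rank(scores, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def gate_failures(scores, min_mean, min_perc5):
    """Return a list of human-readable reasons the clip fails the gate (empty if it passes)."""
    failures = []
    mean_score = fmean(scores)
    perc5 = percentile_nearest_rank(scores, 5)
    if mean_score < min_mean:
        failures.append(f"mean {mean_score:.1f} is below {min_mean}")
    if perc5 < min_perc5:
        failures.append(f"perc5 {perc5:.1f} is below {min_perc5}")
    return failures

scores = [95.0] * 270 + [30.0] * 30

mean_only = gate_failures(scores, min_mean=85.0, min_perc5=0.0)  # worst-case check disabled
both_pools = gate_failures(scores, min_mean=85.0, min_perc5=70.0)
print(mean_only, both_pools)
```

With only the mean checked, the explosion clip passes; with the worst-case pool added, it fails for the right reason. In a CI job the script would exit non-zero whenever the returned list is not empty.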

[Image: decision guide mapping the job (encoder comparison, quality gate, stakeholder report, streaming) to the right pooling method.]

Figure 4. Pick the pool by the job. Comparisons can use the mean; gates need a worst-case pool; reports show both.

Where Fora Soft fits in

Fora Soft has built video software since 2005 (streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance), and pooling is one of the first things we pin down when we set a quality gate for a client pipeline. The content we measure is full of short, severe events: a camera hand-off in surveillance, a shared-screen redraw in conferencing, a scene cut in OTT. An arithmetic mean would let every one of them through. So our gates read a worst-case pool, not just the average, and our reports always state the pooling method and the percentile alongside the headline mean. We record the pooling choice with every figure in our benchmark methodology, because a quality number you cannot reproduce, including how it was pooled, is not a measurement.

References

  1. Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004. Tier 1 (metric-author defining work). Defines SSIM and its quality map, and defines the single-image score as "mean SSIM" (MSSIM) — the arithmetic spatial pooling of the map. The origin of the spatial-pooling-by-mean convention this article critiques. https://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf
  2. Netflix, "VMAF Frequently Asked Questions" (resource/doc/faq.md), Netflix/vmaf GitHub repository, accessed 2026-06-23. Tier 1 (metric-author primary). States that VMAF's aggregate is the arithmetic mean of per-frame scores by default; that psycho-visual evidence shows humans weight the worst frames more; that optimal pooling is an open question dependent on time scale; and documents the --pool methods mean, harmonic_mean, median, min, perc5, perc10, perc20. https://github.com/Netflix/vmaf/blob/master/resource/doc/faq.md
  3. Recommendation ITU-T P.1203, "Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport," International Telecommunication Union, 2017 (and later editions). Tier 1 (official standard). Its quality-integration module performs temporal pooling of per-one-second scores over a session, explicitly modelling the recency / last-impression effect and the impact of stalling and quality fluctuation — the standardized treatment of pooling for streaming QoE. https://www.itu.int/rec/T-REC-P.1203
  4. Recommendation ITU-T P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, 2023. Tier 1 (official standard). The subjective-test methodology that produces the human Mean Opinion Scores any pooling method is ultimately trying to predict — the ground truth that makes "the worst frames matter more" a measurable claim rather than an opinion. https://www.itu.int/rec/T-REC-P.910
  5. J. Park, K. Seshadrinathan, S. Lee, A. C. Bovik, "Video Quality Pooling Adaptive to Perceptual Distortion Severity," IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 610–620, February 2013. Tier 5 (peer-reviewed). Shows that severe distortions transient in space and/or time dominate perceived quality, and proposes a content-adaptive spatial-and-temporal pooling that emphasizes the worst local scores; the academic foundation for worst-case pooling. https://live.ece.utexas.edu/publications/2013/VQpooling%20TIP.pdf
  6. S. Batsi, L. P. Kondi, "Improved Temporal Pooling for Perceptual Video Quality Assessment Using VMAF," IS&T International Symposium on Electronic Imaging: Human Vision and Electronic Imaging, 2020. Tier 5 (peer-reviewed). Validates that the arithmetic mean of per-frame VMAF conceals bad-quality frames and overestimates delivered quality, and shows a Minkowski-mean pool correlates better with subjective QoE (higher SROCC/PCC, lower RMSE). https://www.cs.uoi.gr/~lkon/papers/EI_20b.pdf
  7. A. K. Moorthy, A. C. Bovik, "Perceptually Significant Spatial Pooling Techniques for Image Quality Assessment," Proc. SPIE 7240, Human Vision and Electronic Imaging XIV, 2009. Tier 5 (peer-reviewed). Demonstrates that weighting the worst (low-quality) spatial regions of a quality map predicts human opinion better than averaging the whole map — the spatial-pooling counterpart to temporal worst-case pooling. https://live.ece.utexas.edu/publications/2009/akm_spie09.pdf
  8. FFmpeg Project, "libvmaf" filter documentation (FFmpeg Filters, current release). Tier 3 (first-party tooling). Documents the libvmaf filter's pool option with values min, harmonic_mean, and mean (default) — the practical interface most engineers use to choose a pooling method when measuring with FFmpeg. https://ffmpeg.org/ffmpeg-filters.html#libvmaf
  9. H. Tu, Y. Wang, N. Birkbeck, B. Adsumilli, A. C. Bovik, "A Comparative Evaluation of Temporal Pooling Methods for Blind Video Quality Assessment," IEEE International Conference on Image Processing (ICIP), 2020. Tier 5 (peer-reviewed). Compares mean, harmonic-mean, percentile, and hysteresis temporal pooling for no-reference video quality, quantifying how much the pooling choice changes correlation with human scores. https://utw10503.utweb.utexas.edu/publications/2020/ICIP2020_PoolingVQA.pdf