Why this matters
VMAF is the quality number the streaming industry now argues in. It decides bitrate ladders, settles "which encoder is better" debates, gates releases, and shows up in every modern FFmpeg quality report — usually quoted as a bare "VMAF 95" with no model named, which is the single most common way engineers fool themselves with it. This article gives you the metric in full: the fusion idea that makes it perceptual, the training that grounds it in human opinion, the score scale and the just-noticeable-difference threshold, the model-selection rules, and the blind spots you must name before you trust a number. It is the deep dive that the encoder-side quality-metrics overview in our Video Encoding section points to.
What VMAF actually is
Start with the problem VMAF was built to solve. PSNR measures pixel error and correlates weakly with what people see. SSIM measures structural similarity and tracks the eye better, but it still measures only one kind of thing, and it saturates near the top of its scale where streaming-grade encodes live. Netflix, grading thousands of encodes a day across a vast catalogue, needed a number that matched human opinion closely enough to drive real bitrate decisions. No single existing metric did the job on all their content.
Their answer was to stop looking for one perfect measurement and instead combine several imperfect ones. The full name says it: Video Multimethod Assessment Fusion. VMAF takes a handful of established quality measurements — each good at catching a different kind of damage — and fuses them with a model that learned, from human ratings, how to weigh them. The result is a single score from 0 to 100 that predicts the Mean Opinion Score a panel of viewers would give the clip (Netflix Technology Blog, "Toward a Practical Perceptual Video Quality Metric", June 2016).
Two facts about VMAF frame everything else. First, like PSNR and SSIM, it is a full-reference metric: it needs the pristine original on disk to compare the impaired video against, frame for frame. No reference, no VMAF — which rules it out for live and user-generated content with no master to compare (the three setups are covered in full-reference, reduced-reference, no-reference metrics). Second, VMAF is trained, not derived. PSNR comes from a formula; VMAF comes from a model fitted to human scores. That is why it is more accurate, and also why it carries baggage a formula never would — a training set, a viewing condition, and a way to be gamed.
The fusion idea: what VMAF combines
Here is the analogy to carry through the article. PSNR is a single judge who only counts spelling mistakes. VMAF is a small panel of specialist judges — one watches for lost information, one watches for lost detail, one watches how the picture moves — and a chairman who learned from experience how much to trust each judge on each kind of footage. The panel beats any single judge because each specialist catches something the others miss.
In the original and long-dominant VMAF design (the "v0" generation, models named vmaf_v0.6.1 and friends), the panel had three kinds of judge.
Visual Information Fidelity, or VIF — how much of the picture's information survived. VIF treats the original frame as a source of visual information and asks how much of that information makes it through compression to your eye, modelled on how the human visual system takes in a scene. It is computed at four spatial scales, from coarse to fine, so it notices both broad and fine losses (Sheikh and Bovik, "Image Information and Visual Quality", IEEE Transactions on Image Processing, 2006). When compression strips away the subtle texture of skin or foliage, VIF is the judge that reacts.
Detail Loss Metric, or DLM — how much fine detail was destroyed, separately from noise that was added. DLM splits the damage into two parts: detail that was lost (which makes the picture less clear) and impairment that was added (extra junk that distracts the eye), and it scores the detail loss on its own (Li, Zhang, Ma, and Ngan, "Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments", IEEE Transactions on Multimedia, 2011). This is the judge that catches blur and the smearing-away of texture.
Motion — how much the picture is moving. VMAF measures the temporal difference between neighbouring frames; in the open-source package this is the mean absolute difference between co-located pixels on the luma (brightness) channel. Motion matters because the eye forgives more distortion when the scene is moving fast; a still frame is scrutinised, a fast pan is not. Feeding motion into the model lets it calibrate how harshly to judge the spatial damage.
Each of these produces a number per frame. On its own, none is VMAF. The point of VMAF is what happens next.
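The motion judge is the easiest of the three to picture in code. What follows is a minimal sketch, assuming each frame arrives as a 2-D list of luma values; the function name `motion_score` and the tiny sample frames are hypothetical, and libvmaf's own implementation differs in detail.

```python
def motion_score(prev_luma, curr_luma):
    """Mean absolute difference of co-located luma pixels in two frames."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev_luma, curr_luma):
        for p, c in zip(prev_row, curr_row):
            total += abs(c - p)
            count += 1
    return total / count


# Two hypothetical 2x3 "frames": identical except one pixel that changed by 30.
frame_a = [[16, 16, 16], [128, 128, 128]]
frame_b = [[16, 16, 46], [128, 128, 128]]

static = motion_score(frame_a, frame_a)  # nothing moved between frames
moving = motion_score(frame_a, frame_b)  # one pixel of six changed by 30
print(static, moving)
```

A static pair scores zero and any change raises the value, which is all the model needs from this feature: a rough sense of how much the scene is moving.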
Figure 1. The classic VMAF (v0) design: per-frame features feed a model trained on human scores, which fuses them into one perceptual number. The 2026 v1 generation changes the feature set — see below.
How the model learned: training on human scores
The judges' scores are combined by a trained model — in the open-source VMAF, a Support Vector Machine regressor, a standard machine-learning model that learns a smooth mapping from a set of input numbers (the feature scores) to one output number (the predicted quality). It is "trained" in the literal sense: it was shown thousands of examples where the feature scores were known and the true quality was measured, and it fitted the weights that best reproduce the true quality from the features (Netflix Technology Blog, 2016).
Where did the "true quality" come from? From people. Netflix built a dataset of clips from its own catalogue, each ten seconds long, encoded at a grid of resolutions and quality levels to produce a wide range of impairments. Human viewers rated each one in a controlled lab test using the Absolute Category Rating method — watch a clip, score it on a scale from bad to excellent — defined in ITU-T P.910. Those raw opinions were cleaned of unreliable voters and noise into stable per-clip scores, then mapped onto the VMAF scale, where roughly "bad" lands near 20 and "excellent" lands at 100 (Netflix VMAF documentation, model notes, accessed 2026-06-23).
This is the whole basis of VMAF's authority, and the reason to respect it: the metric is a model of human opinion, fitted to real human opinion. Remember the central rule of this section — every objective metric is a proxy validated against subjective scores, and the eye is the ground truth (the distinction is laid out in subjective vs objective quality). VMAF wears that relationship on its sleeve: it was literally built by regression onto subjective scores. When VMAF and a careful subjective test disagree, the test wins and the model failed on that content — the same as for any metric.
Reading a VMAF score: the 0-to-100 scale
VMAF runs from 0 to 100. Higher is better; 100 is the ceiling, reached when the encode is perceptually indistinguishable from the source under the model's viewing condition. The scale is designed to be roughly linear with human opinion, which is its great usability advantage over SSIM: a jump from 60 to 70 means about as much perceptual improvement as a jump from 80 to 90, so you can reason about differences arithmetically.
The rough, content-dependent bands are worth committing to memory. Below about 60, most viewers find the quality poor or annoying. Around 70 is fair — watchable but visibly compressed. Around 80 is good. The 90s are where streaming services aim for premium content: at roughly 93 and above, most viewers stop noticing compression under normal viewing. Treat these as orientation, not law — what a given VMAF "looks like" depends on the content, the screen, and the viewer, and only a subjective test is ground truth.
The most useful single fact for working with VMAF is the just-noticeable difference, or JND: about 6 VMAF points. Netflix's guidance is that a difference of roughly 6 points is the threshold at which most viewers (more than half) start to notice a quality change. A delta under about 2 points is in the noise and irrelevant; deltas above 2 begin to matter; deltas of 6 or more are visible to most people (Netflix VMAF guidance, 2017). That single number turns VMAF from an abstract score into a decision tool, as the worked example below shows.
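The three bands in that guidance are simple enough to encode as a rule. This is a sketch under the thresholds quoted above (about 2 and about 6 points); the function name and the band labels are made up for illustration, and the cut-offs are rules of thumb rather than hard limits.

```python
NOISE_FLOOR = 2.0  # below this, the guidance treats a delta as noise
JND = 6.0          # roughly one just-noticeable difference


def classify_vmaf_delta(score_a, score_b):
    """Place the gap between two same-model VMAF scores into a rough band."""
    delta = abs(score_a - score_b)
    if delta < NOISE_FLOOR:
        return "noise"
    if delta < JND:
        return "may matter"
    return "visible to most viewers"


print(classify_vmaf_delta(95.0, 94.0))
print(classify_vmaf_delta(90.0, 86.5))
print(classify_vmaf_delta(90.0, 83.0))
```

The rule only makes sense when both scores come from the same model, pooling, and frames.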
Figure 2. The VMAF scale and its bands are content-dependent rules of thumb. The ~6-point JND is the practical threshold: smaller gaps are below what most viewers notice.
A worked example: turning a VMAF gap into a bitrate decision
Numbers make this concrete. Suppose you are choosing between two encodes of the same master, measured with the same model (default 1080p) and the same pooling (mean), apples-to-apples on the same frames:
- Encode A: 4.5 Mbps, VMAF 95.2
- Encode B: 3.0 Mbps, VMAF 93.4
Work the two questions out loud. First, is the quality difference one a viewer would see? Subtract the scores:
VMAF gap = 95.2 − 93.4 = 1.8 points
That 1.8 is below the ~6-point JND, and even below the ~2-point "irrelevant" floor. Most viewers would not see the difference between A and B. Second, what does choosing B cost in quality and save in bandwidth?
bitrate saving = (4.5 − 3.0) / 4.5 = 1.5 / 4.5 ≈ 0.333 ≈ 33%
Encode B delivers a 33% bitrate saving for a quality drop most people will never notice. Multiply that across millions of streams and it is the difference between an affordable service and an expensive one — which is exactly why Netflix built VMAF, and why a metric that tracks the eye pays for itself (the broader argument is in the business case for measuring quality). The decision rests entirely on the JND: without it, 1.8 points looks like a real difference; with it, you can see the saving is nearly free.
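The same arithmetic can be scripted so it runs against every pair of encodes in a ladder. A minimal sketch using the two encodes from this example; the dictionary layout and the `bitrate_saving` helper are illustrative choices, not part of any tool.

```python
JND = 6.0  # approximate just-noticeable difference, in VMAF points


def bitrate_saving(high_mbps, low_mbps):
    """Fraction of bitrate saved by choosing the lower-rate encode."""
    return (high_mbps - low_mbps) / high_mbps


encode_a = {"mbps": 4.5, "vmaf": 95.2}
encode_b = {"mbps": 3.0, "vmaf": 93.4}

gap = encode_a["vmaf"] - encode_b["vmaf"]
saving = bitrate_saving(encode_a["mbps"], encode_b["mbps"])
below_jnd = gap < JND

print(f"gap {gap:.1f} points, saving {saving:.0%}, below JND: {below_jnd}")
```

Both scores must come from the same model and pooling for the subtraction to mean anything.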
One caveat the example needs. VMAF predictions carry uncertainty, and the 0.6.2 and 0.6.3 models can report a 95% confidence interval computed by bootstrapping the model on its training residuals (Netflix Technology Blog, "VMAF: The Journey Continues", October 2018). If Encode B's score is reported as 93.4 with a 95% interval of, say, ±1.2 points (the range [92.2, 94.6]), the model is telling you its prediction could plausibly sit anywhere in that band. The interval describes the model's own uncertainty, not run-to-run variation: the same tool, model, and frames give the same number every time. If Encode A carries a similar interval, the two ranges overlap, so a 1.8-point gap cannot be told apart from noise. Reporting VMAF without its interval, then treating a small difference as real, is a classic error. VMAF gives you error bars; use them. The practical depth on models, pooling, and confidence intervals is the subject of VMAF in depth.
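A quick way to apply the caveat is to check whether the two intervals overlap. This is a rough screen rather than a formal significance test, and it assumes Encode A has the same ±1.2 half-width as Encode B, which the example does not state; the helper names are hypothetical.

```python
def interval(score, half_width):
    """Return (low, high) for a score with a symmetric interval."""
    return (score - half_width, score + half_width)


def overlaps(a, b):
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


ci_b = interval(93.4, 1.2)
ci_a = interval(95.2, 1.2)  # assumption: A's half-width is not given in the text

print(overlaps(ci_a, ci_b))
```

With these numbers the intervals overlap; narrow the half-width to 0.5 and they separate, which is the point at which the 1.8-point gap would start to look real.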
A VMAF score is meaningless without its model
This is the point that separates people who use VMAF well from people who quote it badly. "VMAF 95" is not a quality statement until you name the model. The default model was trained for one specific viewing condition — a 1080p HDTV in a living room, viewed from three times the screen height. Change the screen or the distance and the right model changes, because the same distortion is more or less visible depending on how big the picture is and how close you sit.
Netflix ships several models for exactly this reason (Netflix VMAF model documentation, accessed 2026-06-23):
- The default model (vmaf_v0.6.1) predicts quality on a 1080p TV at three times the screen height. It is the right default for most living-room streaming.
- The phone model, invoked with a phone-viewing option, predicts quality on a small handset screen. Because a phone is small and held at a comfortable distance, the same compression is less visible, so the phone model reports higher scores than the default for the same encode. An encode that scores 88 on the default model can score in the high 90s on the phone model, and both are correct for their viewing conditions.
- The 4K model (vmaf_4k_v0.6.1, added 2018) predicts quality on a 4K TV viewed from 1.5 times the screen height, the distance at which a viewer can actually appreciate 4K sharpness.
The rule is strict: a VMAF comparison is only valid when every number uses the same model, computed by the same tool and version, on the same frames at the same resolution. Compare a phone-model score against a default-model score and you are comparing two different questions. When you report VMAF, state the model — "VMAF 93.4, default v0.6.1 model, mean pooling" — never just "VMAF 93.4".
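One way to make the rule hard to break is to carry the model and pooling alongside every score. The sketch below is one possible shape, not an established API; the class name `VmafReport` and its fields are hypothetical, and it checks only model and pooling, leaving tool version, frames, and resolution to the surrounding pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmafReport:
    score: float
    model: str
    pooling: str

    def label(self):
        """Format the score the way it should be quoted."""
        return f"VMAF {self.score:.1f}, {self.model} model, {self.pooling} pooling"

    def gap_to(self, other):
        """Score difference, refused when the conditions do not match."""
        if (self.model, self.pooling) != (other.model, other.pooling):
            raise ValueError("different model or pooling; scores are not comparable")
        return self.score - other.score


a = VmafReport(95.2, "default v0.6.1", "mean")
b = VmafReport(93.4, "default v0.6.1", "mean")
print(a.label())
print(round(a.gap_to(b), 1))
```

Comparing `a` against a phone-model report raises `ValueError` instead of quietly returning a number that answers two different questions.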
Figure 3. The same encode scores differently under each model because each predicts a different viewing condition. Name the model, or the number is ambiguous.
VMAF v1: the 2026 generation
In June 2026 Netflix released a new generation of models, VMAF v1, the first substantial change to the feature set since the metric launched. It matters for anyone reading or quoting VMAF today, because v1 fixes several of v0's best-known blind spots — and because the two generations are not interchangeable (Netflix Technology Blog, "VMAF v1: Good Is Not Good Enough", June 2026; Netflix VMAF v1 model documentation).
Four changes stand out. First, v1 adds a banding detector. Plain v0 was largely blind to banding — the staircase steps that appear in what should be a smooth sky or gradient — so an encode could band visibly while VMAF stayed high. V1 folds in CAMBI, Netflix's banding-artifact detector, as a feature, so the score now reacts to banding (banding itself is covered in banding: when smooth gradients break into steps). Second, v1 sees colour. V0 used only luma (brightness) features and was unaware of chroma artifacts like colour bleeding; v1 adds chroma features so colour damage now affects the score. Third, v1 handles viewing distance more cleanly: rather than bolting on separate phone and 4K models after the fact, v1 adjusts its feature calculations for the normalised viewing distance directly, then trains one consistent model that is reapplied for phone (5H), 4K at 1.5H, and a consumer 4K-at-3H condition — the last using an extended [0, 110] range to quantify the extra benefit of 4K over 1080p at the same distance. Fourth, and surprising: v1 drops VIF. The information-fidelity feature that anchored v0 was computationally expensive and, once the other features were improved, no longer added meaningful accuracy — so removing it made VMAF both more accurate and faster.
Two practical notes. V1 is best computed at 10-bit precision even for 8-bit SDR content, because the extra precision helps it see banding; you can measure an 8-bit encode at 10 bits by preprocessing both inputs. And v1 ships dedicated high-frame-rate variants for ~50/60 fps content, using a wider five-frame motion window to fix the under-prediction v0 showed on fast content. The headline for a working engineer: if you are starting a new measurement program in 2026, evaluate v1; if you have a history of v0 numbers, do not mix the two in one comparison — re-baseline.
NEG mode: when the metric can be gamed
VMAF has one property that is a feature in delivery and a trap in codec comparison: it rewards image enhancement. Because it was trained to predict what looks good, sharpening or boosting contrast before encoding can raise the VMAF score, even though nothing about the compression improved — the encoder just learned to please the metric (Netflix Technology Blog, "Toward a Better Quality Metric for the Video Community", December 2020).
That is fine when you really did enhance the picture for viewers, but it corrupts an encoder-versus-encoder test, where you want to measure compression gain alone. Netflix's fix is VMAF-NEG — No Enhancement Gain — a mode that disables the score boost from enhancement so a sharpening trick cannot inflate the result. Use the standard model to predict delivered quality; use NEG when comparing encoders so nobody can game the comparison. This "gaming the metric" problem, and exactly when to report NEG, is the subject of its own article, VMAF-NEG explained.
How to compute VMAF with FFmpeg and libvmaf
You will almost never run the model yourself; the everyday path is FFmpeg with the libvmaf filter, which wraps Netflix's library. The filter takes two inputs — the distorted clip first, the reference second — and writes a log with the per-frame and pooled VMAF. A current invocation (FFmpeg 7.x, libvmaf 3.x) looks like this:
# VMAF of a distorted encode against its reference master.
# Distorted is the FIRST input, reference the SECOND.
# Writes a JSON log with per-frame and pooled VMAF.
ffmpeg -i distorted.mp4 -i reference.mp4 \
-lavfi "libvmaf=log_path=vmaf.json:log_fmt=json:n_threads=4" \
-f null -
Two rules the command hides. First, the two inputs must match in resolution and frame rate; if the distorted clip was encoded at a lower resolution, scale it up to the reference resolution first, or the numbers are meaningless. Second, if you do not name a model, you get the default (vmaf_v0.6.1) — so to use the phone, 4K, or v1 models, point the filter at the model explicitly:
# Scale the distorted clip to the reference resolution (assumed 1920x1080 here),
# then use an explicit model file (here a v1 model) and also emit PSNR alongside.
ffmpeg -i distorted.mp4 -i reference.mp4 \
-lavfi "[0:v]scale=1920:1080:flags=bicubic[dis];[dis][1:v]libvmaf=\
model='path=/usr/share/model/vmaf_v1.0.16/vmaf_v1.0.16_3d0h.json':\
feature='name=psnr':log_path=vmaf.json:log_fmt=json" \
-f null -
The output JSON carries a VMAF value per frame plus the pooled score. The pooled score defaults to the mean; the harmonic mean or the minimum can be reported instead, and a low percentile computed from the per-frame values surfaces the worst moments. That choice matters as much as the model, and is covered in pooling: per-frame to one number. The full FFmpeg-and-libvmaf workflow, with model selection and output parsing, lives in measuring quality with FFmpeg and libvmaf; the encoder-side quick reference is in Video Encoding's FFmpeg cheat sheet.
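Reading the log takes a few lines of Python. The sketch below assumes the JSON layout written by recent libvmaf builds, with a top-level `frames` list whose entries hold a `metrics` object, plus a `pooled_metrics` object; check those key names against your own log before relying on them. The sample values are invented so the snippet runs on its own.

```python
import json

# Hypothetical three-frame log in the assumed libvmaf JSON layout.
SAMPLE_LOG = """
{
  "frames": [
    {"frameNum": 0, "metrics": {"vmaf": 96.0}},
    {"frameNum": 1, "metrics": {"vmaf": 94.0}},
    {"frameNum": 2, "metrics": {"vmaf": 92.0}}
  ],
  "pooled_metrics": {"vmaf": {"min": 92.0, "max": 96.0, "mean": 94.0}}
}
"""


def read_vmaf_log(text):
    """Return (per-frame VMAF list, pooled mean) from a libvmaf JSON log."""
    log = json.loads(text)
    per_frame = [frame["metrics"]["vmaf"] for frame in log["frames"]]
    pooled_mean = log["pooled_metrics"]["vmaf"]["mean"]
    return per_frame, pooled_mean


per_frame, pooled_mean = read_vmaf_log(SAMPLE_LOG)
print(len(per_frame), pooled_mean, min(per_frame))
```

For a real run, pass the contents of the file named in `log_path`, for example `pathlib.Path("vmaf.json").read_text()`.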
To make the score concrete at your desk, we built a small, dependency-light script that runs VMAF on a pair of files, then parses the JSON to print the pooled score three ways — mean, harmonic mean, and 5th-percentile (the worst moments) — alongside the per-frame minimum and the implied just-noticeable-difference band, so you can see how much the single headline number hides. Download the VMAF measurement-and-pooling script (Python) and run it against your own encodes.
Where VMAF still lies
VMAF is the best widely-deployed full-reference metric, not an oracle. Naming its blind spots is the price of using it well, and the honest core of this section.
First, a bare VMAF number is ambiguous — without the model, pooling, and content named, it cannot be interpreted or compared, as the model section showed. This is less a flaw in VMAF than the most common flaw in how people quote it.
Second, it can be gamed by enhancement, which is why NEG exists. An encoder tuned to maximise VMAF can learn to please the metric in ways that do not please a viewer — the general hazard of optimising for any metric instead of the eye.
Third, it is trained, so it is weakest off its training distribution. VMAF learned from a particular kind of content — professionally-produced video clips. On content unlike its training set — screen recordings and text, video-game capture, heavy film grain, animation, extreme high dynamic range — its predictions are less reliable, and it can rank encodes in an order a viewer would not. The v0 generation was additionally blind to banding and colour; v1 narrows but does not erase these gaps. Which content fools which metric is catalogued in where objective metrics lie.
Fourth, mean pooling hides the worst seconds. A clip with one badly broken passage and a lot of clean footage can still post a high mean VMAF, because the good frames outvote the bad. When a localised failure matters, read a low percentile or the per-frame minimum, not just the mean.
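A synthetic clip shows the effect. The per-frame scores below are invented: 90 clean frames and 10 badly broken ones. The pooling helpers are a sketch; the percentile uses the nearest-rank rule, and the +1 offset in the harmonic mean is an assumption made here to guard against a zero score, so results may differ slightly from what a given tool reports.

```python
import math


def mean(scores):
    return sum(scores) / len(scores)


def harmonic_mean(scores):
    # Offset by 1 so a frame that scores 0 cannot cause a division by zero.
    return len(scores) / sum(1.0 / (s + 1.0) for s in scores) - 1.0


def low_percentile(scores, pct=5):
    # Nearest-rank percentile: the value below which about pct% of frames fall.
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical clip: 90 clean frames at 95, 10 broken frames at 40.
scores = [95.0] * 90 + [40.0] * 10

print("mean:", mean(scores))                      # 89.5
print("harmonic mean:", round(harmonic_mean(scores), 1))
print("5th percentile:", low_percentile(scores))  # 40.0
print("minimum:", min(scores))                    # 40.0
```

The mean stays near 90 while the 5th percentile and the minimum sit at 40, so the headline number alone would hide the broken passage.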
Fifth, it is full-reference, so it is useless for live and user-generated content where there is no pristine master to compare against — that hard case needs the no-reference metrics covered in no-reference quality for live and UGC.
Figure 4. Why VMAF tracks the eye better: against human Mean Opinion Scores, VMAF's predictions cluster closer to the ideal diagonal than PSNR's. The fit is measured by correlation coefficients, not assumed.
How VMAF compares with PSNR and SSIM
VMAF is the most perception-accurate of the three full-reference metrics you meet constantly, and the most demanding to quote correctly. The fastest way to hold all three in your head is side by side, with the columns that actually matter — what each one measures, and where each one lies.
| Metric | Scale | What it measures | Where it lies (blind spot) | Best for |
|---|---|---|---|---|
| PSNR | dB (≈20–50, ∞ if identical) | Average pixel error vs the original | Ignores where errors land; weak perceptual match | Same-codec comparison, RDO, regression gates |
| SSIM | 0–1 (higher better) | Structural similarity: luminance, contrast, structure | Luma-only, per-frame, saturates near 1; implementation-dependent | A better-than-PSNR structural proxy; quality maps |
| VMAF | 0–100 (higher better) | Fused, perception-trained quality (model-dependent) | Meaningless without the model named; gameable; trained-distribution limits | Predicting viewer-perceived streaming quality at scale |
Table 1. The three full-reference metrics at a glance. All need the original; they differ in how close they get to the eye and what they miss. Pick the row by the job, and always name the conditions — the tool and version for SSIM, the model and pooling for VMAF.
This is the deep dive on the bottom row. The encoder-operator's one-screen version of all three is in Video Encoding's quality-metrics article; when you are choosing between metrics for a specific job, choosing the right metric is the decision guide, and how each is graded against the eye is in validating metrics against human scores.
Common mistakes with VMAF
Mistake: quoting "VMAF 95" with no model named. The default, phone, and 4K models give different scores for the same encode because they predict different viewing conditions. A VMAF number without its model, pooling, and content is uninterpretable. Always report "VMAF 95, default v0.6.1, mean pooling".
Mistake: treating a sub-JND gap as a real difference. About 6 VMAF points is the just-noticeable difference; under ~2 points is noise. A 1-point "win" between two encodes is almost certainly not visible — and may be inside the model's confidence interval. Compare gaps against the JND, not against zero.
Mistake: using the standard model to compare encoders. Sharpening can inflate standard VMAF without improving compression. For encoder-versus-encoder tests use VMAF-NEG, which disables the enhancement gain, so the comparison measures compression alone.
Mistake: mixing VMAF versions or pooling methods. A v0 number and a v1 number are not comparable; nor are a mean-pooled score and a harmonic-mean-pooled score. Apples-to-apples means same model, same version, same pooling, same frames, same resolution.
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and VMAF is the metric we reach for when the question is "how good will this look to the viewer," because it is the closest widely-deployed proxy for the eye. We use it to set quality targets for bitrate ladders, to gate encoder changes before they reach users, and to put a defensible number on delivered quality — always with the model, pooling, and content stated, and always read against the ~6-point JND rather than chased to the last decimal. We move to VMAF-NEG for encoder comparisons so nobody games the result, and to a low percentile when the worst seconds matter more than the average. Our benchmark methodology records the exact model and pooling behind every VMAF figure we publish, because a quality number you cannot reproduce is marketing, not measurement.
What to read next
- VMAF in depth: models, phones, 4K, and confidence intervals
- VMAF-NEG: the no-enhancement-gain model and why it exists
- SSIM explained: structural similarity and why it beats PSNR
References
- Netflix Technology Blog, "Toward a Practical Perceptual Video Quality Metric," June 6, 2016. Tier 1 (metric-author defining work). The launch and open-sourcing of VMAF: the fusion idea, the SVM regressor, the VIF/DLM/motion features, and training on the NFLX subjective dataset. https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652
- Netflix Technology Blog, "VMAF: The Journey Continues," October 25, 2018. Tier 1 (metric-author). Adds the phone and 4K models, the bootstrapped 95% confidence interval (models 0.6.2/0.6.3), and best-practice guidance on pooling. https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12
- Netflix Technology Blog, "Toward a Better Quality Metric for the Video Community," December 7, 2020. Tier 1 (metric-author). The libvmaf v2 API redesign, speed work, and the introduction of NEG (No Enhancement Gain) mode for codec evaluation. https://netflixtechblog.com/toward-a-better-quality-metric-for-the-video-community-7ed94e752a30
- Netflix Technology Blog, "VMAF v1: Good Is Not Good Enough," June 2026. Tier 1 (metric-author). The v1 generation: CAMBI banding feature, chroma features, normalised-viewing-distance feature adjustment, removal of VIF, and the accuracy/speed gains. https://medium.com/netflix-techblog/vmaf-v1-good-is-not-good-enough-60d7e4244ea8
- Netflix/vmaf repository, "Models" and "Models (v1)" documentation (resource/doc/models.md, resource/doc/models_v1.md), accessed 2026-06-23. Tier 1/Tier 3 (metric-author reference implementation). The default/phone/4K viewing conditions and score mapping; the v1.0.16 model files, score ranges ([0,100] and [0,110]), 10-bit guidance, CAMBI encode-side parameters, and HFR variants. https://github.com/Netflix/vmaf/blob/master/resource/doc/models_v1.md
- H. R. Sheikh and A. C. Bovik, "Image Information and Visual Quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006. Tier 1 (metric-author defining paper). Defines Visual Information Fidelity (VIF), the information-fidelity feature in VMAF v0. https://ieeexplore.ieee.org/document/1576816
- S. Li, F. Zhang, L. Ma, and K. N. Ngan, "Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments," IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, 2011. Tier 1 (metric-author defining paper). Defines the Detail Loss Metric (DLM/ADM), the detail-loss feature in VMAF. https://ieeexplore.ieee.org/document/5765502
- Recommendation ITU-T P.1401 (01/2020), "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models." International Telecommunication Union. Tier 1. Defines PCC, SROCC, and RMSE — how VMAF and every objective metric is graded against subjective MOS. https://www.itu.int/rec/T-REC-P.1401
- Recommendation ITU-R BT.500-15 (2023), "Methodologies for the subjective assessment of the quality of television images." International Telecommunication Union. Tier 1. The subjective ground truth against which VMAF was trained and is validated. https://www.itu.int/rec/R-REC-BT.500
- FFmpeg, "libvmaf filter documentation," accessed 2026-06-23. Tier 3 (first-party tooling). The libvmaf filter (distorted-then-reference input order, model, feature, log_path/log_fmt, n_threads, pooling), the default vmaf_v0.6.1 model, and the input-matching requirement. https://ffmpeg.org/ffmpeg-filters.html#libvmaf
- Netflix/vmaf repository (libvmaf), release notes and CHANGELOG, accessed 2026-06-23. Tier 3 (metric-author tooling). libvmaf v2.0.0 (.json model format), v3.0.0 (Dec 2023, deprecated-API removal, CUDA), and the threading improvements that benefit v0 and v1 models. https://github.com/Netflix/vmaf/releases
- J. Ozer, "Finding the Just Noticeable Difference with Netflix VMAF," Streaming Learning Center, accessed 2026-06-23. Tier 6 (educational, orientation only). Source for the widely-cited ~6-point JND rule of thumb (delta <2 irrelevant, >6 noticeable by most viewers); the underlying guidance is Netflix's. https://streaminglearningcenter.com/codecs/finding-the-just-noticeable-difference-with-netflix-vmaf.html


