Beyond VMAF: ITU-T P.1204, AVQT & Next-Gen Metrics

Why this matters

If you only know VMAF, you will reach for it in situations it was never built for: scoring a live stream where you have no pristine reference, judging how a clip looks on a phone held at arm's length, or ranking a neural codec whose artifacts VMAF has never seen. Each of those is a different question, and there is now a metric aimed at each one. This article is for the video engineer, encoding lead, or QA engineer who has standardized on VMAF and needs to know what else exists, when the alternative is actually better, and when the newer option is still a research toy. It assumes you have read VMAF explained and the model and pooling discipline in VMAF in depth; the short encoder-side version of the metrics lives in our Video Encoding section's quality-metrics overview.

Three things VMAF cannot do

Start with what VMAF is, so the gaps are obvious. VMAF — Video Multimethod Assessment Fusion, Netflix's machine-learned perceptual score on a 0–100 scale — is a full-reference metric: it needs the pristine original frame and the compressed frame side by side, and it compares them pixel region by pixel region. That design is exactly right for a codec comparison on your own content, and it is why VMAF became the industry default. But the same design closes three doors.
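The full-reference requirement shows up directly in how VMAF is usually invoked. As a minimal sketch, assuming an FFmpeg build with the libvmaf filter enabled, the command below needs two inputs, the distorted encode and its reference; the file names are hypothetical placeholders, and the helper only builds the command without running it.

```python
from pathlib import Path


def vmaf_command(distorted: Path, reference: Path, log_path: Path) -> list[str]:
    """Build an ffmpeg command line that scores `distorted` against `reference`."""
    # libvmaf reads the distorted stream as its first input and the reference
    # as its second. The two should match in resolution and frame timing, or
    # be scaled in the filter graph before they reach libvmaf.
    return [
        "ffmpeg", "-hide_banner",
        "-i", str(distorted),
        "-i", str(reference),
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",
    ]


# Hypothetical file names, for illustration only.
cmd = vmaf_command(Path("encode.mp4"), Path("source.mp4"), Path("vmaf.json"))
print(" ".join(cmd))
```

Remove the second `-i` and there is nothing to compare against, which is the first of the three gaps described next.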

First, VMAF needs the original. A metric that compares the compressed frame to the pristine source — the meaning of "full-reference" — is useless the moment you do not have the source: a live broadcast you are monitoring downstream, a user-generated clip that arrived already compressed, or any measurement point inside a content-delivery network where the master never travels. The metrics that work without the original are called no-reference (or "blind"); the taxonomy is laid out in full-reference, reduced-reference, no-reference.

Second, VMAF scores the picture, not the viewing. The same encode looks different on a 65-inch television three feet away than on a phone at arm's length, because distance and display size hide or expose artifacts. VMAF's phone and 4K models gesture at this, but each is a fixed model trained for one assumed viewing condition, not a model parameterized by your screen and your viewing distance.

Third, VMAF was trained on the codecs of its era. Its model learned the relationship between its features and human opinion on the block-based, DSP-style codecs of the 2010s (H.264, HEVC, VP9, AV1). Neural codecs that reconstruct frames with deep networks produce different-looking errors — invented detail, smeared motion — and VMAF has no training for them. This is the frontier the learned successors are chasing.

[Image: a two-axis map placing PSNR, SSIM, VMAF, AVQT, and the ITU-T P.1204 models by how much of the original they need and what they measure.]
Figure 1. The measurement landscape. VMAF and AVQT are full-reference picture metrics; P.1204.3 and P.1204.5 read quality with no original; the learned metrics are the emerging edge.

ITU-T P.1204: quality from the bitstream, no original required

The first family that fills a real VMAF gap is ITU-T P.1204, a set of standardized video-quality models published by the International Telecommunication Union in January 2020, built for streaming over reliable transport at resolutions up to 4K/UHD-1 (ITU-T P.1204, 2020). It is not one model but three, distinguished by what they are allowed to look at.

The headline member is P.1204.3, a bitstream-based model. Instead of decoding the video and comparing pixels, it parses the compressed bitstream itself — reading the quantization parameters, motion vectors, frame sizes, and transform-coefficient statistics the encoder wrote — and feeds those features into a machine-learned model that predicts a Mean Opinion Score, the average rating a panel of humans would give, on a 1–5 scale (ITU-T P.1204.3, 2020; Rao et al., IEEE QoMEX 2020). Because it never needs the pristine original and never decodes to pixels, it is a no-reference metric in the strict sense, and it is light enough to run anywhere the bitstream travels: at the origin, on a CDN node, inside a network probe, or on the client device.
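To make "reading the bitstream instead of the pixels" concrete, here is a rough sketch that summarizes packet sizes from ffprobe's packet-level JSON output. It is not the P.1204.3 parser or its feature set: the real model reads much deeper syntax (QP, motion vectors, coefficient statistics). The ffprobe arguments, the `input.mp4` name, and the sample data are illustrative assumptions.

```python
import json
from statistics import mean

# Arguments that would produce the JSON parsed below (not executed here).
# Packet-level inspection needs neither a reference nor decoded frames.
FFPROBE_ARGS = [
    "ffprobe", "-v", "error", "-select_streams", "v:0",
    "-show_entries", "packet=size,flags", "-of", "json", "input.mp4",
]


def packet_size_summary(ffprobe_json: str) -> dict[str, float]:
    """Mean packet size for keyframes versus all other packets."""
    packets = json.loads(ffprobe_json)["packets"]
    # ffprobe marks keyframe packets with a flags string that starts with "K".
    key = [int(p["size"]) for p in packets if p.get("flags", "").startswith("K")]
    other = [int(p["size"]) for p in packets if not p.get("flags", "").startswith("K")]
    return {
        "packets": float(len(packets)),
        "mean_key_bytes": float(mean(key)) if key else 0.0,
        "mean_non_key_bytes": float(mean(other)) if other else 0.0,
    }


# Hypothetical three-packet sample standing in for real ffprobe output.
sample = json.dumps({"packets": [
    {"size": "42000", "flags": "K_"},
    {"size": "6000", "flags": "__"},
    {"size": "4000", "flags": "__"},
]})
summary = packet_size_summary(sample)
print(summary)
```

Statistics like these are far cruder than what the standardized model consumes, but they show why a bitstream approach can run on a CDN node or a probe: everything it needs is already in the stream.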

The other two trade that freedom for more input. P.1204.4 is a pixel-based model that does use the reference (a reduced-reference design that behaves close to full-reference), for cases where you have decoded pixels and want pixel-level accuracy. P.1204.5 is a hybrid no-reference model that combines bitstream data with the received decoded pixels but still no original — the single best performer in its category in the standard's evaluation (ITU-T P.1204, 2020; TU Ilmenau / Telecommunication-Telemedia-Assessment). All three were trained and validated in the P.NATS Phase 2 competition run jointly by ITU-T Study Group 12 and the Video Quality Experts Group (VQEG), using ratings from more than 600 subjects across roughly 5,000 sequences.

[Image: the ITU-T P.1204 family (bitstream P.1204.3, pixel P.1204.4, hybrid P.1204.5) feeding short-term video quality into the P.1203 streaming-session QoE framework.]
Figure 2. The P.1204 family by input type, and how it plugs into P.1203's session-level Quality of Experience model alongside the audio and integration modules.

The number that should make you look twice

Here is the result that earns P.1204.3 a place in this article. The model's developers at TU Ilmenau scored it against the open AVT-VQDB-UHD-1 database — 756 sequences and 19,620 human ratings, deliberately held out of the standard's own training — and compared it to PSNR, SSIM, MS-SSIM, and VMAF on the same clips (TU Ilmenau technical report; reported by Ozer, Streaming Learning Center, 2020).

The agreement with human opinion is measured with the Pearson correlation coefficient, a number from −1 to 1 where 1 means the metric tracks human scores perfectly; the statistic is explained in validating metrics against human scores. P.1204.3 scored 0.942. VMAF scored 0.873. The gap is easy to read:

P.1204.3 PCC   0.942
VMAF PCC       0.873
difference     0.069  → the bitstream model tracked the eye more closely

Sit with what that means. A model that never saw the original file and never decoded a single pixel predicted human opinion more accurately than full-reference VMAF, which had the pristine source to compare against. The reason is that the bitstream carries the encoder's own confession — the quantization it applied, the motion it gave up on — and a model trained on that confession can be remarkably honest. The catch, and it is a real one, is that P.1204.3 is defined only for H.264/AVC, HEVC, and VP9 bitstreams; it has no parser for AV1, and it cannot score content it cannot parse.
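For readers who want to see how a figure like 0.942 is produced, the Pearson coefficient can be computed in a few lines. This is a minimal sketch with invented scores for five clips; it does not reproduce the AVT-VQDB-UHD-1 evaluation, and the variable names are hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented example data: human MOS (1-5) and a metric's scores (0-100).
human_mos = [1.8, 2.6, 3.1, 3.9, 4.5]
metric_scores = [35.0, 58.0, 61.0, 84.0, 90.0]
pcc = pearson(human_mos, metric_scores)
print(round(pcc, 3))
```

Pearson correlation is unchanged by linear rescaling, which is why a 0 to 100 metric and a 1 to 5 metric can both be correlated against the same human ratings without converting one scale into the other.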

Where P.1204 actually fits: streaming QoE

P.1204 was not designed in isolation. It plugs into ITU-T P.1203, the first standardized model of streaming Quality of Experience — the quality of the whole viewing session, not one clip (ITU-T P.1203; TU Ilmenau). P.1203 combines a short-term video module, an audio module, and an integration module that folds in the things that ruin a session but never touch a single frame's fidelity: the initial loading delay, rebuffering stalls, and quality switches. The picture metrics in connecting objective metrics to QoE and the delivery side in our Video Streaming section's adaptive-bitrate explainer are the two halves this framework joins. That is the deeper point: VMAF answers "how good is this frame"; the P.1203/P.1204 stack answers "how good was the session," which is the question a streaming operator actually has.

Apple AVQT: scoring the viewing, not just the picture

The second alternative comes from Apple. The Advanced Video Quality Tool (AVQT), introduced at WWDC 2021, is a macOS command-line tool that estimates the perceptual quality of a compressed video on a 1-to-5 scale that mirrors a Mean Opinion Score, where 1 is bad and 5 is excellent (Apple, WWDC21). Like VMAF and PSNR, it is full-reference: it takes the source and the compressed file and reports both per-frame scores and per-segment scores (a segment defaults to six seconds).

Two things make AVQT worth knowing about. The first is viewing-setup awareness. AVQT takes the display size, the display resolution, and the viewing distance as explicit inputs, and adjusts the predicted quality to match how a real viewer in that setup would perceive it — because the same artifact that is glaring on a monitor at arm's length is invisible on the same content viewed from across the room. Apple's own example: hold the content and display fixed and move the viewer from 1.5 times the screen height back to 3 times the screen height, and the AVQT score rises, because the far viewer simply cannot see the artifacts the near viewer can (Apple, WWDC21). VMAF has no equivalent knob.
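Apple has not published how AVQT turns the viewing setup into a score, so the following is plain display geometry rather than AVQT's model. It sketches why distance masks artifacts: the farther back the viewer sits, the more pixels fall inside each degree of visual angle, and the finer the detail that the eye has to resolve. The function name and the 1080-line display are illustrative assumptions.

```python
from math import atan, degrees


def pixels_per_degree(vertical_pixels: int, distance_in_heights: float) -> float:
    """Average vertical pixels per degree of visual angle.

    `distance_in_heights` is the viewing distance as a multiple of the
    screen height (1.5 means 1.5 times the screen height).
    """
    # Full vertical angle subtended by the screen at the viewer's eye.
    screen_angle_deg = degrees(2 * atan(1 / (2 * distance_in_heights)))
    return vertical_pixels / screen_angle_deg


near = pixels_per_degree(1080, 1.5)  # roughly 29 pixels per degree
far = pixels_per_degree(1080, 3.0)   # roughly 57 pixels per degree
print(f"1.5H: {near:.1f} ppd, 3H: {far:.1f} ppd")
```

Moving from 1.5 to 3 screen heights nearly doubles the pixel density seen by the eye, which is consistent with the direction of the score change in Apple's example.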

The second is speed and format reach. AVQT is built on Apple's AVFoundation, so it ingests compressed files directly with no separate decode-to-raw step, and it offloads the heavy pixel math to the GPU through Metal. Apple quotes roughly 175 frames per second on 1080p content. The arithmetic is worth doing once:

10-minute 1080p clip at 24 fps
  = 10 min × 60 s × 24 fps
  = 14,400 frames
14,400 frames ÷ 175 fps ≈ 82 s ≈ 1.4 minutes to score
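The same arithmetic as a reusable helper, in case you want to budget scoring time for a whole catalogue. This is a sketch: the 175 frames-per-second figure is Apple's quoted number for 1080p, and real throughput will depend on hardware and content.

```python
def scoring_time_seconds(duration_s: float, fps: float, throughput_fps: float) -> float:
    """Estimated time to score a clip, given the tool's frame throughput."""
    total_frames = duration_s * fps
    return total_frames / throughput_fps


# The article's example: a 10-minute clip at 24 fps, scored at 175 fps.
frames = 10 * 60 * 24
seconds = scoring_time_seconds(duration_s=10 * 60, fps=24, throughput_fps=175)
print(f"{frames} frames, about {seconds:.0f} s ({seconds / 60:.1f} min)")
```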

AVQT also natively handles high-dynamic-range formats — HDR10, HLG, and Dolby Vision — which full-reference pipelines often mishandle, and it works across content types (animation, natural scenes, sports) where Apple showed PSNR failing: two clips with the same PSNR of about 35 received AVQT scores of 4.4 and 2.5, and the lower one was the one with visible facial artifacts (Apple, WWDC21). The 2022 update added the ability to score chosen time windows, an interactive HTML report, and a frame-distribution view across Bad/Poor/Fair/Good/Excellent bands (Apple, WWDC22).

[Image: AVQT takes the source, the compressed file, and the viewing setup, and outputs frame and segment scores on a 1-to-5 opinion scale that rises as viewing distance grows.]
Figure 3. AVQT's distinguishing input: the viewing setup. The same encode scores higher when the viewer sits farther away, because distance masks artifacts.

The honest limits matter as much as the strengths. AVQT runs only on macOS, which makes it awkward to scale inside a Linux CI/CD encoding pipeline; teams resort to Mac hardware or cloud Mac instances. It is closed-source with no published model details, so you cannot inspect how a score was reached or plug in a custom model the way VMAF allows. And Apple benchmarked it against PSNR, not against VMAF, so a like-for-like accuracy comparison with the incumbent is left to you (Fraunhofer FOKUS, 2021). For an Apple-centric shop measuring HDR on real screens, AVQT is a genuine upgrade; for a cross-platform automated pipeline, it is friction.

The next generation: learned metrics for learned codecs

The newest frontier is not a single metric but a problem. As codecs move from block-based math to neural networks that reconstruct frames, the artifacts change shape, and metrics trained on the old artifacts start to mislead. The evidence is concrete: in a 2025 review, a Deep Render neural codec showed a 45% bitrate advantage over SVT-AV1 in subjective testing but only 3% on VMAF — the metric almost entirely missed the win that humans saw (Ozer, Streaming Learning Center, 2025).

Formal studies confirm the pattern. On JPEG AI, a learned image codec, VMAF's rank correlation with human scores fell from 0.928 across all codecs to 0.899 on the AI codec alone, and most metrics were "overly optimistic" about AI-compressed quality (Subjective study, arXiv 2504.06301, 2025). For learned video codecs, a Microsoft team built MLCVQA, a metric trained directly on neural-codec content; it reached a rank-correlation (Kendall's Tau-b 95) of 0.93 at the model level against a retrained VMAF's 0.90 — better, but requiring eight high-end GPUs to train and with no FFmpeg or VQMT integration yet (Majeedi et al., arXiv 2309.00769; Ozer, 2025). There is a twist that ties back to the previous article in this section: VMAF-NEG, the no-enhancement-gain model from VMAF-NEG explained, actually scored better on JPEG AI than standard VMAF — a coincidental alignment, not a validated fit, and not a reason to trust it on neural content yet.

Netflix itself moved the baseline in June 2026 with VMAF v1, which fixed two long-standing v0 weaknesses — a bias toward compression artifacts over downscaling, and blindness to banding — and turned on the NEG enhancement-gain limit by default (Netflix Technology Blog, 2026). The likely shape of the real successor is a "vmaf_ai": the same interpretable, fusion-based framework, retrained on neural-codec outputs with learned features. Until something like that ships in the tools engineers actually run, the practical answer for neural codecs is unchanged — triangulate several metrics, then confirm with a real subjective test, because for this content the eye is still the only ground truth.

A side-by-side, with each metric's blind spot

Put the four families in one table. The columns that matter most are the last two: what each metric actually measures, and where it lies.

Metric	Reference needed	Scale	What it measures	Where it lies / blind spot
VMAF (v0/v1)	Full-reference	0–100	Fused perceptual fidelity vs the source	No original means no score; weak on neural codecs; v0 misses banding
P.1204.3	None (bitstream)	1–5 MOS	Quality inferred from encoder decisions in the bitstream	H.264/HEVC/VP9 only — no AV1; needs bitstream access
AVQT	Full-reference	1–5 MOS	Perceptual quality for a specific screen and distance	macOS-only, closed-source, no custom models
MLCVQA / learned	Full-reference	trained	Quality of neural-codec artifacts specifically	Research-stage; heavy GPU; not in FFmpeg/VQMT

Read it as a set of jobs, not a ranking. There is no "best" row — there is a best metric for your reference availability, platform, codec, and content. The decision tree below turns the table into a path.

[Image: a decision tree routing from reference availability, platform, codec type, and whether you measure a session to VMAF, P.1204, AVQT, or subjective testing.]
Figure 4. Which metric to reach for. The first question is always whether you have the original; the codec and the platform decide the rest.

Do you need them yet?

For most teams in 2026, the honest answer is: VMAF (and VMAF-NEG for comparisons) remains the right default, and you reach for the others only when you hit one of VMAF's three walls. If you have the source and you are comparing block-based codecs, stay on VMAF — it is open, scriptable, cross-platform, and well understood. Add a second metric and you are following the section's standing advice on choosing the right metric.

Reach past VMAF when the situation forces it. If you must measure where the original is not available — monitoring live or delivered streams in the network — P.1204.3 or another no-reference model is not a nicety, it is the only kind of metric that can run there, a case covered in no-reference quality for live and UGC. If you are an Apple-platform shop optimizing HDR delivery and you care how content looks on a specific device, AVQT's viewing-setup and HDR awareness earn their keep. And if you are evaluating a neural codec, no current production metric is trustworthy on its own — budget for subjective testing.
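The guidance above can be written down as a small routing function. This is one reading of the prose, not a reproduction of the figure: the function name, parameter names, and the order of the checks are assumptions made for illustration.

```python
# Codecs the article lists as within P.1204.3's scope.
P1204_3_CODECS = {"h264", "hevc", "vp9"}


def choose_metric(
    has_reference: bool,
    codec: str,
    neural_codec: bool = False,
    apple_viewing_setup: bool = False,
) -> str:
    """Suggest a starting metric for one measurement scenario."""
    if neural_codec:
        # No current production metric is trustworthy alone on neural codecs.
        return "subjective test, with several metrics as supporting evidence"
    if not has_reference:
        # Only no-reference models can run without the original.
        if codec.lower() in P1204_3_CODECS:
            return "P.1204.3"
        return "another no-reference model"
    if apple_viewing_setup:
        # Apple-platform shop that cares about a specific screen, distance, or HDR.
        return "AVQT"
    return "VMAF"


print(choose_metric(has_reference=True, codec="hevc"))
print(choose_metric(has_reference=False, codec="h264"))
print(choose_metric(has_reference=False, codec="av1"))
```

Treat the return values as a first suggestion; the article's standing advice to add a second metric still applies.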

Common mistake: scoring across metric scales as if they were one ruler. VMAF is 0–100; P.1204.3 and AVQT are 1–5 Mean Opinion Scores; PSNR is in decibels. A "4.2" from AVQT and an "88" from VMAF are not comparable, and converting one to the other by rescaling is meaningless — they were trained on different data against different panels. Pick one metric per comparison, state its version and model, and never mix scales in a single table or chart.
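One way to make this mistake hard to commit in a pipeline is to carry the metric and model alongside every number and refuse to subtract across them. The sketch below is an illustration of that idea; the class, field names, and model labels are hypothetical rather than taken from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    metric: str   # e.g. "VMAF", "AVQT", "P.1204.3"
    model: str    # version or model label, recorded with every number
    value: float


def delta(a: Score, b: Score) -> float:
    """Difference between two scores, allowed only on the same metric and model."""
    if (a.metric, a.model) != (b.metric, b.model):
        raise ValueError(
            f"cannot compare {a.metric}/{a.model} with {b.metric}/{b.model}"
        )
    return a.value - b.value


# Same metric and model: the difference is meaningful.
gain = delta(Score("VMAF", "model-A", 93.0), Score("VMAF", "model-A", 88.0))
print(gain)

# Different metrics: the comparison is rejected instead of silently rescaled.
try:
    delta(Score("AVQT", "model-B", 4.2), Score("VMAF", "model-A", 88.0))
except ValueError as err:
    print(err)
```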

Where Fora Soft fits in

Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, OTT, e-learning, telemedicine, and surveillance — and the reason we track metrics beyond VMAF is that our work routinely lands in VMAF's blind spots. Surveillance and live conferencing give us no pristine reference to compare against, which is exactly where a no-reference, bitstream-style model belongs; OTT and telemedicine delivery to specific devices is where viewing-condition-aware scoring matters; and when a client asks about a neural codec, we say plainly that the metric alone is not enough and bring in a subjective test. We record the metric, version, model, and viewing assumptions behind every number in our benchmark methodology, so a result is reproducible rather than flattering.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your ITU-T P.1204 plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.

References

Recommendation ITU-T P.1204, "Video quality assessment of streaming services over reliable transport for resolutions up to 4K," International Telecommunication Union, January 2020. Tier 1 (official standard). The umbrella recommendation defining the P.1204 model family (bitstream P.1204.3, pixel P.1204.4, hybrid P.1204.5), its scope up to 4K/UHD-1 over reliable transport, and the MOS output. https://www.itu.int/rec/T-REC-P.1204
Recommendation ITU-T P.1204.3, "Video quality assessment of streaming services over reliable transport for resolutions up to 4K — bitstream-based model," International Telecommunication Union, January 2020. Tier 1 (official standard). The bitstream-based, no-reference short-term video-quality model: features parsed from the bitstream (QP, motion vectors, frame sizes, transform coefficients), the 1–5 MOS output at segment and per-1-second level, and the H.264/HEVC/VP9 codec scope. https://www.itu.int/rec/T-REC-P.1204.3
Recommendation ITU-T P.1203 (and P.1203.1/.2/.3), "Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport," International Telecommunication Union, 2017. Tier 1 (official standard). The first standardized HTTP-adaptive-streaming session-QoE model: the video (Pv), audio (Pa), and integration (Pq) modules and the stalling/startup/switching factors P.1204 short-term video quality feeds into. https://www.itu.int/rec/T-REC-P.1203
R. R. R. Rao, S. Göring, W. Robitza, B. Feiten, P. List, et al., "Bitstream-based Model Standard for 4K/UHD: ITU-T P.1204.3 — Model Details, Evaluation, Analysis and Open Source Implementation," IEEE QoMEX 2020 (TU Ilmenau Audiovisual Technology Group). Tier 1 (model-author defining work). The model architecture, the open reference implementation and bitstream parser, and the validation including the AVT-VQDB-UHD-1 comparison against PSNR/SSIM/MS-SSIM/VMAF. https://www.researchgate.net/publication/341792225
A. Raake, S. Borer, S. M. Satti, J. Gustafsson, R. R. R. Rao, et al., "Multi-model standard for bitstream-, pixel-based and hybrid video quality assessment of UHD/4K: ITU-T P.1204," IEEE Access, vol. 8, 2020. Tier 1 (model-author defining work). The full P.1204 standard description: the three model types, the P.NATS Phase 2 / VQEG development with 600+ subjects and ~5,000 sequences, and the relative performance of the bitstream, pixel, and hybrid variants. https://ieeexplore.ieee.org/document/9234526
Apple, "Evaluate videos with the Advanced Video Quality Tool (AVQT)," WWDC21 session 10145, June 2021. Tier 3 (first-party tool documentation). AVQT as a macOS full-reference tool: the 1–5 MOS scale, frame and 6-second-segment scores, Metal/AVFoundation acceleration (~175 fps on 1080p), HDR support, the cross-content PSNR-vs-AVQT example, and viewing-setup awareness (display size/resolution/distance, the 1.5H→3H trend). https://developer.apple.com/videos/play/wwdc2021/10145/
Apple, "What's new in AVQT," WWDC22 session 10149, June 2022. Tier 3 (first-party tool documentation). The 2022 AVQT additions: scoring selected time windows, the interactive HTML report, and the Bad/Poor/Fair/Good/Excellent frame-distribution view. https://developer.apple.com/videos/play/wwdc2022/10149/
ITU-T P.1203/P.1204 models and development, Telecommunication-Telemedia-Assessment (TU Ilmenau Audiovisual Technology Group). Tier 3 (model-author project site / reference software). The plain-language description of P.1204.3 (bitstream), P.1204.4 (pixel, uses the reference), P.1204.5 (hybrid no-reference, best in category), the open reference implementations, the bitstream parser, and the AVT-VQDB-UHD-1 database. https://telecommunication-telemedia-assessment.github.io/bitstream_based_models/
C. G. Bampis, Z. Li, K. Swanson, N. Fons Miret, P. Madhusudanarao, "VMAF v1: Good Is Not Good Enough," Netflix Technology Blog, June 2026. Tier 1 (metric-author). The June 2026 VMAF v1 release: fixes for v0's compression-vs-downscaling bias and banding blindness, and NEG enabled by default — the moving baseline this article's "next generation" section references. https://medium.com/netflix-techblog/vmaf-v1-good-is-not-good-enough-60d7e4244ea8
J. Ozer, "When Metrics Mislead: Evaluating AI-Based Video Codecs Beyond VMAF," Streaming Learning Center, May 2025. Tier 6 (educational, orientation). The Deep Render 45%-subjective-vs-3%-VMAF disconnect, the JPEG AI correlation drop, the MLCVQA results and the VMAF-NEG outlier, and the "triangulate then subjective-test" guidance; underlying data attributed to the cited arXiv papers. https://streaminglearningcenter.com/articles/video-quality-metrics-for-ai-codecs.html
M. Majeedi et al., "Full Reference Video Quality Assessment for Machine Learning-Based Video Codecs," arXiv 2309.00769 (Microsoft MLCVQA). Tier 5 (peer-reviewed/institutional). Evidence that PSNR/MS-SSIM/VMAF lose accuracy on ML codecs, and the MLCVQA model (SlowFast + frame-level features) reaching Tau-b 95 of 0.93 vs retrained VMAF's 0.90. https://arxiv.org/pdf/2309.00769
J. Ozer, "Introducing ITU-T Metrics P.1203 and P.1204," Streaming Learning Center, August 2020. Tier 6 (educational, orientation). Accessible summary of the TU Ilmenau technical report, including the P.1204.3 PCC 0.942 vs VMAF 0.873 result on AVT-VQDB-UHD-1 and the bitstream-vs-full-reference distinction; underlying data attributed to TU Ilmenau (Rao et al.; Raake et al.; the Telecommunication-Telemedia-Assessment project site, all listed above). https://streaminglearningcenter.com/blogs/itu-t-p1203-p1204.html

Related glossary terms

VMAF

AVQT

ITU-T P.1204

PSNR

SSIM

ITU-T P.1203

VMAF-NEG

Banding