Why this matters
If you ship streaming, you measure two things that rarely sit in the same report: an encode-time picture score (VMAF, SSIM, PSNR) from your quality gate, and a pile of playback metrics (startup, rebuffering, switching) from your analytics stack. Treated separately, they give contradictory advice — the encoding team pushes picture quality up, the delivery team pushes stalls down, and nobody owns the experience the viewer actually had. This article is for the engineer, QoE analyst, or product owner who needs one defensible view of quality that respects both. It explains why the two numbers do not simply add up, and gives you a concrete method — and a calculator — to combine them without fooling yourself.
Two questions, two objects
Start by naming what each side actually measures, because most confusion here comes from treating two different objects as if they were one.
A picture-quality metric answers: how good does this compressed video look compared to the original? The leading example is VMAF — Video Multi-method Assessment Fusion, Netflix's metric that fuses several perceptual features into a single 0–100 score. VMAF is full-reference: it needs the pristine master to compare against, frame by frame. It runs at encode time, on a file, and it scores the encode — the rendition your encoder produced. SSIM (structural similarity, 0–1) and PSNR (peak signal-to-noise ratio, in decibels) are the same kind of object: full-reference, encode-time, picture-only. For the full treatment of each, see VMAF explained and pooling per-frame scores into one number.
Quality of experience, QoE, answers a different question: did the viewer have a good time? It is measured on the device during playback, from the player's own events — when playback started, when it stalled, when it switched rendition — and it is usually expressed as a mean opinion score (MOS) on a 1–5 scale, where 1 is bad and 5 is excellent. QoE needs no reference: the player never had the master and does not compare pixels. It scores the session, not the file. The four delivery families that feed it each have their own article: rebuffering, startup time, bitrate and switching, and the player-side metrics that capture them. QoE and QoS are not the same thing either — see QoE vs QoS.
Think of it as a restaurant. The picture metric is the chef tasting the dish in the kitchen: was it cooked well? QoE is the diner's review of the whole evening: was the food good, and did it arrive on time, and did the waiter keep changing the plate mid-bite? A perfect dish that arrives cold and late earns a bad review. That is why a beautiful picture that buffers loses to an average picture that never does — and why you cannot grade the evening by tasting the sauce.
Figure 1. Two different objects: a picture metric scores the encode (full-reference, encode-time, 0–100); QoE scores the session (no reference, playback-time, MOS 1–5). The bridge between them is the subject of this article.
Why you cannot read QoE off a VMAF number
There are two independent reasons a high VMAF does not promise a good experience.
First, VMAF was never built to see delivery. Netflix is explicit about this: VMAF was developed with compression and scaling artifacts as the two impairments of interest, and as a full-reference metric it does not capture long-term effects such as recency and primacy, nor rebuffering events ("VMAF: The Journey Continues", Netflix Technology Blog, 2018). Its temporal feature is a simple motion measure — the low-pass-filtered difference between adjacent frames — not a model of stalls or switches. So VMAF can return 96 on every rendition while the session is a stutter of freezes; the metric is simply blind to the thing that ruined the experience. Naming that blind spot is the whole discipline of this section, covered in what a metric can and cannot tell you.
Second, the picture the viewer saw is not the picture you scored. Your quality gate scored the top rung of the ladder, the best rendition the encoder produced. But adaptive bitrate streaming (ABR) delivers whatever rendition the network allowed, second by second. If the player spent a third of the session on a lower rung, the viewer's effective picture quality is the time-weighted track of the played renditions, not the top-rung score. The difference is real and measurable. In the worked example below, an encoder made a top-rung rendition at VMAF 96, but because ABR dropped to a VMAF-80 rendition for three of the ten segments, the viewer's time-weighted played VMAF was only 91.2: (7 × 96 + 3 × 80) / 10. That is a gap of 4.8 VMAF the delivery gave away. Reading QoE off the top-rung number would overstate what the viewer actually received.
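The time-weighted played VMAF is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the calculator shipped with this article: the function name `played_vmaf` and the `(duration, vmaf)` input shape are assumptions made for the example, and the sequence is Session A from the worked example further down.

```python
def played_vmaf(segments):
    """Time-weighted mean VMAF over the segments the player actually showed.

    segments: list of (duration_seconds, vmaf) pairs, one per played segment.
    (Hypothetical input shape, chosen for this sketch.)
    """
    segments = list(segments)
    total_time = sum(duration for duration, _ in segments)
    if total_time <= 0:
        raise ValueError("session has no playback time")
    return sum(duration * vmaf for duration, vmaf in segments) / total_time


# Session A from the worked example: ten 10-second segments.
played = [96, 96, 80, 96, 80, 96, 96, 80, 96, 96]
session = [(10.0, v) for v in played]

effective = played_vmaf(session)
top_rung = max(played)
print(f"played VMAF {effective:.1f}, gap to top rung {top_rung - effective:.1f}")
# prints: played VMAF 91.2, gap to top rung 4.8
```

Weighting by duration rather than by segment count matters once segments are not all the same length, for example when a session ends mid-segment.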
The mapping problem: VMAF 0–100 is not MOS 1–5
Before combining anything, kill the most tempting shortcut: rescaling VMAF onto the MOS scale and calling it QoE. VMAF runs 0–100; MOS runs 1–5. It is arithmetically trivial to write MOS = 1 + 4 × VMAF/100 and feel done. Do not trust it. The real relationship between a picture metric and a subjective score is fitted against human ratings and is non-linear — the same one-point VMAF change means more near the middle of the scale than near the top, where roughly 6 VMAF points correspond to one just-noticeable difference (1 JND). A linear rescale invents precision the metric does not have. This is exactly why standardized models do not do it: ITU-T P.1203's video module produces a perceptual quality estimate on a 1–5 scale directly from the bitstream and metadata; it does not consume a VMAF number and stretch it. Whenever you map a picture metric onto an experience scale, treat the mapping as an approximation to be validated against subjective scores — the topic of validating metrics against human scores — and never present the rescaled number as ground truth.
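To see what the linear shortcut quietly assumes, write out the change in score it implies. The two example ranges below are arbitrary illustrations; the 6-point step is the rough JND figure quoted above.

```latex
\mathrm{MOS}_{\text{lin}} = 1 + \frac{4}{100}\,\mathrm{VMAF}
\quad\Longrightarrow\quad
\Delta\mathrm{MOS}_{\text{lin}} = 0.04\,\Delta\mathrm{VMAF}

\text{VMAF } 50 \to 56:\ \Delta\mathrm{MOS}_{\text{lin}} = 0.24
\qquad
\text{VMAF } 90 \to 96:\ \Delta\mathrm{MOS}_{\text{lin}} = 0.24
```

The linear map prices every VMAF point at 0.04 MOS everywhere on the scale, which is exactly what a fitted, non-linear curve contradicts.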
Three ways to connect picture quality to QoE
There is no single blessed formula. There are three families of method, in increasing order of rigor. Pick by how much you need to defend the number.
1. The additive combine — the ABR QoE objective
The simplest honest combine is the objective function that adaptive-streaming research already optimizes (used in the MPC and Pensieve work on ABR). It is a sum over the session's segments:
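$$
\mathrm{QoE} \;=\; \underbrace{\sum_{n} q(R_n)}_{\text{picture}} \;-\; \underbrace{\mu \sum_{n} T_n}_{\text{stalls}} \;-\; \underbrace{\sum_{n} \bigl| q(R_{n+1}) - q(R_n) \bigr|}_{\text{switching}}
$$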
In words: add up the picture quality of every played segment, subtract a penalty for every second of rebuffering, and subtract a penalty for every change in quality between segments. The term q(R_n) is the quality of the rendition shown in segment n; T_n is the rebuffer time charged to that segment; μ is the stall-penalty weight. The bridge to picture metrics is the quality term: instead of using raw bitrate for q, map each played rendition to its measured VMAF, then put VMAF on a quality scale. (We use q = VMAF / 20, a 0–5 point scale, and state plainly that this is an illustrative map, not a validated VMAF-to-MOS curve.) The stall weight μ = 4.3 is carried from the bitrate-switching trade-off article, where one second of stall costs about as much as a whole top-quality segment.
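The objective translates almost directly into code. What follows is a minimal sketch under the assumptions stated above (q = VMAF / 20 as an illustrative map, μ = 4.3); the function name `additive_qoe`, its parameters, and the three-segment input are hypothetical and chosen only for this example.

```python
def additive_qoe(vmaf_per_segment, stall_seconds, mu=4.3, scale=20.0):
    """Additive ABR-style QoE: picture sum - stall penalty - switching penalty.

    vmaf_per_segment: VMAF of the rendition shown in each segment, in play order.
    stall_seconds:    rebuffer time charged to each segment (same length).
    mu:               stall-penalty weight per second of rebuffering.
    scale:            illustrative divisor, q = VMAF / scale (not a validated MOS map).
    """
    if len(vmaf_per_segment) != len(stall_seconds):
        raise ValueError("need one stall value per segment")
    q = [v / scale for v in vmaf_per_segment]
    picture = sum(q)
    stalls = mu * sum(stall_seconds)
    # Sum of absolute quality steps between consecutive segments.
    switching = sum(abs(b - a) for a, b in zip(q, q[1:]))
    return {
        "picture": picture,
        "stalls": stalls,
        "switching": switching,
        "qoe": picture - stalls - switching,
    }


# Three segments: a drop to VMAF 80 in the middle, with a 1-second stall on it.
r = additive_qoe([96, 80, 96], [0.0, 1.0, 0.0])
print(f"picture {r['picture']:.1f}  stalls {r['stalls']:.1f}  "
      f"switching {r['switching']:.1f}  QoE {r['qoe']:.1f}")
# prints: picture 13.6  stalls 4.3  switching 1.6  QoE 7.7
```

Returning the three terms separately, rather than only the total, makes it easy to see which axis cost the session its score.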
Worked example: three strategies, one program
Take the same 100-second program (ten 10-second segments) delivered three ways. Map renditions to quality with q = VMAF/20, so VMAF 96 → 4.8 points, 88 → 4.4, 80 → 4.0, 72 → 3.6.
Session A — chase the picture. An aggressive ladder targets the top rung (VMAF 96), but the network cannot hold it, so ABR drops to a VMAF-80 rendition three times and the stream stalls twice for one second each. Played VMAF sequence: 96, 96, 80, 96, 80, 96, 96, 80, 96, 96.
- Picture sum: seven 96s and three 80s → 4.8 × 7 + 4.0 × 3 = 33.6 + 12.0 = 45.6
- Stall penalty: two 1-second stalls → 4.3 × 2.0 = 8.6
- Switching penalty: six switches, each a 0.8-point step → 0.8 × 6 = 4.8
- QoE = 45.6 − 8.6 − 4.8 = 32.2
Session B — balanced. A conservative ladder holds a stable VMAF-88 rendition for the whole session: no stalls, no switches. Played VMAF: 88 ten times.
- Picture sum: 4.4 × 10 = 44.0
- Stall penalty: 0; switching penalty: 0
- QoE = 44.0
Session C — over-cap the picture. Terrified of stalls, the team pins a low VMAF-72 rendition: perfectly smooth, but needlessly soft. Picture sum 3.6 × 10 = 36.0, no penalties → QoE = 36.0.
Now read the result. Ranked by the picture the viewer actually saw, Session A wins: its time-weighted played VMAF is 91.2, above B's 88.0 and C's 72.0. Ranked by experience, the order flips: B (44.0) beats C (36.0) beats A (32.2). The best-looking session is the worst experience, because two short stalls and six switches cost A more than its picture advantage was worth. And C — the team that "fixed" QoE by capping quality — loses to B because it gave away picture it did not need to. The session that respected both axes wins. That is the synthesis in one table: optimizing either axis alone backfires.
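The flip between the two rankings can be checked mechanically. The sketch below scores all three sessions on both axes; it is an illustration under the article's stated assumptions (q = VMAF / 20, μ = 4.3, equal-length segments), the helper name `score` is hypothetical, and stall time is passed as a session total for brevity.

```python
MU = 4.3      # stall weight per second, as used in the worked example
SCALE = 20.0  # illustrative map q = VMAF / 20, not a validated MOS curve


def score(vmafs, total_stall_seconds):
    """Return (mean played VMAF, additive QoE) for equal-length segments."""
    q = [v / SCALE for v in vmafs]
    switching = sum(abs(b - a) for a, b in zip(q, q[1:]))
    qoe = sum(q) - MU * total_stall_seconds - switching
    return sum(vmafs) / len(vmafs), qoe


sessions = {
    "A": ([96, 96, 80, 96, 80, 96, 96, 80, 96, 96], 2.0),  # chase the picture
    "B": ([88] * 10, 0.0),                                 # balanced
    "C": ([72] * 10, 0.0),                                 # over-capped
}
results = {name: score(*args) for name, args in sessions.items()}

by_picture = sorted(results, key=lambda n: results[n][0], reverse=True)
by_qoe = sorted(results, key=lambda n: results[n][1], reverse=True)

for name, (vmaf, qoe) in results.items():
    print(f"{name}: played VMAF {vmaf:.1f}, QoE {qoe:.1f}")
print("by picture:", by_picture)  # ['A', 'B', 'C']
print("by QoE:    ", by_qoe)      # ['B', 'C', 'A']
```

Changing `MU` is a quick way to test how sensitive the ranking is to the hand-set stall weight.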
Figure 2. The same program, three strategies. Session A has the highest played picture (91.2 VMAF) but the lowest QoE; the balanced Session B wins. Picture-only ranking and QoE ranking disagree.
You can reproduce every number above — and score your own sessions — with the picture-quality-to-QoE calculator shipped with this article; its --demo flag prints this exact example.
2. The standardized model — ITU-T P.1203
When the number has to be citable and comparable across teams, use the standard. ITU-T P.1203 (2017) was the first international standard for the QoE of HTTP adaptive streaming, scoring sessions of 1–5 minutes on a 1–5 MOS scale. It is built from modules. The video module (P.1203.1) and audio module (P.1203.2) each produce a short-term quality estimate, one score per second. The quality integration module (P.1203.3), the part that matters here, folds those per-second scores together with the initial loading delay, the stalling events, and the quality switches, and accounts for temporal effects such as the recency that makes the end of a session weigh more heavily. The output is a single session MOS.
P.1203 runs in four modes (0–3) defined by how much information you can give it. Mode 0 takes only metadata — codec, bitrate, resolution, frame rate, and the client-side stalling events. Higher modes take frame-level and bitstream information for a more accurate per-segment quality. This matters for the synthesis: in the lower modes the picture-quality term is estimated from metadata, not measured from pixels, so a metadata-only QoE score will miss content-dependent quality differences that a real VMAF measurement would catch.
That gap is where modern picture models plug in. ITU-T P.1204.3 (2020) is a bitstream-based short-term video-quality model, validated for H.264, H.265, and VP9 up to 4K, and it is designed to be complementary to P.1203.1 — it can serve as the per-segment quality engine the integration module consumes. There is also a maintained open-source reference implementation of P.1203, with a center-cropped VMAF variant ("Cencro") used during its development, so connecting a VMAF-derived presentation quality to a standardized session score is an established practice, not a thought experiment.
Figure 3. The ITU-T P.1203 integration stack: per-segment video and audio quality enter the Pq integration module, which folds in initial delay, stalling, switching, and recency to produce one session MOS (1–5).
The calculator shipped here includes a transparent, clearly-labeled P.1203-style session MOS so you can feel how the pieces combine — but it is an illustrative approximation, not the standard. The real P.1203 coefficients are normative; for a standardized score, use the reference implementation rather than a homemade reduction.
3. The research frontier — interaction and memory
The additive objective and the standard both largely add the picture term and the delivery penalties. Academic QoE models go further by modeling how the two interact over time. The Streaming QoE Index (SQI; Duanmu, Zeng, Ma, Rehman & Wang, IEEE JSTSP, 2017) combines an instantaneous presentation-quality measure (a full-reference metric like VMAF) with the stalling experience and the interaction between them — a stall that lands right after a quality drop hurts more than the same stall after a clean stretch. SQI was built on a subjective study of compression, initial buffering, and mid-stream stalling, and it outperformed earlier models that used only bitrate and stalling statistics. Its successor, KSQI, adds quality adaptation (switching) under monotonicity constraints. The continuous-time work of Bampis and Bovik (IEEE, 2017–2018) feeds VMAF-style quality, rebuffering, and an explicit memory term into recurrent and dynamic models to predict QoE moment by moment, capturing the recency and primacy effects that a single session average smooths away. You do not need to deploy these models to benefit from them — they tell you which effects a serious QoE number must respect: the interaction between picture and stalls, and the viewer's memory.
Comparing the three approaches
Figure 4. Three ways to connect picture quality to QoE, with what each combines and where each one lies.
| Approach | What it combines | Output | What it measures well | Where it lies / blind spot |
|---|---|---|---|---|
| ABR QoE objective (MPC/Pensieve) | Played-rendition quality (map bitrate→VMAF), rebuffer seconds, switch magnitude | A unitless score (relative) | Fast, transparent, tunable; great for ranking delivery strategies | Coefficients (μ) are hand-set; linear; ignores interaction and memory; not a MOS |
| ITU-T P.1203 | Per-segment video + audio quality, initial delay, stalling, switching, recency | Session MOS 1–5 (standardized) | Citable, comparable, validated; models recency | Picture term is estimated (not VMAF) in low modes; licensed coefficients; 1–5 min sessions |
| SQI / KSQI / continuous (Waterloo, UT Austin) | Presentation quality (VMAF) + stalling + switching + their interaction + memory | Predicted MOS / continuous score | Models interaction and memory; highest correlation with the eye | Heavier to deploy; needs the reference picture metric; research-grade, not a deployed standard |
All three share one input you must get right: the picture term must come from the played renditions, mapped through a measured VMAF, not the top-rung score. And all three must ultimately be validated against a properly run subjective test — the eye is the ground truth, as covered in why subjective testing is ground truth. When a model and a careful viewing disagree, the viewing wins (ITU-R BT.500-15, 2023).
Common mistakes
Reporting VMAF as "the QoE." VMAF scores the encode and is blind to stalls, switches, and startup. A dashboard that shows only VMAF is reporting picture quality, not experience — say so, and pair it with the delivery metrics.
Reading QoE off the top-rung VMAF. The viewer saw the played renditions, not the best one you encoded. Use the time-weighted played VMAF; the gap to the top rung is quality ABR gave away.
Linearly rescaling VMAF 0–100 to MOS 1–5. The real mapping is fitted and non-linear (~6 VMAF ≈ 1 JND). A linear stretch invents precision; treat any mapping as an approximation to validate.
Averaging picture and delivery scores with equal weight. They are not on one scale, and the combine is non-linear and recency-dependent. Use an objective function or a standardized model, not a 50/50 blend.
Optimizing one axis. A fat ladder lifts VMAF and causes rebuffering on real networks; a low cap kills rebuffering and softens the picture. Both backfire — Sessions A and C in the worked example.
Where Fora Soft fits in
Fora Soft has built video streaming, OTT, conferencing, e-learning, and telemedicine systems since 2005, and the question clients ask is rarely "what is our VMAF?" — it is "are viewers having a good experience?" Answering it means measuring the encode (per-rendition VMAF, under the dated, reproducible method in our benchmark methodology) and the session (player telemetry) and reconciling them into one view, exactly as this article describes. For a live conferencing or telemedicine stream there is no master to compare against, so the picture side shifts to no-reference estimation — the subject of no-reference quality for live and UGC. The discipline is the same everywhere: never let one number stand in for the whole experience.
What to read next
- Streaming QoE: the metrics that predict whether a viewer stays
- VMAF explained: Netflix's perceptual metric
- No-reference quality for live and UGC
Call to action
- Talk to a video engineer: book a 30-minute scoping call to talk through your picture-quality vs QoE plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- International Telecommunication Union. Recommendation ITU-T P.1203: Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport (2017). The integration model — per-segment quality, initial delay, stalling, switching, and recency folded into a 1–5 session MOS, in modes 0–3. https://www.itu.int/rec/T-REC-P.1203
- International Telecommunication Union. Recommendation ITU-T P.1203.3: Quality integration module (2017, with Corrigendum). The Pq module that combines audio/video quality with stalling and switching over time. https://www.itu.int/rec/T-REC-P.1203.3
- International Telecommunication Union. Recommendation ITU-T P.1204.3: Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full bitstream information (2020). A bitstream-based short-term video-quality model, complementary to P.1203.1. https://www.itu.int/rec/T-REC-P.1204.3-202001-I
- Netflix Technology Blog. VMAF: The Journey Continues (2018). States VMAF's scope (compression and scaling artifacts) and its limits (no rebuffering, no recency/primacy). https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12
- Netflix, Inc. VMAF — Video Multi-Method Assessment Fusion (reference implementation and documentation, accessed 2026). The full-reference, per-frame picture metric and its fused features. https://github.com/Netflix/vmaf
- International Telecommunication Union. Recommendation ITU-R BT.500-15: Methodologies for the subjective assessment of the quality of television pictures (2023). The subjective ground truth a model or composite QoE score must be validated against. https://www.itu.int/rec/R-REC-BT.500
- Z. Duanmu, K. Zeng, K. Ma, A. Rehman, Z. Wang. A Quality-of-Experience Index for Streaming Video. IEEE Journal of Selected Topics in Signal Processing, 11(1), 2017. Combines presentation quality with stalling and their interaction (SQI). https://ece.uwaterloo.ca/~z70wang/publications/JSTSP_Streaming_QoE.pdf
- C. G. Bampis, Z. Li, A. C. Bovik. Recurrent and Dynamic Models for Predicting Streaming Video Quality of Experience. IEEE Transactions on Image Processing, 2018. Feeds VQA (VMAF), rebuffering, and memory into dynamic models for continuous QoE. https://live.ece.utexas.edu/publications/2018/bampis2018recurrent.pdf
- W. Robitza, S. Göring, A. Raake, et al. HTTP Adaptive Streaming QoE Estimation with ITU-T Rec. P.1203 — Open Databases and Software. ACM MMSys, 2018. The open-source P.1203 implementation and the "Cencro" center-cropped VMAF used in development. https://github.com/itu-p1203/itu-p1203
- X. Yin, A. Jindal, V. Sekar, B. Sinopoli. A Control-Theoretic Approach for Dynamic Adaptive Video Streaming over HTTP (MPC). ACM SIGCOMM, 2015. The additive QoE objective (quality − stall penalty − switching penalty) used in the additive combine (approach 1). https://www.cs.cmu.edu/~junchenj/control.pdf
- F. Dobrian, et al. Understanding the Impact of Video Quality on User Engagement. ACM SIGCOMM, 2011. Rebuffering is the single largest engagement factor across content types. https://www.cs.cmu.edu/~hzhang/papers/sigcomm2011_QualityEngagement.pdf
- S. S. Krishnan, R. K. Sitaraman. Video Stream Quality Impacts Viewer Behavior: Inferring Causality Using Quasi-Experimental Designs. ACM IMC, 2012. Causal evidence that startup delay and rebuffering reduce engagement. https://people.cs.umass.edu/~ramesh/Site/PUBLICATIONS_files/imc234-krishnan.pdf


