Why this matters
You can ship a perfect encode and still lose the viewer. A high VMAF score proves the compressed picture looks close to the source, but it says nothing about whether the stream started fast, played without stalling, or held its resolution — and those are the things a viewer feels first. This article is for the streaming lead, the player engineer, the product owner, and the QA engineer who own retention and need to know which numbers actually predict whether someone keeps watching. It is the framing article for Block 6: it names the metrics, shows what the engagement research found, and points you to the deep dive on each. Read it before the others so you know why these particular numbers earn a place on your dashboard.
The picture is only half of quality
For most of this section, "quality" has meant the picture: how close a compressed frame looks to its source, scored by a metric like PSNR, SSIM, or VMAF. That is the right question at encode time, where you hold the original and can measure the loss frame by frame. But the viewer never sees your encode in a lab. They see it streamed — started on a tap, buffered over a network, switched up and down in resolution by an adaptive player, and sometimes frozen mid-scene while the buffer refills. All of that happens after the encoder is done, and none of it shows up in a picture metric.
Quality of experience, or QoE, is the name for that fuller picture: the total quality the viewer actually perceives, end to end. It is deliberately distinct from quality of service, or QoS, which is what the network and pipeline deliver: bitrate, packet loss, latency, throughput. The distinction matters because a network can hit every QoS target and still produce a miserable QoE: the viewer does not experience megabits, they experience a spinner. QoE is measured at the viewer's end of the wire, from events the player reports, and it is the layer that decides whether your audience stays.
Here is the claim this block is built on, stated plainly: an average-looking picture that plays instantly and never stalls beats a gorgeous picture that takes six seconds to start and rebuffers twice. The picture metric and the experience metric can point in opposite directions, and when they do, the experience wins the viewer. That is why a streaming team measures both, and why this block exists alongside the picture-quality blocks.
Figure 1. Picture quality is necessary but not sufficient. A picture metric scores the encode; QoE adds the delivery experience — startup, stalls, bitrate, switching, and failures — and only the two together predict whether a viewer stays.
The four metrics that move engagement
Player telemetry can report dozens of fields, but four families of metric do almost all the work of predicting whether a viewer stays. Each one is described below in plain terms before it is named, the way you would explain it to a colleague who has never run a streaming analytics dashboard.
The time between the moment a viewer presses play and the moment the first frame appears, called startup time (also video start time, or join time), is the first impression. It decides whether the session ever really begins. Measuring it, and the trade-off between a fast start and a sharp first frame, is the subject of startup time and time-to-first-frame.
The fraction of a session the viewer spends staring at a spinner instead of moving video, called the rebuffering ratio (total stall time divided by total session time), is the stall metric. It is the one most strongly tied to people leaving. What it is, how to measure it, and the documented link to abandonment are the subject of rebuffering ratio and the cost of a stall.
The picture detail the viewer actually received — the delivered bitrate the adaptive player settled on, and how often it switched that bitrate up or down — is the steadiness metric. A higher average bitrate usually looks better, but frequent switching is its own visible artifact. The balancing act an adaptive-bitrate (ABR) player performs between these is the subject of bitrate, switching, and the ABR quality trade-off.
The sessions that fail outright — playback that never starts (a video start failure), the viewer who leaves during startup (an exit before video start), or a fatal error mid-play (a video playback failure) — are the failure metrics. They are the bluntest signal of all, because a failed session is a viewer who got nothing. Where all of these numbers come from — the player instrumentation and the analytics stack — is the subject of player-side quality metrics.
These names are not ad hoc. The industry standardized them so that "startup time" or "rebuffering ratio" means the same thing across players and analytics vendors: the recommended practice CTA-2066 defines the streaming QoE events, properties, and metrics — including video start time, video start failure, exits before video start, and rebuffering ratio — and how each should be computed (CTA-2066, 2020). The companion CTA-5004 (Common Media Client Data, CMCD) standardizes the fields a player attaches to each request — encoded bitrate, buffer length, measured throughput, content and session IDs — so the same data reaches any backend (CTA-5004, 2020; the CMCDv2 revision CTA-5004-A followed in 2026). Use the standard definitions; otherwise two dashboards that both say "rebuffering ratio" are quietly measuring different things.
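As an illustration of what "fields a player attaches to each request" can look like, here is a minimal sketch that builds a CMCD query argument from the five fields named above. The key names (`br`, `bl`, `mtp`, `cid`, `sid`) and the rounding to the nearest 100 are taken from CMCD version 1 as we understand it; the function name, parameter names, and sample values are hypothetical, and the specification itself remains the authority on encoding details.

```python
from urllib.parse import quote


def cmcd_query(br_kbps: int, buffer_ms: int, throughput_kbps: int,
               content_id: str, session_id: str) -> str:
    """Build a CMCD query argument from a handful of CMCD v1 keys (sketch)."""
    pairs = {
        "bl": round(buffer_ms / 100) * 100,        # buffer length in ms, nearest 100
        "br": br_kbps,                             # encoded bitrate in kbps
        "cid": f'"{content_id}"',                  # content ID, a quoted string
        "mtp": round(throughput_kbps / 100) * 100, # measured throughput in kbps, nearest 100
        "sid": f'"{session_id}"',                  # session ID, a quoted string
    }
    # Keys are emitted in alphabetical order, then the whole payload is URL-encoded.
    payload = ",".join(f"{key}={value}" for key, value in sorted(pairs.items()))
    return "CMCD=" + quote(payload, safe="")


# Hypothetical values for one segment request.
example = cmcd_query(3200, 21340, 25430, "movie-42", "abc-123")
print(example)
```

A player would append this argument to the segment URL, so the CDN log line for that request carries the client's view of the session alongside the server's.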
Figure 2. A single session, read as QoE. The timeline shows the startup delay before the first frame, a mid-play stall (rebuffering), and two bitrate switches; the four metric families are derived from these events, not from the pixels.
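To make "derived from these events" concrete, the sketch below reduces one session's event timeline to the metric families. The event names and the `Event` type are hypothetical, not the CTA-2066 event vocabulary, and the rebuffering ratio uses the simple definition given in this article (total stall time divided by total session time); check the standard for the exact computation rules before comparing against a vendor dashboard.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float   # seconds since the viewer pressed play
    kind: str  # hypothetical names: first_frame, stall_start, stall_end,
               # bitrate_change, session_end


def session_metrics(events: list[Event]) -> dict:
    """Reduce one session's ordered events to startup, stall, and switching metrics."""
    startup_s = None
    stall_s = 0.0
    stall_began = None
    switches = 0
    for e in events:
        if e.kind == "first_frame" and startup_s is None:
            startup_s = e.t
        elif e.kind == "stall_start":
            stall_began = e.t
        elif e.kind == "stall_end" and stall_began is not None:
            stall_s += e.t - stall_began
            stall_began = None
        elif e.kind == "bitrate_change":
            switches += 1
    # Session length is taken as the timestamp of the last event.
    # A stall still open at session end is not counted in this sketch.
    session_s = events[-1].t if events else 0.0
    return {
        "startup_s": startup_s,
        "reached_first_frame": startup_s is not None,
        "rebuffer_ratio": stall_s / session_s if session_s > 0 else 0.0,
        "bitrate_switches": switches,
    }


# A session shaped like Figure 2: startup delay, one stall, two switches.
timeline = [
    Event(0.0, "play_request"),
    Event(1.8, "first_frame"),
    Event(30.0, "bitrate_change"),
    Event(60.0, "stall_start"),
    Event(66.0, "stall_end"),
    Event(70.0, "bitrate_change"),
    Event(300.0, "session_end"),
]
metrics = session_metrics(timeline)
```

For this timeline the startup time is 1.8 seconds, the six-second stall over a 300-second session gives a rebuffering ratio of 2%, and the switch count is two. A session with no `first_frame` event is flagged so it can be counted toward the failure metrics.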
What the engagement research actually found
The reason these four metrics belong on a dashboard is not intuition — it is measured, and in one case it is causal. Two studies anchor the field.
The first, from Conviva and collaborators across a large dataset of short video-on-demand, long video-on-demand, and live content, asked which quality metric most affects engagement. Its headline finding: the rebuffering ratio has the largest impact on engagement across all content types (Dobrian et al., SIGCOMM 2011). The magnitude depends on content — live is the most sensitive — and the paper gives a concrete figure: for a 90-minute live event, a 1% increase in buffering ratio reduced engagement by more than three minutes of viewing (Dobrian et al., 2011). It also found that average bitrate matters more for live than for on-demand content, which is why a live product cannot simply copy a VoD playbook.
The second study went further and established cause, not just correlation, using a quasi-experimental design on a large Akamai dataset (Krishnan & Sitaraman, IMC 2012). Three of its findings are worth committing to memory. Viewers begin to abandon a video once startup passes about two seconds, and each additional second of startup delay raised the abandonment rate by roughly 5.8 percentage points in the range they measured. A viewer who experienced a rebuffer delay equal to just 1% of the video's duration watched about 5% less of it than a comparable viewer with no rebuffering. And a viewer who hit a failure was 2.32% less likely to return to the same site within a week, so the damage outlasts the session. The study also found that tolerance varies: viewers of short clips abandon faster than viewers of long content, and viewers on better connections are less patient, not more.
Put those two studies together and the priority order falls out on its own. Stalls hurt most, the start is the moment you can lose people before they have watched anything, and a failure poisons the next visit too. None of these is visible in a picture metric.
Figure 3. Startup delay versus abandonment, after Krishnan & Sitaraman (2012). Abandonment stays low until roughly two seconds, then rises by about 5.8 percentage points for each additional second of delay in the measured range — the cost of a slow start, before a single frame is seen.
The arithmetic of a slow start
Work the startup finding through to a number a product owner can feel. Suppose your player takes six seconds to show the first frame on a cold start. That is four seconds beyond the roughly two-second point where viewers begin to leave. At about 5.8 percentage points of added abandonment per extra second, the linear approximation is:
extra_abandonment = 5.8 percentage points/second x (6s - 2s)
= 5.8 x 4
= 23.2 percentage points
On a day with 1,000,000 play attempts, that is roughly:
lost_sessions = 1,000,000 x 0.232
= 232,000 sessions abandoned at the door
compared with a two-second start. The number is an approximation — the 5.8-points-per-second slope holds in the range Krishnan and Sitaraman measured, not forever, and your audience and content differ — but the order of magnitude is the point. Shaving four seconds off startup is not a cosmetic tweak; it is the difference between starting a few hundred thousand sessions and losing them.
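The same linear approximation can be written as a small function so you can plug in your own startup time and traffic. This is a sketch under the assumptions stated above: the two-second threshold and the 5.8-points-per-second slope are the Krishnan and Sitaraman figures, the function names are ours, and the line is only meaningful inside the range the study measured.

```python
def extra_abandonment_points(startup_s: float,
                             threshold_s: float = 2.0,
                             slope_pp_per_s: float = 5.8) -> float:
    """Added abandonment, in percentage points, relative to a start at the threshold."""
    return max(0.0, startup_s - threshold_s) * slope_pp_per_s


def lost_sessions(play_attempts: int, startup_s: float) -> int:
    """Sessions abandoned at the door compared with a two-second start."""
    return round(play_attempts * extra_abandonment_points(startup_s) / 100)


points = extra_abandonment_points(6.0)   # six-second cold start
lost = lost_sessions(1_000_000, 6.0)     # one million play attempts
```

With the numbers from the worked example, `points` comes to about 23.2 and `lost` to 232,000. A start at or below two seconds returns zero, because the approximation only models the delay beyond the threshold.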
Now the rebuffering side. Take a 40-minute program, which is 2,400 seconds. A rebuffering ratio of 1% means the viewer spent 24 seconds staring at a spinner. The Krishnan–Sitaraman result says that viewer watches about 5% less of the program:
watch_time_lost = 5% x 40 minutes
= 2 minutes of viewing lost per affected viewer
Two minutes per viewer, multiplied across an audience and a catalogue, is a large amount of engagement — and ad impressions, and subscriber goodwill — erased by 24 seconds of stall. This is why the rebuffering ratio earns its own full article (the cost of a stall) and why "never stall" usually beats "look slightly sharper" as an optimization target.
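The rebuffering arithmetic reduces to two one-line functions. This sketch only restates the single data point from the study (a rebuffer delay of 1% of duration costs about 5% of watch time); it deliberately does not extrapolate to other ratios, and the function names are hypothetical.

```python
def stall_seconds(duration_s: float, rebuffer_ratio: float) -> float:
    """Seconds spent stalled for a given program length and rebuffering ratio."""
    return duration_s * rebuffer_ratio


def watch_minutes_lost(duration_min: float, watch_loss_fraction: float = 0.05) -> float:
    """Viewing minutes lost per affected viewer at the study's ~5% watch-time loss."""
    return duration_min * watch_loss_fraction


stalled = stall_seconds(40 * 60, 0.01)   # 40-minute program at a 1% ratio
lost_min = watch_minutes_lost(40)
```

For the 40-minute program this gives roughly 24 seconds of stall and about 2 minutes of lost viewing per affected viewer, matching the figures above.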
Why no single number is QoE
It is tempting to crown one metric — rebuffering, say — and optimize it alone. The research warns against it. The metrics interact, and the relationship between any one of them and the experience is non-linear and content-dependent: predicting QoE well requires combining them, because a model built on a single metric leaves most of the variation unexplained (Balachandran et al., SIGCOMM 2013). The clearest example of interaction is the one an adaptive-bitrate player fights every second: to avoid a stall it can drop to a lower bitrate, trading picture detail for smoothness; to raise the bitrate it risks outrunning the network and stalling. Startup time trades off the same way — start at a low bitrate and playback begins sooner but looks soft; wait for a higher bitrate and the first frame is sharp but later. Optimizing one metric in isolation simply pushes the damage into another.
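The trade-off can be seen in even the simplest bitrate rule. The sketch below is a toy throughput-plus-buffer rule, not the algorithm of any real player: the ladder, the safety factor, and the low-buffer threshold are all hypothetical values chosen for illustration.

```python
LADDER_KBPS = [400, 1200, 2500, 5000]  # hypothetical bitrate ladder, lowest first


def choose_bitrate(throughput_kbps: float, buffer_s: float,
                   safety: float = 0.8, low_buffer_s: float = 5.0) -> int:
    """Pick a rung: protect against stalls first, then take the best picture that fits."""
    if buffer_s < low_buffer_s:
        # Buffer nearly empty: give up picture detail to avoid a stall.
        return LADDER_KBPS[0]
    budget = throughput_kbps * safety  # leave headroom below measured throughput
    fitting = [rung for rung in LADDER_KBPS if rung <= budget]
    return fitting[-1] if fitting else LADDER_KBPS[0]


healthy = choose_bitrate(4000, buffer_s=20)  # 2500: highest rung under the 3200 budget
draining = choose_bitrate(4000, buffer_s=2)  # 400: same network, but the buffer is low
```

Raising `safety` or lowering `low_buffer_s` improves delivered bitrate and increases the risk of a stall; moving them the other way does the reverse. Tuning either constant against a single metric shows up as damage in the other.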
This is also where the picture metrics from earlier blocks rejoin the story. A delivered-picture score like VMAF tells you how good the received resolution looks; the QoE metrics tell you whether the viewer got to see it without stalling. Neither is complete alone. Combining them — picture quality weighted against startup, stalls, and switching — into one view of experience is hard enough to deserve its own treatment in connecting objective picture metrics to perceived QoE. The parametric model ITU-T P.1203 does exactly this kind of fusion: it predicts a 1-to-5 mean opinion score for an adaptive-streaming session from the bitstream and metadata, folding in stalls and quality switching as well as the picture (ITU-T Rec. P.1203, 2017). The takeaway for this framing article is simpler: hold the four metrics together, because the viewer experiences them together.
The QoE metrics at a glance
The table below names each metric, what it measures, its unit, and — in the spirit of this section — where it can mislead you. Every QoE metric has a blind spot, and a dashboard that ignores them tells a falsely calm story.
| Metric | What it measures | Unit | Where it lies / blind spot |
|---|---|---|---|
| Startup time | Delay from press-play to first frame | seconds | A fast start at a low bitrate hides a soft first impression; averages hide the slow tail |
| Rebuffering ratio | Share of the session spent stalled | % of session time | A low ratio can still mean one painful stall at the worst moment; mean over a session hides timing |
| Delivered bitrate | The bitrate/resolution the viewer received | kbps / resolution | A higher bitrate does not look better on every kind of content; says nothing about stalls |
| Bitrate switching | How often, and by how much, the bitrate or resolution changes | switches per session | Too few switches may mean the player stalled instead of adapting; too many is its own visible artifact |
| Video start failure | Playback that never began | % of attempts | Blunt — counts the failure but not the near-misses that almost failed |
| Exits before video start | Viewers who left during startup | % of attempts | Mixes impatience with genuine failure; needs startup-time context to read |
Table 1. The streaming QoE metric families, defined per CTA-2066 (2020). The deep dive on each lives in its own Block 6 article; this table is the map.
Figure 4. The same QoE metrics as a visual reference — what each measures, and where it can mislead you.
A common mistake: optimizing the picture and ignoring the wire
The most expensive mistake a quality program makes is to measure picture quality beautifully and the experience not at all. A team pours effort into raising VMAF two points on the encode, ships it, and watches retention stay flat — because the bottleneck was a six-second startup and a 1.5% rebuffering ratio that no picture metric was ever going to reveal. The encode was never the problem.
Four related traps follow from the same root. Reporting a global average hides the localized cliff: a calm fleet-wide rebuffering ratio can conceal one CDN, device, or region where it has spiked — the same segmentation discipline that monitoring quality in production insists on. Treating one metric as QoE ignores the interactions above. Confusing QoS with QoE — assuming healthy network numbers mean a happy viewer — skips the actual experience. And confusing encode quality with delivered quality measures the file you produced instead of the stream the viewer received. Each trap substitutes a number you can hit for the number that matters, and the viewer pays the difference.
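The first trap is easy to demonstrate with made-up numbers. In the sketch below, the session records, field names, and CDN labels are all hypothetical; the point is only that the fleet-wide ratio and the per-segment ratio are computed from the same data and tell different stories.

```python
from collections import defaultdict


def rebuffer_ratio_by(sessions: list[dict], key: str) -> dict:
    """Rebuffering ratio per segment: summed stall time over summed session time."""
    totals = defaultdict(lambda: [0.0, 0.0])  # segment -> [stall_s, session_s]
    for s in sessions:
        totals[s[key]][0] += s["stall_s"]
        totals[s[key]][1] += s["session_s"]
    return {segment: stall / total for segment, (stall, total) in totals.items()}


# Nine healthy sessions on one CDN, one badly stalled session on another.
sessions = [{"cdn": "cdn-a", "stall_s": 2.0, "session_s": 1000.0} for _ in range(9)]
sessions.append({"cdn": "cdn-b", "stall_s": 60.0, "session_s": 1000.0})

fleet = sum(s["stall_s"] for s in sessions) / sum(s["session_s"] for s in sessions)
per_cdn = rebuffer_ratio_by(sessions, "cdn")
```

Here the fleet-wide ratio is 0.78%, which looks calm, while `cdn-b` sits at 6%. The same function works for any dimension present in the records, such as device or region.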
Where Fora Soft fits in
Fora Soft has built video streaming, OTT, conferencing, e-learning, telemedicine, and surveillance products since 2005, and in every one of them the question that decides success is the same: did the viewer stay? We help teams instrument the four metric families this article names — startup time, rebuffering, bitrate and switching, and failures — using the standardized player events (CTA-2066) and client data (CMCD) so the numbers mean the same thing across the stack. For live products such as conferencing and live OTT, where startup and stalls dominate the experience and there is no pristine reference to compare against, that telemetry is the primary quality signal from day one. The goal is not a prettier dashboard; it is catching the slow start or the stall that costs you the audience, before the audience leaves.
What to read next
- Rebuffering Ratio and the Cost of a Stall — the single most important QoE metric, in full.
- Startup Time and Time-to-First-Frame — why the first few seconds decide retention.
- Connecting Objective Picture Metrics to Perceived QoE — fusing VMAF with the delivery metrics into one view.
Call to action
- Talk to a video engineer: book a 30-minute scoping call to talk through your streaming QoE metrics plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- F. Dobrian, V. Sekar, A. Awan, I. Stoica, D. Joseph, A. Ganjam, J. Zhan, H. Zhang. "Understanding the Impact of Video Quality on User Engagement." ACM SIGCOMM, 2011. Tier 5 (peer-reviewed, large industry dataset — Conviva). Across short VoD, long VoD, and live content, the buffering ratio has the largest impact on engagement of all quality metrics; for a 90-minute live event a 1% increase in buffering ratio cut engagement by more than three minutes; average bitrate matters more for live than VoD. Basis for the "rebuffering hurts most" and live-sensitivity claims. https://www.cs.cmu.edu/~hzhang/papers/sigcomm2011_QualityEngagement.pdf
- S. S. Krishnan, R. K. Sitaraman. "Video Stream Quality Impacts Viewer Behavior: Inferring Causality Using Quasi-Experimental Designs." ACM IMC, 2012. Tier 5 (peer-reviewed, causal study, large Akamai dataset). Viewers abandon past ~2s of startup delay, with each additional second adding ~5.8 percentage points of abandonment; a rebuffer delay of 1% of duration cut watch time ~5%; a failed visit cut the chance of returning within a week by 2.32%. Basis for the startup-abandonment and rebuffering-watch-time arithmetic. https://people.cs.umass.edu/~ramesh/Site/HOME_files/imc208-krishnan.pdf
- A. Balachandran, V. Sekar, A. Akella, S. Seshan, I. Stoica, H. Zhang. "Developing a Predictive Model of Quality of Experience for Internet Video." ACM SIGCOMM, 2013. Tier 5 (peer-reviewed). QoE metrics are interdependent and relate to experience non-linearly; predicting QoE well requires combining metrics rather than relying on any single one. Basis for the "no single number is QoE" section. https://dl.acm.org/doi/10.1145/2486001.2486025
- CTA-2066. "Streaming Quality of Experience Events, Properties and Metrics." Consumer Technology Association, 2020. Tier 1 (standard). Defines streaming QoE events and metrics — video start time, video start failure, exits before video start, rebuffering ratio — and how each is computed, for consistent cross-vendor reporting. Basis for the metric definitions and Table 1. https://shop.cta.tech/products/streaming-quality-of-experience-events-properties-and-metrics-cta-2066
- CTA-5004. "Web Application Video Ecosystem — Common Media Client Data (CMCD)." Consumer Technology Association, 2020 (CMCDv2 / CTA-5004-A, 2026). Tier 1 (standard). The common key-value vocabulary a player attaches to each media request (encoded bitrate, buffer length, measured throughput, content/session IDs). Basis for the standardized-telemetry point. https://cdn.cta.tech/cta/media/media/resources/standards/pdfs/cta-5004-final.pdf
- Recommendation ITU-T P.1203. "Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport." International Telecommunication Union, 2017. Tier 1 (official standard). Predicts a 1–5 MOS for an HTTP-adaptive-streaming session from bitstream and metadata, integrating stalls and quality switching as well as the picture. Basis for the QoE-fusion point. https://www.itu.int/rec/T-REC-P.1203
- R. K. Sitaraman. "Network Performance: Does It Really Matter to Users and By How Much?" COMSNETS, 2013. Tier 5 (peer-reviewed). Synthesizes the Akamai viewer-behavior results — startup delay, rebuffering, and failure effects on abandonment and repeat viewership — into the business case for measuring delivery quality. Basis for the engagement-and-retention framing. https://people.cs.umass.edu/~ramesh/Site/HOME_files/comsnets13.pdf
- Conviva. "OTT 101: Your Guide to Streaming Metrics that Matter." Conviva, accessed 2026-06-25. Tier 4 (vendor, credible deployer at scale). The operational QoE metric taxonomy as deployed — rebuffering across dimensions, video start failure (VSF), video playback failure (VPF), and the link between a streaming performance index and time watched. Basis for the failure-metric definitions and the current industry framing. https://www.conviva.ai/resource/ott-101-your-guide-to-streaming-metrics-that-matter/
- Recommendation ITU-R BT.500-15. "Methodologies for the subjective assessment of the quality of television pictures." International Telecommunication Union, 2023. Tier 1 (official standard). The subjective ground truth every objective and QoE model is ultimately validated against; when a model and a careful viewing disagree, the viewing wins. Basis for the measurement-honest framing. https://www.itu.int/rec/R-REC-BT.500


