Integrating Video Quality Measurement Into CI/CD

Why this matters

A quality gate that only runs on an engineer's laptop is not a gate — it is a good intention. The value of a perceptual quality check appears only when the pipeline runs it automatically, on every change, and produces the same number no matter which machine ran it. This article is for the platform or DevOps engineer, the encoding lead, and the QA engineer who has to take the gate logic from quality gates in CI/CD and actually wire it into GitHub Actions, GitLab CI, or Jenkins so it is reproducible, fast, and cheap. Get the wiring wrong and you get the worst of both worlds: a slow pipeline that fails on noise from an unpinned tool, which the team learns to ignore. Get it right and the measurement disappears into the background, surfacing only when a change really did move the picture.

The wiring problem, stated plainly

This article is a short companion to the gate article. That piece decides what to check: gate on a perceptual metric such as VMAF (Netflix's quality score, 0–100, where higher is closer to the original to a human eye), run an absolute floor plus a regression check, and size the margin from the metric's own noise. This article answers the next question: how do you run that check on every change, get the same answer on every machine, and not bankrupt your build minutes doing it? That is an engineering problem, not a measurement one, and it has five parts: separation, reproducibility, caching, scale, and reporting. The rest of the article takes them in order.

Before any of them, one rule frames the whole design. A metric comparison is only valid apples-to-apples: the same frames, the same resolution, the same reference, the same metric, and the same model version. Every wiring decision below exists to keep those five things constant across a thousand automated runs, so the only thing that ever moves the score is a real change in the encode — which is exactly what the pipeline is there to catch. The gate this wiring runs is one checkpoint inside a broader automated quality-control architecture; for where it sits among ingest validation and pre-publish checks, see automated quality control in a video pipeline.
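
One way to enforce the rule mechanically is to stamp every run with those five fields and refuse to compare runs that disagree. The sketch below is a minimal illustration in Python; the class and field names (`RunStamp`, `frame_count`, `reference_sha256`, and so on) are assumptions made for this example, not a schema from any tool mentioned here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunStamp:
    """The five things that must match before two scores are compared."""
    frame_count: int
    resolution: str          # e.g. "1920x1080"
    reference_sha256: str    # content hash of the reference master
    metric: str              # e.g. "vmaf"
    model_version: str       # e.g. "vmaf_v0.6.1"


def mismatches(a: RunStamp, b: RunStamp) -> list[str]:
    """Return the names of the fields on which the two runs disagree."""
    return [f.name for f in fields(RunStamp) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values for illustration only.
baseline = RunStamp(300, "1920x1080", "ab12", "vmaf", "vmaf_v0.6.1")
candidate = RunStamp(300, "1920x1080", "ab12", "vmaf", "vmaf_v0.6.1")
drifted = RunStamp(299, "1920x1080", "ab12", "vmaf", "vmaf_v0.6.1")

print(mismatches(baseline, candidate))  # []
print(mismatches(baseline, drifted))    # ['frame_count']
```

A non-empty result means the two numbers are not comparable, and the honest outcome is an error about the environment rather than a pass or a fail on quality.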

Step one: separate the encode, the measure, and the decision

The most common wiring mistake is to fuse measurement into the encoder, so the only way to re-measure is to re-encode. Keep the three jobs distinct: a stage that encodes the test clip with the change, a stage that measures the result against a reference, and a stage that decides pass or fail. Each writes its output to a known location and the next reads it.

Netflix learned this at scale and rebuilt its pipeline around it. Its Cosmos platform deliberately decouples video-quality measurement from encoding, precisely so the team can update a metric or roll out a new VMAF version without re-encoding the catalogue (Netflix Technology Blog, Video Quality at Scale with Cosmos Microservices, 2021). The same principle scales down to a small pipeline: if measurement is its own stage reading an encoded file, you can re-score yesterday's encodes under a new model without touching the encoder, and you can run the measurement on a different machine class from the encode.

Figure 1. Wire the pipeline as three separable stages. The encoder produces renditions; a separate measurement stage scores them against a pinned reference and writes JSON; the decision stage reads the JSON and gates. Decoupling measurement from encoding lets you re-score under a new model without re-encoding.

Separation also lets you match the trigger to the cost. A fast subset — one or two representative clips, the top renditions only — runs on every pull request, where it has to be quick. The full ladder across the whole reference set runs nightly or before a release, where minutes matter less. The pipeline is the same three stages; only the breadth of the input changes.
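
The "same stages, different breadth" idea can be sketched as a single function that maps the trigger to the list of inputs. The clip names, rung names, and trigger labels below are hypothetical placeholders; substitute your own reference set and ladder.

```python
# Hypothetical clip and rung names, ordered so the most representative come first.
ALL_CLIPS = ["sports", "animation", "talking_head", "dark_scene"]
ALL_RUNGS = ["2160p", "1080p", "720p", "480p", "360p"]


def measurement_plan(trigger: str) -> list[tuple[str, str]]:
    """Return the (clip, rung) pairs to measure for a given trigger."""
    if trigger == "pull_request":
        clips, rungs = ALL_CLIPS[:2], ALL_RUNGS[:2]   # fast subset: two clips, top rungs
    elif trigger in ("nightly", "release"):
        clips, rungs = ALL_CLIPS, ALL_RUNGS           # full ladder, whole reference set
    else:
        raise ValueError(f"unknown trigger: {trigger!r}")
    return [(clip, rung) for clip in clips for rung in rungs]


print(len(measurement_plan("pull_request")))  # 4
print(len(measurement_plan("nightly")))       # 20
```

The encode, measure, and decide stages never need to know which trigger fired; they only iterate over the plan they are given.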

Step two: containerize the measurement so every runner agrees

This is the heart of CI integration, and the reason a quality check that passed locally fails mysteriously in the pipeline. A VMAF score depends on three things that must not float: the FFmpeg build, the libvmaf library inside it, and the model file. Change any one and the number shifts, silently re-baselining every threshold under you (Netflix/VMAF, models documentation, 2026). On a developer laptop those are whatever Homebrew installed last; on a CI runner they are whatever the base image shipped. Two machines, two FFmpeg versions, two different scores for the same encode — and the gate reads the difference as a regression that does not exist.

The fix is to treat the measurement environment as a versioned, immutable build input: bake a pinned FFmpeg, a pinned libvmaf, and a pinned model into a container image, and run every measurement inside it. A container is a sealed box holding the exact tool versions, so the same box produces the same number on a laptop, a hosted runner, or a render farm. The current stable line is FFmpeg 8.1.2 (released 2026-06-17), and static builds such as BtbN's ship with libvmaf already enabled; pin to a digest, not a moving tag.

# Dockerfile.measure — a sealed, reproducible measurement environment
FROM ubuntu:24.04@sha256:<pinned-digest>

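# Pinned FFmpeg 8.1.2 static build (libvmaf enabled). Pin the URL/version.
# python3-pip is installed explicitly: the base image ships without Python, and
# Ubuntu 24.04 marks the system Python as externally managed, hence the pip flag.
# For full reproducibility, replace the 3.* range with an exact version.
RUN apt-get update && apt-get install -y --no-install-recommends curl xz-utils ca-certificates python3 python3-pip \
 && curl -fL -o /tmp/ffmpeg.tar.xz \
      https://github.com/BtbN/FFmpeg-Builds/releases/download/autobuild-2026-06-18/ffmpeg-n8.1.2-linux64-gpl-8.1.tar.xz \
 && tar xf /tmp/ffmpeg.tar.xz --strip-components=2 -C /usr/local/bin --wildcards '*/bin/ffmpeg' '*/bin/ffprobe' \
 && rm -rf /tmp/ffmpeg.tar.xz /var/lib/apt/lists/* \
 && pip3 install --no-cache-dir --break-system-packages 'ffmpeg-quality-metrics==3.*' \
 && ffmpeg -version | head -1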

# Pin the VMAF model into the image so it can never drift at run time.
COPY models/vmaf_v0.6.1.json /opt/models/vmaf_v0.6.1.json

The measurement command the container runs is the subject of measuring quality with FFmpeg and libvmaf; the general FFmpeg invocation mechanics belong to the Video Encoding section's FFmpeg cheat sheet. This article only insists that whatever command you run, you run it from a pinned image.

One subtlety the container cannot hide for you: the CPU and GPU paths through VMAF do not produce bit-identical scores. The math runs in a slightly different order on a graphics processor, so the same files and the same model can differ by a fraction of a point between the libvmaf (CPU) and libvmaf_cuda (GPU) filters (Mason K, DEV, 2026). Pick one path for the gate and stay on it; mixing them across builds reintroduces the very drift the container removed. Record which path, which FFmpeg version, and which model in the run's metadata, the same way our benchmark methodology stamps provenance on every published number.
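
Recording the environment can be as small as a dictionary written next to the scores. The sketch below assumes the caller passes in the first line of `ffmpeg -version` as a string and labels the VMAF path as `"cpu"` or `"cuda"`; the function name and the key names are illustrative, not part of any tool's output format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_metadata(ffmpeg_banner: str, model_path: Path, vmaf_path: str) -> dict:
    """Describe the measurement environment for one run."""
    if vmaf_path not in ("cpu", "cuda"):
        raise ValueError("vmaf_path must be 'cpu' or 'cuda'")
    return {
        "ffmpeg": ffmpeg_banner.strip(),
        "vmaf_path": vmaf_path,
        "model_file": model_path.name,
        # Hash the model bytes so a silently swapped file is visible in the record.
        "model_sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
    }


# Demonstration with a stand-in model file; a real run would point at the
# model baked into the image.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "vmaf_v0.6.1.json"
    model.write_text("{}")
    meta = run_metadata("ffmpeg version n8.1.2", model, "cpu")
    print(json.dumps(meta, indent=2))
```

Two runs whose metadata dictionaries differ were not the same measurement, which is worth checking before anyone reads the score difference as a regression.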

Figure 2. Containerize the measurement. The FFmpeg build, the libvmaf library, and the model file are the three inputs that move a VMAF score; pin all three in an immutable image so every runner agrees. The CPU and GPU filters differ slightly, so choose one and record it.

Step three: cache the reference and the model, never re-fetch live

The container pins the tools; the inputs to the measurement — the reference master and the model file — need the same treatment, for a different reason. The reference clip (the pristine original VMAF compares against, often called the golden reference; see regression testing and golden references) is frequently large, and the lazy wiring fetches it from a public URL at the start of every run. That turns a content-delivery-network hiccup or a moved file into a red build that has nothing to do with the encode. The same risk applies to a model downloaded from a repository on every run.

Store both as versioned, cached artifacts. Hold the reference master in object storage or Git Large File Storage (Git LFS, an extension that keeps big binaries out of the main repository), and pull it through the CI cache keyed by a content hash so it downloads once and is reused until it actually changes (Mason K, DEV, 2026). The model file is small enough to commit directly or bake into the image, as the Dockerfile above does. Keying the cache by hash, not by name, is what makes the reference immutable: if the file changes, the key changes, the cache misses, and the new baseline is explicit rather than silent.

# Cache the reference master by content hash so it downloads once.
- name: Restore reference master
  uses: actions/cache@v4
  with:
    path: reference/
    key: ref-master-${{ hashFiles('reference/master.sha256') }}

The payoff is that the only network dependency left in the hot path is the container pull, which CI caches by digest as well. A build can then fail for exactly one reason on the measurement side: the encode changed the picture.
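
A restored cache is only trustworthy if the file in it still matches the recorded hash. The following sketch shows one way to verify that before measuring; it assumes the expected SHA-256 is stored as plain hex text (as a file like `reference/master.sha256` might hold), and the function names and the `master.mov` file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-gigabyte master never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_reference(master: Path, expected_sha256: str) -> None:
    """Raise if the restored master does not match the recorded hash."""
    expected = expected_sha256.strip().lower()
    actual = sha256_of(master)
    if actual != expected:
        raise RuntimeError(f"reference mismatch: expected {expected}, got {actual}")


def cache_key(expected_sha256: str) -> str:
    """Key by content, not by name: a new master yields a new key and a cache miss."""
    return f"ref-master-{expected_sha256.strip().lower()}"


# Demonstration with a small stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master.mov"
    master.write_bytes(b"stand-in for a large reference master")
    recorded = sha256_of(master)        # in practice, committed alongside the code
    verify_reference(master, recorded)  # passes silently
    print(cache_key(recorded))
```

Failing loudly on a mismatch turns a corrupted or swapped reference into an explicit infrastructure error instead of a mysterious quality regression.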

Step four: handle scale — fan out, and know when to reach for a GPU

A pull-request check is cheap; a full-catalogue re-measure is not, and conflating the two is how teams end up either skipping measurement or paying for a GPU they did not need. The arithmetic settles it.

Per pull request, the cost is trivial. Take a five-rung ladder of a ten-second 1080p clip at 30 frames per second: each rendition is 300 frames, so 5 × 300 = 1,500 frames to measure. On a CPU compute node, single-stream VMAF runs about 176 frames per second at 1080p (NVIDIA, 2024), so the measurement is 1,500 ÷ 176 ≈ 8.5 seconds of compute. Add the encode and you have a check that finishes inside a minute or two on a standard hosted runner. Run it on every change.

At catalogue scale the picture inverts. Re-scoring 1,000 hours of 4K at 30 frames per second is 108 million frames. A saturated dual-CPU 2U node measures 4K VMAF at about 235 frames per second, so 108,000,000 ÷ 235 ≈ 459,600 seconds ≈ 127.7 hours. A 2U server with eight GPUs reaches about 1,424 frames per second, finishing the same job in 108,000,000 ÷ 1,424 ≈ 75,800 seconds ≈ 21.1 hours. That is roughly a 6× speedup at the same power draw, and on NVIDIA's cost model about $24 versus $97 for the run, near a 75% saving (NVIDIA, 2024). GPU acceleration via the libvmaf_cuda filter has shipped since VMAF 3.0 and FFmpeg 6.1, and production teams report the same order of saving: Snap moved Snapchat Memories quality checks to GPU and re-transcodes only when a score falls short; V-Nova, processing roughly 20,000 encoding jobs a week, measured at least a 2× speedup (NVIDIA, 2024).
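
The arithmetic in the two paragraphs above is easy to rerun with your own numbers. This small sketch reproduces it using the throughput figures quoted in the text; the function names are just for this example, and the throughputs should be replaced with values measured on your own hardware.

```python
def frame_count(hours: float, fps: float) -> float:
    """Total frames in a given duration of video."""
    return hours * 3600 * fps


def measure_seconds(frames: float, throughput_fps: float) -> float:
    """Compute time needed to score a number of frames at a given throughput."""
    return frames / throughput_fps


# Per pull request: five rungs of a ten-second clip at 30 fps,
# scored at the single-stream 1080p CPU figure quoted above.
pr_frames = 5 * 10 * 30
print(pr_frames, round(measure_seconds(pr_frames, 176), 1))   # 1500 8.5

# Catalogue: 1,000 hours of 4K at 30 fps, saturated CPU node vs eight GPUs.
catalogue = frame_count(1000, 30)
cpu_hours = measure_seconds(catalogue, 235) / 3600
gpu_hours = measure_seconds(catalogue, 1424) / 3600
print(round(cpu_hours), round(gpu_hours), round(cpu_hours / gpu_hours, 1))  # 128 21 6.1
```

Swapping in your own clip length, ladder depth, and measured throughput gives a quick estimate of whether a given check belongs on every pull request or on a schedule.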

So the rule is simple: parallelise, and put the work on the right machine. Fan the renditions out across parallel jobs — CI matrix builds do this natively, one job per rung or per clip — so wall-clock time scales with your runner count, not your ladder depth. Keep the fast per-PR check on cheap hosted CPU runners. Move the heavy nightly or pre-publish passes to self-hosted GPU runners (GitLab exposes this with gpus = "all" in the runner config and an NVIDIA CUDA base image; GitHub supports self-hosted GPU runners the same way).
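
On a self-hosted node, the same fan-out can be done inside one job by running several measurements concurrently. The sketch below uses a thread pool, on the assumption that each worker would mostly wait on an external measurement process; `measure_rung` is a hypothetical stand-in that returns a placeholder instead of launching anything.

```python
from concurrent.futures import ThreadPoolExecutor


def measure_rung(rung: str) -> dict:
    """Hypothetical stand-in for one measurement job.

    A real version would launch the pinned measurement command as a
    subprocess and parse its JSON; this one returns a placeholder so the
    fan-out itself can be run anywhere.
    """
    return {"rung": rung, "status": "measured"}


def fan_out(rungs: list[str], max_workers: int) -> list[dict]:
    """Measure every rung concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(measure_rung, rungs))


results = fan_out(["1080p", "720p", "480p"], max_workers=3)
print([r["rung"] for r in results])  # ['1080p', '720p', '480p']
```

In hosted CI the matrix build plays the role of the pool, so this pattern matters mainly for the saturated self-hosted runs, where `max_workers` is the knob to tune against the node's core count.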

Where measurement runs	4K VMAF throughput	Best for	Watch out for
Hosted CPU runner	~64 fps (single stream)	Per-PR check on a small clip subset	Slow and costly on long clips or the full ladder
Self-hosted CPU node (saturated)	~235 fps	Mid-size nightly runs, no GPU available	Needs many parallel processes to saturate; capacity planning
Self-hosted GPU runner (libvmaf_cuda)	~178 fps (single stream) / ~1,424 fps (8 GPUs, saturated)	Full-catalogue re-measure, large nightly sweeps	Scores differ slightly from CPU; never mix paths in one gate

Table 1. Matching the runner to the job. Per-PR checks are seconds of compute and belong on cheap CPU runners; catalogue-scale re-measurement is a GPU-class job. Throughput figures from NVIDIA's L4-vs-dual-Xeon-8480 measurements, 2024 — your hardware will differ, so measure your own.

Figure 3. Scale by fanning out. Each rendition becomes a parallel measurement job, so wall-clock time tracks runner count, not ladder depth. The cheap per-PR subset runs on CPU; the expensive full-catalogue sweep runs on GPU, where 4K VMAF is about 6× faster and 75% cheaper.

Putting it together: the pipeline config

With the four pieces in place, the actual configuration is short. The pattern below triggers only when the encoding path changes, runs the measurement inside the pinned container, and hands the result to the decision stage. It is GitHub Actions; GitLab CI and Jenkins express the same three ideas — trigger, containerized measure step, artifact upload — with different syntax.

# .github/workflows/quality.yml
name: quality-measurement
on:
  pull_request:
    paths: ["encoder/**", "ladder/**"]   # only when the encode can change

jobs:
  measure:
    runs-on: ubuntu-24.04
    container: ghcr.io/your-org/ffmpeg-measure@sha256:<pinned-digest>   # step two
    strategy:
      matrix:
        rung: [1080p, 720p, 480p]        # step four: one job per rendition
    steps:
      - uses: actions/checkout@v4
      - name: Restore reference master   # step three
        uses: actions/cache@v4
        with:
          path: reference/
          key: ref-master-${{ hashFiles('reference/master.sha256') }}
      - name: Encode + measure ${{ matrix.rung }}   # step one: encode, then measure
        run: ./ci/measure.sh ${{ matrix.rung }} > metrics-${{ matrix.rung }}.json
      - name: Upload metrics                         # step five: report
        uses: actions/upload-artifact@v4
        with:
          name: metrics-${{ matrix.rung }}
          path: metrics-${{ matrix.rung }}.json
          retention-days: 30

The decision step — comparing each number to its floor and to the golden-reference baseline, sizing the margin from the metric's noise, and exiting non-zero on a hard failure — is deliberately not shown here, because it is the subject of the quality-gate article and ships as a separate, testable script. Keeping measurement and decision in separate steps means you can change a threshold without touching the measurement, and re-run a decision on stored metrics without re-measuring.

Step five: report so the result lands where the work is

A measurement nobody sees changes nothing. Three outputs make the result actionable. First, write the per-frame scores as a build artifact and keep them for at least 30 days: the aggregate says the encode regressed, but the per-frame curve says it regressed for two seconds starting at frame 1,400, which is what an engineer needs to find the cause (Mason K, DEV, 2026). Turning that curve into a picture a reviewer can read at a glance is the job of visualizing quality with heatmaps and plots, and reading the worst frames rather than the mean is covered in pooling per-frame scores. Second, post the summary back to the pull request — the mean, a low percentile, and the delta from the baseline, per rendition — so the number is visible where the change is reviewed, not buried in a log. Third, emit a machine-readable JUnit XML report so the CI interface renders a green or red row per rendition natively, the same way it shows unit tests.
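
The second and third outputs can be sketched with the standard library alone. The example below computes the per-rendition summary (mean, a low percentile by the nearest-rank method, and delta from the baseline) and renders a JUnit-style report. It assumes the pass or fail verdict arrives from the separate decision stage as a `passed` flag; the function names, dictionary keys, and sample scores are all illustrative.

```python
import math
import xml.etree.ElementTree as ET
from statistics import mean


def summarize(per_frame: list[float], baseline_mean: float, low_pct: float = 5.0) -> dict:
    """Mean, a low percentile (nearest-rank), and the delta from the baseline."""
    ordered = sorted(per_frame)
    rank = max(1, math.ceil(low_pct / 100 * len(ordered)))   # 1-based nearest rank
    average = mean(ordered)
    return {"mean": average, "low": ordered[rank - 1], "delta": average - baseline_mean}


def junit_report(rows: dict[str, dict]) -> str:
    """One <testcase> per rendition; a <failure> child marks a red row."""
    suite = ET.Element("testsuite", name="video-quality", tests=str(len(rows)))
    failures = 0
    for rung, row in rows.items():
        case = ET.SubElement(suite, "testcase", classname="quality", name=rung)
        if not row["passed"]:            # verdict comes from the decision stage
            failures += 1
            message = f"mean {row['mean']:.2f}, low {row['low']:.2f}, delta {row['delta']:+.2f}"
            ET.SubElement(case, "failure", message=message)
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")


# Made-up per-frame scores with one bad frame, against a made-up baseline mean.
summary = summarize([95.0, 94.0, 96.0, 60.0, 95.0], baseline_mean=93.0)
print(summary)  # {'mean': 88.0, 'low': 60.0, 'delta': -5.0}
print(junit_report({"1080p": {**summary, "passed": False}}))
```

The same summary dictionary can feed the pull-request comment, so the number a reviewer sees and the row the CI interface colours come from one source.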

A note on what cannot be measured this way. VMAF is full-reference: it needs the pristine original. A CI gate built on it can only check encodes for which you hold the master, which is most production encoding work. It cannot score a live or user-generated stream that has no reference at the point of measurement — that case needs a no-reference metric and a different pipeline, covered in no-reference quality for live and UGC.

A common mistake: the non-reproducible runner

The wiring failure that wastes the most time is the runner that is not reproducible. A developer measures VMAF 93.4 locally; the CI runner, on a different FFmpeg build with a model fetched fresh from the internet, measures 92.1 for the identical encode; the gate reads a 1.3-point "regression" and blocks a clean change. Nothing regressed — the two machines were running different measurements. The team burns an afternoon chasing an encoder bug that does not exist, then loses trust in the gate. The cure is everything above: pin FFmpeg, libvmaf, and the model into one container; cache the reference by content hash; pick the CPU or GPU path and never mix them; and record the full environment in the run. The second, quieter version of the mistake is running the full-catalogue measurement on every commit — correct, reproducible, and far too slow and expensive, until someone disables it. Match the breadth of the measurement to the trigger: a small subset per PR, the full sweep on a schedule. A gate earns the right to block by being both right and fast.

Where Fora Soft fits in

Fora Soft has built video streaming, OTT, conferencing, e-learning, surveillance, and telemedicine systems since 2005, and on the products with their own encoding pipeline the measurement wiring described here is what keeps a routine code change from quietly degrading the picture in production. We help teams stand up the reproducible measurement stage — a pinned FFmpeg-and-libvmaf container, a content-hashed reference cache, a fan-out across the rendition ladder, and a report that posts back to the pull request — and connect it to the gate logic that decides pass or fail. Where a project needs the rate-quality data behind a specific threshold for a given codec and content type, our measured benchmarks (see our benchmark methodology) supply it. The aim is the same one good unit tests have: a check that is silent when the encode is fine and unambiguous when it is not.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your plan for integrating quality measurement into CI/CD.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.

References

Netflix/VMAF project. "Using VMAF with FFmpeg" (resource/doc/ffmpeg.md). Accessed 2026-06-26. Tier 1 (metric-author primary). The libvmaf and libvmaf_cuda filter syntax, the model files (vmaf_v0.6.1.json and variants), the distorted-then-reference input order, and the scale-to-reference requirement. Basis for the containerized measurement command and the apples-to-apples constraint. https://github.com/Netflix/vmaf/blob/master/resource/doc/ffmpeg.md
Netflix/VMAF project. "VMAF models" (resource/doc/models.md). Accessed 2026-06-26. Tier 1 (metric-author primary). Model selection (default vmaf_v0.6.1, phone, 4K) and the rule that a different model produces different scores. Basis for treating the model file as a pinned build input that silently re-baselines if it floats. https://github.com/Netflix/vmaf/blob/master/resource/doc/models.md
Netflix/VMAF project. "VMAF Confidence Interval" (resource/doc/conf_interval.md), since v1.3.7 (2018). Accessed 2026-06-26. Tier 1 (metric-author primary). The bootstrap models and the score-plus-or-minus-1.96-standard-deviation interval the decision stage uses to size a margin. Basis for delegating the margin maths to the gate article rather than re-deriving it here. https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md
Recommendation ITU-R BT.500-15. "Methodologies for the subjective assessment of the quality of television pictures." International Telecommunication Union, 2023. Tier 1. The subjective-assessment ground truth every objective metric in the pipeline is validated against; the reason the pipeline measures a proxy, not the truth, and a careful viewing overrides it. https://www.itu.int/rec/R-REC-BT.500
Netflix Technology Blog. "Video Quality at Scale with Cosmos Microservices." 2021. Tier 4 (credible deployer). Decoupling video-quality measurement from encoding so metrics and VMAF versions can change without re-encoding the catalogue; chunked, parallel processing across many instances. Basis for the separate-the-stages principle and the fan-out design. https://netflixtechblog.com/netflix-video-quality-at-scale-with-cosmos-microservices-552be631c113
NVIDIA Technical Blog (C. Moluluo, P. Muthana, M. Müller). "Calculating Video Quality Using NVIDIA GPUs and VMAF-CUDA." 2024-03-12. Tier 4 (credible deployer). VMAF-CUDA in VMAF 3.0 and FFmpeg 6.1; the libvmaf_cuda command; single-stream and saturated throughput (64/176 fps CPU, 178/775 fps L4 single; 235/1,034 fps CPU and 1,424/6,200 fps for 8×L4 saturated at 4K/1080p); ~6× speedup, ~75% cost saving ($24 vs $97 per 1,000 hr 4K); Snap and V-Nova deployments. Basis for the scale arithmetic and Table 1. https://developer.nvidia.com/blog/calculating-video-quality-using-nvidia-gpus-and-vmaf-cuda/
Mason K. "Wiring VMAF (and PSNR) into your encoder CI with FFmpeg 8.1 and ffmpeg-quality-metrics." DEV Community, 2026. Tier 6 (expert practitioner). A concrete GitHub Actions gate: trigger on encoder paths, the ffmpeg-quality-metrics JSON, cache the reference in S3/Git LFS, treat the model as a build input, keep per-frame JSON 30 days, the CPU-vs-CUDA score drift, and the frame-count/colour-space gotchas. Basis for the config example and the reporting and reproducibility notes. https://dev.to/masonwritescode/wiring-vmaf-and-psnr-into-your-encoder-ci-with-ffmpeg-81-and-ffmpeg-quality-metrics-1g6i
slhck (W. Robitza). "ffmpeg-quality-metrics." GitHub / PyPI, accessed 2026-06-26. Tier 3 (first-party tooling). The Python wrapper that runs PSNR, SSIM, and VMAF in one FFmpeg pass, auto-scales the distorted file to the reference, and emits per-frame and aggregate JSON/CSV. Basis for the measurement step the container runs and the report stage consumes. https://github.com/slhck/ffmpeg-quality-metrics
FFmpeg project. "Download / Releases" (FFmpeg 8.1.2, released 2026-06-17). Accessed 2026-06-26. Tier 3 (first-party tooling). The current stable FFmpeg line pinned in the container; static GPL builds ship libvmaf enabled. Basis for the version pin in the Dockerfile. https://ffmpeg.org/download.html
GitLab. "Using Graphical Processing Units (GPUs)" (GitLab Runner documentation). Accessed 2026-06-26. Tier 3 (first-party tooling). Configuring a self-hosted runner with gpus = "all" and an NVIDIA CUDA base image to run GPU-accelerated jobs. Basis for the GPU-runner option in the scale section. https://docs.gitlab.com/runner/configuration/gpus/

Related glossary terms

FFmpeg

VMAF

libvmaf

PSNR

Confidence interval

No-reference metric

Perceptual quality

Golden reference