Quality regression testing re-encodes a fixed set of reference clips on every change to the pipeline and compares the new quality scores against a saved baseline, the golden reference, so that a build that quietly lowers quality is caught before it ships. It is borrowed from software's golden-master and characterization testing, with one twist: a quality score such as VMAF is not exactly repeatable, so you compare against a tolerance band sized to the metric's noise (the model's confidence interval plus the encoder's run-to-run jitter), never against an exact value. One clip is not a suite: the reference set must span the content that fools metrics (grain, dark scenes, high motion, screen text, animation), or the test stays blind to a regression in whatever content it leaves out. The hardest catch is slow drift: a series of builds each losing a fraction of a point, every step inside the band, that adds up to a visible loss and is detected only when each build is measured against a frozen baseline rather than against the previous build. Quality regression testing is the standing suite behind a single quality gate.
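
The tolerance band and the frozen-baseline comparison can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `ClipResult`, `regressions`, and `first_failing_build`, the 1.0-point tolerance, and all scores are hypothetical, and the scores are assumed to have been computed already by whatever metric tool the pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipResult:
    clip: str        # name of the reference clip
    baseline: float  # score stored with the frozen golden reference
    current: float   # score from the build under test


def regressions(results, tolerance):
    """Return the clips whose score fell more than `tolerance` below baseline."""
    return [r.clip for r in results if r.baseline - r.current > tolerance]


def first_failing_build(scores, tolerance, frozen):
    """Return the index of the first build that leaves the band, or None.

    scores[0] is the golden reference. With frozen=True every build is
    compared against scores[0]; with frozen=False each build is compared
    against the build just before it.
    """
    for i in range(1, len(scores)):
        reference = scores[0] if frozen else scores[i - 1]
        if reference - scores[i] > tolerance:
            return i
    return None


# Made-up values: the band width and the per-build scores are assumptions.
TOLERANCE = 1.0
scores = [94.0, 93.6, 93.2, 92.8, 92.4]  # each build loses about 0.4 points

suite = [
    ClipResult("grain", baseline=94.0, current=92.8),
    ClipResult("animation", baseline=96.0, current=95.7),
]

print(regressions(suite, TOLERANCE))
print(first_failing_build(scores, TOLERANCE, frozen=True))
print(first_failing_build(scores, TOLERANCE, frozen=False))
```

In this made-up series every step is smaller than the band, so the comparison against the previous build never fails, while the comparison against the frozen baseline fails once the accumulated loss exceeds the tolerance. In practice the tolerance would likely be set per clip or per content class from measured noise rather than as one global constant.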

