A golden reference is a fixed, version-controlled clip together with the quality scores a known-good build produced on it, saved as the baseline a regression test compares against. Borrowed from software golden-master (characterization) testing, it does not measure whether an encode is good, only whether a new build is unchanged from the trusted one. In a pipeline, every change re-encodes the reference clips, scores them, and fails if the new scores drop beyond a tolerance band sized to the metric's noise rather than an exact line. The reference source must have headroom (a good encode landing high on the scale) so a drop is visible, and the source clip, the VMAF model file, and the tool versions must all be pinned. The key catch is to always compare against a frozen baseline, not the last build, or slow drift, fractions of a point each build, escapes the band until it sums to a visible loss. Re-baseline only by deliberate human approval, never on a red build.

