Shot (or scene) detection finds the cuts in a video, the points where one continuous shot ends and the next begins, so each shot can be encoded and measured on its own. It rests on a simple assumption: frames within a single shot share spatio-temporal characteristics, similar motion and detail, which makes a shot the natural unit to optimize. In a per-shot encoding pipeline, detection supplies the boundaries the optimizer needs to build a convex hull per shot and choose the best resolution and quality setting for each. It also serves measurement directly: scoring per shot exposes the worst stretches that a single mean VMAF over a whole title quietly absorbs, the same reason teams read a low percentile rather than the average. The catch is that detection is imperfect, missed or false cuts misalign the units, and on a full-reference metric a shot boundary that does not match between source and encode can corrupt the comparison, so the cuts must be consistent on both sides.

