One page on how to build a trustworthy eval rig for a video AI feature. It covers the five parts of the bench (golden set, feature under test, judge, scorecard, gate); why word-overlap metrics (BLEU/ROUGE) and text-only judges fail on video; when to use a VLM-as-judge that reads sampled frames to check grounding; the three scoring-mode decisions (pointwise vs. pairwise, reference-based vs. reference-free, coarse vs. fine scale); the three bias controls (swap-and-average for position bias, a length cap for verbosity bias, a cross-family judge for self-preference bias); and the judge-the-judge calibration step (measure agreement with humans via Cohen's kappa before trusting any automated score). The approach maps to the NIST AI RMF MEASURE function and ISO/IEC 42001.
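Two of the steps named above, swap-and-average and the Cohen's kappa check, are small enough to sketch in plain Python. This is an illustrative sketch only, not the guide's own implementation: the function names, the `judge(first, second)` signature (assumed to return a preference for the first candidate in the range 0 to 1), and the pass/fail labels are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def swap_and_average(judge: Callable[[str, str], float], a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders.

    `judge(first, second)` is a hypothetical callable returning a preference
    for `first` in [0, 1]. Running it in both orders cancels a constant
    bias toward whichever answer is shown first.
    """
    forward = judge(a, b)          # `a` shown first
    backward = 1.0 - judge(b, a)   # `b` shown first, converted to a preference for `a`
    return (forward + backward) / 2.0


# Hypothetical labels for the same eight golden-set items.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
```

In this made-up example the raters agree on 6 of 8 items, but because both say "pass" most of the time, kappa is noticeably lower than the raw 75% agreement. That gap is why the calibration step uses kappa and not plain accuracy.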