Published 2026-06-03 · 26 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If your product has an AI feature that writes, describes, decides, or generates — auto-summaries of recordings, captions, a moderation call, a generated b-roll clip, an answer about a video — you cannot tell whether a change made it better or worse by eye, because there is no single right answer to diff against. Without a rig that scores quality automatically and repeatably, every model swap and prompt tweak is a guess, and "it looked fine in the demo" becomes your release criterion. This lesson is written for the product manager, founder, or engineering lead who has to decide whether a video AI feature is good enough to ship and good enough to keep shipping after the next model update. You do not have to build the rig yourself, but you do need to know what a trustworthy one contains, what makes a judge reliable or biased, and which questions expose a vendor who skipped the work. It is the deep, video-specific build that the AgentOps overview only sketched — the evaluation pillar, taken apart and reassembled for moving pictures.

What An Eval Rig Is — And Why "Just Check The Output" Fails

Start with the thing being tested. A video AI feature is any part of your product where a model turns video, or a question about video, into a result a person reads or watches: a one-paragraph summary of a two-hour webinar, a caption track, a "is this clip safe to publish" decision, an answer to "when does the speaker mention pricing," or a freshly generated five-second shot. These are the features Fora Soft's clients ask for most, and they share one awkward property — the output is open-ended. There is no single correct paragraph that summarizes a webinar. There are thousands of good ones and thousands of bad ones, and most of the bad ones look, at a glance, exactly like the good ones.

That property breaks the way software is normally tested. Ordinary code is checked with an assertion — a line that says "the result must equal this exact value," like 2 + 2 == 4. If the values match, the test passes; if not, it fails. This works because ordinary functions are deterministic and their answers are exact. A summary feature has neither quality: the model is non-deterministic (the same recording can produce a different, equally valid paragraph each run), and the answer is a judgment call, not a value. Asserting that the summary equals one reference paragraph would fail every good summary that happened to choose different words. The equals sign is the wrong tool.
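The mismatch between an equality assertion and an open-ended output can be shown in a few lines. This is a minimal sketch; the reference text and the two "runs" are hypothetical strings invented for illustration, not output from any real feature.

```python
def matches_reference(output: str, reference: str) -> bool:
    """Exact-match check, the way an ordinary unit test asserts equality."""
    return output.strip() == reference.strip()


# Hypothetical reference and two hypothetical runs of the same summary feature.
reference = "The speaker covered pricing, onboarding, and the Q3 roadmap."
run_1 = "The speaker covered pricing, onboarding, and the Q3 roadmap."
run_2 = "Pricing, onboarding, and the Q3 roadmap were the three topics covered."

print(matches_reference(run_1, reference))  # True: identical wording
print(matches_reference(run_2, reference))  # False: same content, different words
```

The second run says the same thing as the reference and still fails, which is the behavior the paragraph above describes.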

An eval rig is the tool that replaces it. The word "rig" is borrowed from hardware testing, where a test rig is a bench of fixed equipment you strap a part into to measure how it performs under known conditions. A software eval rig is the same idea for an AI feature: a fixed set of test cases, a way to run the feature against all of them, a grader that scores each output, and a report you can compare run to run. Strap the feature in, pull the lever, read the dial. Change the model or the prompt, pull the lever again, and watch whether the dial moved up or down. Without the rig, you have opinions; with it, you have a number that means the same thing every week.

For most of the history of automated language evaluation, the grader on that bench was a metric: a formula that compares the model's output to a human-written reference answer and returns a similarity score. The famous ones for language are BLEU and ROUGE (you can ignore the acronyms; both essentially count how many words and short phrases the output shares with the reference) and, for newer systems, BERTScore (which compares meanings using embeddings rather than exact words). These were the workhorses of machine-translation and summarization testing for two decades. The trouble is what they measure: overlap with one specific reference. And overlap is exactly the wrong thing to measure for open-ended video language.

Why The Old Metrics Go Blind On Video AI

Picture a clip of a cyclist. Your captioning feature outputs: "A man in a red jacket rides a bicycle past a fountain." The human-written reference says: "A cyclist in a crimson coat pedals by a water feature." These two sentences mean the same thing. A person would call the caption correct. But a word-overlap metric sees almost no shared words — "a" and little else — so it scores the caption near zero. The metric has punished a correct answer for the crime of using synonyms.

Now run the math the way the metric does, so the failure is concrete. Word-overlap scores are roughly the fraction of the output's words that also appear in the reference. Set aside the articles and "in," and the caption has seven words left: man, red, jacket, rides, bicycle, past, fountain. The number that also appear verbatim in the reference is zero ("bicycle" and "cyclist" do not even match). That is a precision of 0 ÷ 7:

shared content words ÷ output content words = 0 ÷ 7 = 0

A score of zero on a scale where 1.0 is perfect, for a caption a human would wave through. Counting the shared "a" and "in" would lift the number only by crediting words that carry no meaning. This is not a contrived edge case; it is the normal case for open-ended outputs, which is why the research literature is blunt that traditional metrics like BLEU, ROUGE, and CIDEr are unreliable for open-ended video answers, where multiple valid answers exist and responses often carry extra detail or chain-of-thought the reference never had. The metric is measuring the wrong thing, confidently.
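The overlap arithmetic can be reproduced with a short script. This is a simplified sketch of clipped word precision, not the full BLEU or ROUGE algorithm; the four-word stopword list and the regex tokenizer are assumptions made for illustration. Counted strictly with that list, the cyclist caption and its reference share no content words at all.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list; real metrics tokenize differently.
STOPWORDS = {"a", "an", "the", "in"}


def content_words(text: str) -> list[str]:
    """Lowercase, split on letters, and drop the stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def overlap_precision(output: str, reference: str) -> float:
    """Fraction of the output's content words that also appear in the reference."""
    out_counts = Counter(content_words(output))
    ref_counts = Counter(content_words(reference))
    # Clip each word's credit to the number of times the reference uses it.
    shared = sum(min(n, ref_counts[w]) for w, n in out_counts.items())
    total = sum(out_counts.values())
    return shared / total if total else 0.0


caption = "A man in a red jacket rides a bicycle past a fountain."
reference = "A cyclist in a crimson coat pedals by a water feature."

print(content_words(caption))                 # seven content words
print(overlap_precision(caption, reference))  # 0.0
```

A caption that copied the reference word for word would score 1.0 under the same function, which is exactly the behavior that rewards wording over meaning.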

There is a second, worse failure that word-overlap metrics cannot even attempt: grounding. Suppose the summary of a surveillance clip reads, fluently, "A delivery driver left a package at 3:42 and drove away." It is well-written. It shares plenty of words with a plausible reference. And it is completely invented — the clip shows no delivery at all. A text-only metric comparing strings has no way to know, because it never looked at the video. Catching that requires a grader that can both judge language and watch the footage. That is the gap LLM-as-judge, and specifically VLM-as-judge, exists to fill.

A side-by-side diagram contrasting two graders scoring the same correct video caption: on the left, a word-overlap metric like BLEU or ROUGE assigns a near-zero score because the caption uses synonyms of the reference; on the right, an LLM or VLM judge assigns a 5 of 5 because it reads the meaning and watches the footage, with a callout that only the judge can catch an invented detail that never appears in the video.
Figure 1. The same correct caption, two graders. A word-overlap metric punishes synonyms and is blind to whether the description matches the video; an LLM/VLM judge reads meaning and checks the footage.

LLM-As-Judge — The Grader That Reads For Meaning

LLM-as-judge is the practice of using a capable language model as the grader on your eval bench. Instead of comparing the feature's output to a reference word by word, you hand a strong model — the "judge" — a written rubric (the criteria for a good answer, in plain language), the original task, the feature's output, and optionally a reference answer for guidance. The judge returns a score and a short written critique explaining the score. You are replacing a brittle formula with a reader that understands what "good" means.

The idea earned its credibility with a 2023 study that pitted strong LLM judges against humans on thousands of model answers. Its headline result, from the MT-Bench and Chatbot Arena work by Zheng and colleagues, is the number that made the whole field move: a GPT-4 judge agreed with human preferences more than 80% of the time — which is the same rate at which two humans agree with each other. In other words, the judge disagreed with people about as often as people disagree among themselves. That is the bar that turned LLM-as-judge from a curiosity into the default way teams score open-ended AI outputs.

How does the judge avoid being vague? The strongest pattern is to make it think before it scores. A method called G-Eval (Liu and colleagues, 2023) does exactly this: it asks the judge to first write out the evaluation steps for the rubric — what to check, in what order — and only then fill in a score, a technique borrowed from "chain-of-thought" prompting where a model reasons step by step. On summarization, G-Eval reached the highest agreement with human ratings of any automatic method at the time (a Spearman correlation of about 0.51, where older metrics sat far lower). The lesson for your rig is concrete: a judge that explains its reasoning before committing to a number is both more accurate and more debuggable than one that blurts a score, because you can read why it graded the way it did and catch it when it is wrong.

Two practical knobs shape every judge. The first is the rubric: vague rubrics ("rate the quality 1–10") produce noisy, unrepeatable scores, while specific ones ("does the summary mention every agenda item; does it invent anything not in the transcript; is it under 120 words") produce stable ones. The second is the scale: coarse scales (a 1–5 rating, or a yes/no pass) agree with humans far better than fine ones (1–100), because neither humans nor models can meaningfully tell a 73 from a 76. The rule of thumb that falls out of the research: write the rubric like a checklist, and score on the smallest scale that still distinguishes good from bad.
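The two knobs, a checklist rubric and a coarse scale, can be sketched as a prompt builder plus a strict parser for the judge's reply. Everything here is an assumption for illustration: the function names, the JSON reply shape, and the prompt wording are hypothetical, and the actual model call is left out because it depends on whichever provider you use. The reply shown is a canned string, not real model output.

```python
import json

# Checklist-style rubric, taken from the examples in the paragraph above.
RUBRIC = [
    "Mentions every agenda item in the transcript",
    "Invents nothing that is not in the transcript",
    "Is under 120 words",
]


def build_judge_prompt(task: str, output: str, rubric: list[str]) -> str:
    """Assemble a prompt that asks for reasoning first and a 1-5 score second."""
    checks = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    return (
        "You are grading an AI feature's output.\n"
        f"Task: {task}\n"
        f"Output to grade:\n{output}\n"
        f"Checklist:\n{checks}\n"
        "First write your reasoning for each checklist item, then give a score.\n"
        'Reply as JSON: {"reasoning": "<text>", "score": <integer 1-5>}'
    )


def parse_verdict(raw_reply: str) -> dict:
    """Reject any reply whose score falls outside the coarse 1-5 scale."""
    verdict = json.loads(raw_reply)
    score = verdict["score"]
    if type(score) is not int or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return {"reasoning": str(verdict["reasoning"]), "score": score}


prompt = build_judge_prompt("Summarize the webinar", "The webinar covered pricing.", RUBRIC)
verdict = parse_verdict('{"reasoning": "Misses two agenda items.", "score": 2}')
print(verdict["score"])  # 2
```

Keeping the written reasoning alongside the score is what makes a bad grade debuggable later.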

VLM-As-Judge — The Judge Has To Watch The Video

Here is where video evaluation parts ways from text evaluation, and where most generic eval advice quietly fails. A plain LLM judge reads only text. If you ask it to grade a summary of a video, it can check whether the summary is well-written, internally consistent, and on-topic — but it cannot check whether the summary is true to the footage, because it never saw the footage. It is grading an essay about a film it did not watch. For the grounding failures from earlier — the invented delivery, the hallucinated speaker — a text-only judge is useless.

The fix is a VLM-as-judge: a judge built on a vision-language model (VLM), the kind of model covered in the video VLMs lesson, which can take images or video frames as input alongside text. You give this judge three things — the rubric, the feature's output, and a sample of frames from the actual video — and now it can do what a text judge cannot: check the description against the pixels. Did the summary's "red jacket" appear on screen? Was there a delivery at 3:42, or not? The judge watches, then grades. Grounding becomes checkable.
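The "three things" handed to a VLM judge can be sketched as a frame sampler plus a request payload. This is only an illustration under stated assumptions: evenly spaced sampling is one simple strategy among several, the payload shape and field names are hypothetical rather than any vendor's API, and the frame file names are placeholders.

```python
def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Pick n_frames timestamps, one at the midpoint of each equal segment."""
    if duration_s <= 0 or n_frames <= 0:
        raise ValueError("duration_s and n_frames must be positive")
    step = duration_s / n_frames
    return [round(step * (i + 0.5), 3) for i in range(n_frames)]


def build_vlm_judge_request(rubric: str, output: str,
                            frames: list[tuple[float, str]]) -> dict:
    """Bundle rubric, feature output, and frames (hypothetical payload shape)."""
    return {
        "rubric": rubric,
        "output": output,
        "frames": [{"timestamp_s": t, "image_ref": ref} for t, ref in frames],
    }


timestamps = sample_timestamps(60.0, 4)
print(timestamps)  # [7.5, 22.5, 37.5, 52.5]

# Placeholder frame references; a real rig would extract and attach the images.
frames = [(t, f"frame_{i}.jpg") for i, t in enumerate(timestamps)]
request = build_vlm_judge_request(
    rubric="Does every claim in the summary appear in the frames?",
    output="A man in a red jacket rides a bicycle past a fountain.",
    frames=frames,
)
print(len(request["frames"]))  # 4
```

With four frames across a 60-second clip, anything that happens in the 15 seconds between two samples is invisible to the judge, so the frame count is itself a parameter worth recording next to every score.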

This is an active research frontier, not a solved problem, and your rig should treat it with appropriate care. A 2025 system called VideoJudge built small (3-billion- and 7-billion-parameter) MLLM judges specialized to grade video-understanding outputs, and reported that its 7B judge correlated more strongly with human ratings than far larger general models — beating 32B and 72B baselines on three of four meta-evaluation benchmarks. The encouraging signal is that a purpose-built video judge can outperform a giant general one and cost a fraction as much to run. The cautionary signal, from companion work asking pointedly whether a video language model is a reliable judge at all, is that VLM judges inherit every bias of text judges plus new ones around how many frames they saw and what they missed between samples. A VLM judge is the right tool for video grounding; it is not an oracle, and the rest of this lesson is about keeping it honest.

A diagram comparing a text-only LLM judge and a VLM judge grading the same video summary: the LLM judge receives only the rubric and the text output and cannot verify grounding, so a fluent but invented detail passes, while the VLM judge additionally receives sampled video frames, checks the claim against the pixels, and catches the hallucination.
Figure 2. For video, the judge must watch. A text-only judge can grade fluency but not truth; a VLM judge reads sampled frames and catches claims the footage never supports.

The Anatomy Of An Eval Rig

With both kinds of judge on the table, assemble the bench. A working video eval rig has five parts, and each one fails in its own way if you skip it.

The first part is the golden set — a curated collection of representative test cases with known-good outcomes, the way a teacher keeps a bank of exam questions with an answer key. For a summarization feature, each case is a real recording plus a human-approved summary (or a clear rubric for what a good one must contain). The golden set is the single most valuable asset in the whole rig, because it is the fixed ground the dial is measured against. A good one is small enough to run often (dozens to a few hundred cases, not thousands) and deliberately stocked with the hard, weird, and adversarial cases that break things — the silent video, the two-speaker overlap, the clip that is mostly a blank slide.

The second part is the feature under test — your actual model and prompt, run against every case in the golden set to produce outputs. The third part is the judge — the LLM or VLM grader with its rubric, scoring each output. The fourth is the scorecard — the aggregated result, not just an average but a breakdown: which cases passed, which failed, and how reliably, so you see where quality lives and dies rather than a single blurred number. The fifth is the gate — the rule that turns the scorecard into a decision: block the release if the average drops, if any critical case regresses, or if the hallucination rate climbs above a threshold.
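The scorecard and gate can be sketched as a small function that turns per-case results into a ship-or-block decision. The class name, field names, case IDs, scores, and the 5% hallucination threshold are all hypothetical values chosen for illustration; the three blocking rules are the ones listed in the paragraph above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    case_id: str
    score: int               # judge score on a 1-5 scale
    hallucinated: bool       # judge flagged a claim the footage does not support
    critical: bool = False   # a case that must never regress


def gate(current: list[CaseResult], baseline: dict[str, int],
         max_hallucination_rate: float = 0.05) -> list[str]:
    """Return the reasons to block a release; an empty list means ship."""
    reasons = []
    current_avg = mean(r.score for r in current)
    baseline_avg = mean(baseline[r.case_id] for r in current)
    if current_avg < baseline_avg:
        reasons.append(f"average dropped from {baseline_avg:.2f} to {current_avg:.2f}")
    regressed = [r.case_id for r in current
                 if r.critical and r.score < baseline[r.case_id]]
    if regressed:
        reasons.append(f"critical cases regressed: {', '.join(regressed)}")
    rate = sum(r.hallucinated for r in current) / len(current)
    if rate > max_hallucination_rate:
        reasons.append(
            f"hallucination rate {rate:.0%} exceeds {max_hallucination_rate:.0%}")
    return reasons


# Hypothetical scores from the previous release and from the candidate build.
baseline = {"silent-video": 4, "two-speakers": 4, "blank-slide": 3}
current = [
    CaseResult("silent-video", 4, False, critical=True),
    CaseResult("two-speakers", 5, False),
    CaseResult("blank-slide", 2, True),
]
reasons = gate(current, baseline)
print(reasons or "ship")  # ['hallucination rate 33% exceeds 5%']
```

In this run the average is unchanged, so a gate that watched only the average would have shipped the build; the per-case breakdown is what surfaces the hallucination.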

These five parts run in two different places, and the difference matters. Offline evaluation runs the rig against the fixed golden set during development and in the release pipeline — it is the controlled quality gate before anything ships, the AI equivalent of running your tests before merging code. Online evaluation runs the judge against a sample of real production traffic after launch, catching the drift and the strange real-world inputs that no golden set anticipated. The two feed each other: every interesting failure caught online gets added to the offline golden set, so the rig gets smarter every week. This loop — production failures becoming permanent test cases — is what separates a rig that decays from one that compounds.

A left-to-right pipeline diagram of an eval rig: a golden set of video test cases feeds the AI feature under test, whose outputs flow to a judge model carrying a rubric, producing a scorecard that drives a release gate, with a lower loop showing online evaluation sampling live production traffic and feeding new failure cases back into the golden set.
Figure 3. The five parts of an eval rig. Golden set, feature under test, judge, scorecard, gate, run offline as a release gate and online as a sampler whose failures flow back into the golden set.

Scoring Modes — Pointwise, Pairwise, And Reference-Free

A judge can grade in more than one shape, and choosing the right shape is half of building a rig that gives stable numbers. There are three decisions.

The first decision is pointwise versus pairwise. A pointwise judge looks at one output and gives it a score against the rubric — "this summary: 4 of 5." A pairwise judge looks at two outputs of the same task and says which is better — "summary A beats summary B." Pairwise grading is usually the more reliable of the two, because "which of these is better" is an easier, steadier question than "what number is this worth," and the research on multimodal judges bears this out: vision-language models show strong, human-like agreement on pair comparisons but wobble noticeably when asked for absolute scores or to rank a batch. The trade-off is cost and use: pairwise is ideal for choosing between two candidate models or prompts (an A/B decision), while pointwise is what you need to track a single feature's quality over time on a fixed scale. Many rigs use both — pairwise to pick a new model, pointwise to monitor it after.

The second decision is reference-based versus reference-free. A reference-based judge is given a human-written gold answer to compare against; a reference-free judge grades the output against the rubric and, for video, the footage alone, with no gold answer at all. Reference-based grading is more accurate but expensive, because someone has to write the gold answers; reference-free grading scales to production, where no gold answer exists for live traffic, at some cost in precision. The practical pattern: reference-based offline on the golden set where you have written answers, reference-free online where you do not.

The third decision is the scale and rubric shape, covered earlier — coarse scales and checklist-style rubrics beat fine scales and vague ones. Put the three decisions in one place:

Decision | Option A | Option B | Use which, when
Grading shape | Pointwise (score one output) | Pairwise (A vs B) | Pointwise to track quality over time; pairwise to choose a model or prompt
Reference | Reference-based (gold answer) | Reference-free (rubric + footage only) | Reference-based offline on the golden set; reference-free online in production
Scale | Coarse (pass/fail or 1–5) | Fine (1–100) | Always coarse: fine scales add noise, not precision

Judging The Judge — The Step Teams Skip

Here is the question that separates a rig you can trust from one that quietly lies to you: how do you know the judge is any good? A biased or careless judge does not announce itself. It returns confident scores that happen to be wrong, and if you gate releases on those scores, you ship regressions while the dashboard stays green. So before a judge is allowed to grade anything that matters, it has to pass its own exam. This step is called meta-evaluation, and skipping it is the most common reason an eval rig gives false comfort.

The exam is simple in principle. Take a few dozen cases from your golden set and have humans score them carefully: this is your calibration set with trusted, human-assigned grades. Now run the judge on the same cases and measure how often it agrees with the humans. The standard agreement measures are Cohen's kappa (a score that reaches 1 for perfect agreement, sits at 0 when agreement is no better than random chance, and goes negative when it is worse than chance) and Spearman correlation (how well the judge's ranking of outputs matches the humans'). If the judge agrees with your human graders at a high rate, you can trust it to stand in for them at scale. If it does not, you fix the rubric or change the judge model before trusting a single automated score. A judge that has not been checked against humans is a rumor, not a measurement.
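Spearman correlation is simple enough to compute without a statistics library. The sketch below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks; the five human and judge scores are hypothetical numbers chosen so the result is easy to check by hand.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (sx * sy)


# Hypothetical calibration scores for five summaries.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 3, 5, 4]
print(round(spearman(human_scores, judge_scores), 2))  # 0.8
```

The judge here swaps two neighboring pairs but keeps the overall ordering, which is why the correlation stays high even though only one of its five scores matches the human score exactly.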

The reason this check is non-negotiable is that LLM and VLM judges carry well-documented biases — systematic errors that survive even in frontier models. Three matter most. Position bias: when grading two outputs, judges favor whichever they see first — the MT-Bench study found a GPT-4 judge flipped its verdict on a large share of comparisons simply when the order was swapped. Verbosity bias: judges tend to reward longer, more elaborate answers even when a short one is better. Self-preference bias (also called self-enhancement): a judge tends to favor outputs written by itself or its own model family. None of these is hypothetical; all are measured, repeatable effects.

The good news is that each has a cheap defense, and a competent rig applies all three. Against position bias: grade every pairwise comparison twice, with the order swapped, and only count it as a win if the same answer wins both ways — the survey literature is explicit that randomizing or swapping order is the standard fix. Against verbosity bias: put a length limit in the rubric and tell the judge to ignore length. Against self-preference bias: use a judge from a different model family than the one that generated the output, so the grader is not marking its own homework. Bias you have neutralized is harmless; bias you have ignored is a silent thumb on the scale.
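The position-bias defense can be sketched as a wrapper that asks the judge twice with the order swapped and counts a win only when both verdicts agree. The `Judge` callable signature and the two stand-in judges are hypothetical; a real rig would put a model call behind the same interface.

```python
from typing import Callable

# Assumed interface: the judge sees two outputs and answers "first" or "second".
Judge = Callable[[str, str], str]


def debiased_winner(judge: Judge, a: str, b: str) -> str:
    """Return "A" or "B" only if the same output wins in both orders, else "tie"."""
    a_wins_forward = judge(a, b) == "first"     # A shown first
    a_wins_backward = judge(b, a) == "second"   # B shown first
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"


def always_first(first: str, second: str) -> str:
    """A stand-in judge with pure position bias."""
    return "first"


def prefers_shorter(first: str, second: str) -> str:
    """A stand-in judge whose verdict depends on content, not position."""
    return "first" if len(first) <= len(second) else "second"


summary_a = "Pricing was discussed at 12:40."
summary_b = "The speaker, at some length, eventually discussed pricing around 12:40."

print(debiased_winner(always_first, summary_a, summary_b))     # tie
print(debiased_winner(prefers_shorter, summary_a, summary_b))  # A
```

The position-biased judge's two verdicts cancel out into a tie, so it can no longer hand out wins based on order alone. The cost is two judge calls per comparison instead of one.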

A diagram of the judge-the-judge meta-evaluation loop and bias controls: a small human-labeled calibration set is compared against the judge's scores using Cohen's kappa and Spearman correlation to confirm agreement before trusting the judge, alongside three bias controls: swap-and-average to neutralize position bias, a length cap in the rubric for verbosity bias, and a cross-family judge for self-preference bias.
Figure 4. Before you trust a judge, test it. Measure its agreement with human graders (Cohen's kappa, Spearman), and neutralize the three standing biases (position, verbosity, self-preference) with swap-and-average, a length cap, and a cross-family grader.

A Worked Example — Is This Judge Trustworthy?

Make the meta-evaluation concrete, because the arithmetic is the whole point. Suppose you are building an auto-summary feature for recorded webinars and you want to use a VLM judge to grade it. You pull recordings from your golden set, have two product specialists score each summary "acceptable" or "not acceptable," and keep the 50 cases where both humans agreed: your trusted labels. Now you run the candidate judge on the same 50.

The judge matches the human verdict on 43 of the 50 cases and disagrees on 7. Raw agreement is the simple part:

43 agreements ÷ 50 cases = 0.86 = 86% raw agreement

But raw agreement flatters any judge, because some of those matches happen by luck: if 70% of summaries are "acceptable," a judge that says "acceptable" to everything is right 70% of the time while knowing nothing. Cohen's kappa corrects for that luck by taking (observed agreement − chance agreement) ÷ (1 − chance agreement). With this label balance, an 86% raw agreement works out to a kappa of roughly 0.68 (the exact figure shifts by a few hundredths with how the judge's own verdicts split), which, on the usual reading of kappa, means "substantial" agreement: good enough to trust as a stand-in for humans, while still worth improving. Had the kappa come back at 0.2, you would know the 86% was mostly luck and the judge was near useless, no matter how confident its scores looked. The number you act on is the chance-corrected one, not the raw percentage.
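The kappa figure can be checked with a few lines of arithmetic. The text gives 43 agreements, 7 disagreements, and a 70% "acceptable" rate among the human labels, but not how the 7 disagreements split, so the split used below (5 cases the humans accepted and the judge rejected, 2 the other way round) is an assumption made to complete the example.

```python
def cohens_kappa(both_yes: int, human_only_yes: int,
                 judge_only_yes: int, both_no: int) -> float:
    """Cohen's kappa for two raters and two labels, from the 2x2 counts."""
    n = both_yes + human_only_yes + judge_only_yes + both_no
    observed = (both_yes + both_no) / n
    human_yes = (both_yes + human_only_yes) / n
    judge_yes = (both_yes + judge_only_yes) / n
    # Agreement expected if both raters labeled at random with these rates.
    expected = human_yes * judge_yes + (1 - human_yes) * (1 - judge_yes)
    return (observed - expected) / (1 - expected)


# Assumed split: humans said "acceptable" on 35 of 50; judge matched 43 of 50.
kappa = cohens_kappa(both_yes=30, human_only_yes=5, judge_only_yes=2, both_no=13)
print(round(kappa, 2))  # 0.68

# A judge that answers "acceptable" to everything: 70% raw agreement, no skill.
lazy = cohens_kappa(both_yes=35, human_only_yes=0, judge_only_yes=15, both_no=0)
print(round(lazy, 2))  # 0.0
```

The second call is the "acceptable to everything" judge from the paragraph above: its 70% raw agreement collapses to a kappa of zero once chance is removed.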

Common mistake. Trusting a judge's scores because they are precise and consistent, without ever checking them against humans. A judge can be perfectly repeatable and perfectly wrong — it will hand you the same biased score every time. Precision is not accuracy. Until you have measured the judge's agreement with human graders on a labeled set (and corrected for chance with kappa), every number it produces is decoration. Calibrate first, gate second.

The Tooling Landscape In 2026

You do not have to build the bench from raw parts. By 2026 there is a mature layer of eval tooling, and the names cluster by job. For writing and running the tests, DeepEval is the closest thing to a unit-testing framework for AI outputs — it works like pytest, the standard Python testing tool, and ships dozens of ready-made metrics including LLM-as-judge graders. Promptfoo leans toward adversarial and red-team testing driven by a simple config file, useful for trying to break a feature before users do. Ragas specializes in evaluating retrieval-augmented systems, which matters if your video feature answers questions by searching an archive (the pattern from the video RAG lesson). For the dashboards, experiment tracking, and production monitoring that turn evals into a team practice, LangSmith, Braintrust, and Arize Phoenix carry the scorecards, the human-annotation queues, and the release gates.

The pattern most experienced teams converge on is two tools, not one: a lightweight framework that runs in the release pipeline as a gate (DeepEval, Promptfoo, or Ragas) paired with a platform that tracks regressions and hosts human review (Braintrust, LangSmith, or Arize). The first answers "should this build ship"; the second answers "is the feature drifting in production, and where." For video work, the missing piece in most of these is the VLM judge — the grounding check against frames — which today you usually wire in yourself by calling a vision-capable model as the grader inside whichever framework you chose. That integration gap is exactly where careful engineering pays off, and where a generic eval setup quietly stops covering video.

Where The Standards Sit

Evaluation is also where AI governance stops being abstract. The NIST AI Risk Management Framework, the US standards body's reference for trustworthy AI, places exactly this work in its MEASURE function — the part of the framework devoted to TEVV, short for Test, Evaluation, Verification, and Validation. In NIST's structure, measuring an AI system's trustworthiness and monitoring it continuously after deployment is not optional polish; it is a named obligation of a responsible deployment. An eval rig is, in plain terms, how an organization does the MEASURE function for a video AI feature. The ISO/IEC 42001 standard, the international management-system standard for AI published in 2023, sets the same expectation from the governance side: a deployed AI system must be evaluated and monitored against defined criteria, with the evidence written down. Neither document tells you which judge model to use. Both tell you that "we tested it once and it seemed fine" is not a defensible answer, and that the rig and its results are the artifacts an auditor will ask to see — a thread this section picks up in the EU AI Act and disclosure lesson.

What This Costs To Run

A judge is itself a model call, so an eval rig has a running cost, and the cost lands in two places. Offline, the cost is bounded and small: a golden set of 200 cases graded by a judge on every release is 200 model calls per run — pennies to a few dollars, trivial next to the cost of shipping a regression. Online, the cost scales with traffic, because grading a sample of live outputs means extra model calls on top of the feature itself. The standard control is sampling: you do not judge every production output, you judge a representative slice — 1% or 5% — which catches drift without doubling your bill. The detailed token economics live in the cost-of-AI lesson and the cost-optimization lesson, and the serving choices — running a small judge model like a 7B video judge on your own hardware instead of calling a frontier API — belong to the inference-serving lesson. The structural point is the one that should guide the budget: a cheap small judge that you have calibrated and proven agrees with humans is worth more than an expensive frontier judge you have never checked. Spend the money on the calibration, not the model.
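The sampling control and the call-count arithmetic can be sketched as follows. Hash-based selection is one common way to sample deterministically and is an assumption here, not something the lesson prescribes; the traffic volume and release cadence are hypothetical, and no prices are included because they depend entirely on the judge model you choose.

```python
import hashlib


def should_judge(output_id: str, sample_rate: float) -> bool:
    """Deterministically pick about sample_rate of outputs, keyed on their ID."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in 0 .. 2**64 - 1
    return bucket < int(sample_rate * 2**64)


def judge_calls_per_month(outputs_per_month: int, sample_rate: float,
                          golden_set_size: int, releases_per_month: int) -> dict[str, int]:
    """Count judge calls: sampled live traffic online, full golden set offline."""
    return {
        "online": round(outputs_per_month * sample_rate),
        "offline": golden_set_size * releases_per_month,
    }


# Hypothetical volumes: 100,000 outputs a month, 5% sampled, 200 cases, 8 releases.
calls = judge_calls_per_month(100_000, 0.05, 200, 8)
print(calls)  # {'online': 5000, 'offline': 1600}

sampled = sum(should_judge(f"output-{i}", 0.05) for i in range(10_000))
print(sampled)  # close to 500
```

Keying the sample on the output ID means the same output is always in or out of the sample, so a rerun of the online evaluation grades the same slice.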

Where Fora Soft Fits In

We build video products across conferencing, streaming and OTT, e-learning, telemedicine, and surveillance, and almost every AI feature we add to them — recording summaries, auto-captions, moderation decisions, archive question-answering, generated b-roll — is open-ended, which means none of it can be tested with an equals sign. Our practice is the rig described here: a golden set drawn from real client footage including the hard and adversarial cases; a VLM-as-judge that grades against the frames so a fluent-but-invented summary cannot pass; a meta-evaluation step that proves the judge agrees with human reviewers before it is allowed to gate anything; and the same rig run twice over — offline as a release gate, online as a sampler whose failures flow back into the golden set. The verticals change the rubric, not the method: a telemedicine scribe is judged for clinical faithfulness, a surveillance summary for grounding, a moderation call for precision and recall, but each is the same five-part bench with a domain-specific rubric and a judge that has earned its place by agreeing with people.

References

  1. Zheng, Chiang, Sheng, Zhuang, Wu, Zhuang, et al. — "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (arXiv:2306.05685; NeurIPS 2023 Datasets and Benchmarks Track, accessed June 2026) — https://arxiv.org/abs/2306.05685 — tier 5 (peer-reviewed). Source for the foundational LLM-as-judge result that a GPT-4 judge reaches over 80% agreement with human preferences — the same level humans agree with each other — and for the first rigorous documentation of position bias, verbosity bias, and self-enhancement bias in LLM judges, plus the swap-order mitigation.
  2. Liu, Iter, Xu, Wang, Xu, Zhu — "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (arXiv:2303.16634; EMNLP 2023, accessed June 2026) — https://arxiv.org/abs/2303.16634 — tier 5 (peer-reviewed). Source for the chain-of-thought-plus-form-filling judging method and the finding that G-Eval reached the highest agreement with human ratings of any automatic metric on summarization (Spearman ≈ 0.51), establishing that a judge that writes its evaluation steps before scoring is both more accurate and more interpretable.
  3. Huang, He, Yu, Zhang, Si, Jiang, et al. — "VBench: Comprehensive Benchmark Suite for Video Generative Models" (arXiv:2311.17982; CVPR 2024 Highlight, accessed June 2026) — https://arxiv.org/abs/2311.17982 — tier 5 (peer-reviewed). Source for the 16 disentangled, human-aligned evaluation dimensions for generated video (subject consistency, motion smoothness, temporal flickering, spatial relationship, and more) and for the human-preference annotation set used to validate that the automatic dimensions track human perception.
  4. Fu, Dai, Sun, Zhang, et al. — "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis" (arXiv:2405.21075; CVPR 2025, accessed June 2026) — https://arxiv.org/abs/2405.21075 — tier 5 (peer-reviewed). Source for the multiple-choice video-understanding benchmark spanning 11 seconds to 1 hour across 6 domains and 30 subfields, adopted as an industry-standard long-context measure (Gemini 2.5 Pro scores 84.8%), and the illustration that closed benchmarks use MCQ accuracy while open-ended outputs require a judge.
  5. Wu, et al. — "VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding" (arXiv:2509.21451, 2025, accessed June 2026) — https://arxiv.org/abs/2509.21451 — tier 5 (peer-reviewed preprint). Source for purpose-built 3B/7B video (MLLM) judges that grade outputs conditioned on the video, the pointwise-and-pairwise framing, and the result that VideoJudge-7B correlates more strongly with human ratings than far larger general models (beating 32B/72B baselines on three of four meta-evaluation benchmarks).
  6. Li, Chen, Yang, et al. — "A Survey on LLM-as-a-Judge" (arXiv:2411.15594, 2024–2025, accessed June 2026) — https://arxiv.org/abs/2411.15594 — tier 5 (peer-reviewed survey). Source for the unified taxonomy of pointwise, pairwise, and listwise modes and reference-based vs reference-free settings, and for the meta-evaluation practice of measuring judge–human agreement with Cohen's kappa and Spearman correlation, including the standard guidance to randomize/swap response order against position bias.
  7. Wang, et al. — "AutoEval-Video: An Automatic Benchmark for Assessing Large Vision-Language Models in Open-Ended Video Question Answering" (arXiv:2311.14906, accessed June 2026) — https://arxiv.org/abs/2311.14906 — tier 5 (peer-reviewed). Source for the failure of BLEU/ROUGE/CIDEr on open-ended video answers and for the finding that a GPT-4 judge using instance-specific rules reaches roughly 97% evaluation accuracy, comparable to human evaluators — the empirical basis for preferring an LLM/VLM judge over lexical-overlap metrics on video language.
  8. National Institute of Standards and Technology — "AI Risk Management Framework (AI RMF 1.0)" and the "Test, Evaluation, Validation and Verification (TEVV)" program materials (NIST, 2023–2026, accessed June 2026) — https://www.nist.gov/itl/ai-risk-management-framework and https://www.nist.gov/ai-test-evaluation-validation-and-verification-tevv — tier 1 (US government standards body). Source for the MEASURE function and the TEVV (Test, Evaluation, Verification, Validation) process — the governance obligation that an eval rig operationalizes for a video AI feature, including MEASURE 2 (trustworthiness) and MEASURE 3 (continuous monitoring).
  9. International Organization for Standardization / IEC — "ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system" (ISO, 2023, accessed June 2026) — https://www.iso.org/standard/81230.html — tier 1 (international standard). Source for the AI-management-system requirement that deployed AI be evaluated and monitored against defined criteria with documented evidence — the governance counterpart to NIST's MEASURE function that an eval rig and its scorecards satisfy.
  10. Inference.net / Braintrust / Confident AI — "LLM Evaluation Tools Comparison (2026)," "DeepEval Alternatives (2026)," and "Best LLM Evaluation Tools" (vendor engineering references, 2026, accessed June 2026) — https://inference.net/content/llm-evaluation-tools-comparison/ — tier 6 (engineering reference). Source for the 2026 tooling landscape: DeepEval as a pytest-style framework with dozens of built-in metrics, Promptfoo for red-team/adversarial CI, Ragas for RAG evaluation, and Braintrust/LangSmith/Arize Phoenix for regression tracking, human annotation, and production monitoring, plus the two-tool pattern (lightweight CI gate + monitoring platform).
  11. Chen, et al. — "Is Your Video Language Model a Reliable Judge?" (arXiv:2503.05977, 2025, accessed June 2026) — https://arxiv.org/abs/2503.05977 — tier 5 (peer-reviewed preprint). Source for the caution that VLM/video judges inherit text-judge biases and add new failure modes tied to frame sampling and missed content, motivating the meta-evaluation and bias-control steps before a video judge is trusted to gate releases.