Why This Matters

If you build a learning product, run a training program, or own a course catalog, assessment is the most expensive content to author and the first thing that goes stale when a course updates. A subject expert writes good test questions slowly (a few solid items an hour), so most video courses ship with too few questions, or none. Automatic generation changes those economics: a model drafts a question bank from the transcript you already have, and your experts shift from writing to reviewing. This article gives you the quality reality, the review workflow, the standards that make a generated question countable, and the cost math, so you can decide how far to lean on AI here and brief your engineers and instructional designers without buying a demo. It builds on where AI fits in a learning product and on the quiz mechanics in in-player quizzes and polls.

What "Automatic Quiz Generation" Actually Means

"Have the AI write the quiz" hides a pipeline, and confusing the pipeline with the model is the first mistake. The model is only the middle of it.

One term up front: a large language model, or LLM, is the AI software (the same family that powers chat assistants) that reads text and produces text, including questions and answers. Feed it the words of a lecture and it can draft a question about them. That is the part everyone pictures. But a usable quiz is more than a model call: it is a transcript that has been cleaned, a model prompted with the learning objectives, a set of draft items in a known format, a human who checks them, and a publishing step that makes each item scorable and trackable inside your platform.

The useful analogy: the LLM is a fast, well-read teaching assistant who has watched the lecture once and will happily write you fifty questions in a minute. They are quick and tireless, but they sometimes misremember a detail, they write wrong answers that are too obviously wrong, and they reach for recall questions because those are easiest. You would never ship that assistant's first draft to learners without reading it. The pipeline is how you turn the draft into an assessment.

[Image: Automatic quiz generation pipeline: cleaned transcript to LLM drafting questions to a human review gate to QTI or xAPI publishing and tracking.]
Figure 1. The automatic quiz-generation pipeline end to end. The model drafts in the cheap middle; the human-review gate is where quality is met, and the publishing step is where each question becomes scorable and trackable.

How Generation Works, From a Transcript

Walk the pipeline once, because each stage decides what the next can do.

It begins with a source transcript: the corrected caption file or transcript of the video. Generation quality is capped by this input — a model cannot ask a good question about a word the transcript got wrong. If you are working from automatic captions, finish the human-edit step first; this is the same source-quality rule that governs automatic captions for learning video. A clean transcript with chapter markers also lets you generate questions per chapter, which matters because a quiz placed right after the segment it tests is far more effective than one dumped at the end.
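A minimal sketch of per-chapter grouping, assuming caption cues arrive as (start time in seconds, text) pairs and chapter markers as (start time, title) pairs sorted by start time. The function name and data shapes are illustrative assumptions, not part of any caption format.

```python
from bisect import bisect_right

def group_by_chapter(
    cues: list[tuple[float, str]],
    chapters: list[tuple[float, str]],
) -> dict[str, str]:
    """Join caption text into one transcript string per chapter.

    cues: (start_seconds, text) pairs.
    chapters: (start_seconds, title) pairs, sorted by start time.
    """
    starts = [start for start, _ in chapters]
    grouped: dict[str, list[str]] = {title: [] for _, title in chapters}
    for start, text in cues:
        # Index of the last chapter that begins at or before this cue.
        index = bisect_right(starts, start) - 1
        if index < 0:
            continue  # cue falls before the first chapter marker
        grouped[chapters[index][1]].append(text)
    return {title: " ".join(parts) for title, parts in grouped.items()}

chapters = [(0.0, "Intro"), (60.0, "Lockout")]
cues = [(5.0, "Welcome."), (61.0, "Apply the lock."), (90.0, "Test it.")]
by_chapter = group_by_chapter(cues, chapters)
```

Each value in by_chapter can then be sent to the generator on its own, so the resulting questions can be placed right after the segment they test.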

Next, the transcript and a set of instructions go to the language model. Good prompts do three things: they pin the question to a stated learning objective (what the segment is supposed to teach), they name the item type wanted (multiple-choice, true/false, and so on), and they often retrieve only the relevant transcript passage rather than the whole course — a technique called retrieval-augmented generation, or RAG, that keeps the model anchored to the real material and cuts the chance it invents a fact. The model internals — how the LLM and the retrieval layer work — live in our AI for Video Engineering section; this article covers the learning wiring and the product decision, not the model architecture. See the video-summarization 2026 playbook for the closest model-side treatment.
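A minimal sketch of those three prompt ingredients, assuming the transcript is already split into passages. The keyword-overlap lookup is a deliberately crude stand-in for a real retrieval layer, and the function names and prompt wording are hypothetical, not any vendor's API.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set; crude, but enough for a keyword-overlap sketch."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def retrieve_passage(objective: str, passages: list[str]) -> str:
    """Pick the passage that shares the most words with the objective.

    Stand-in for a real retrieval layer such as embedding search.
    """
    target = tokenize(objective)
    return max(passages, key=lambda passage: len(tokenize(passage) & target))

def build_prompt(objective: str, item_type: str, passages: list[str]) -> str:
    """Assemble a prompt from the objective, item type, and retrieved passage."""
    passage = retrieve_passage(objective, passages)
    return (
        f"Learning objective: {objective}\n"
        f"Item type: {item_type}\n"
        "Use ONLY the passage below as the source of facts.\n"
        f"Passage:\n{passage}\n"
        "Write one question with a key and three distractors."
    )

passages = [
    "Welcome to the course. Today we cover the agenda and housekeeping.",
    "Before servicing a machine, apply a lockout device to isolate energy.",
]
prompt = build_prompt(
    "Explain when to apply a lockout device", "multiple-choice", passages
)
```

Sending only the retrieved passage, rather than the whole course, is what keeps the model anchored to the material the segment actually teaches.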

The model emits draft items. A single multiple-choice item has three parts: the stem (the question), the key (the correct answer), and the distractors (the plausible wrong answers). The model is reliably good at the stem and the key. The distractors are where it struggles, and that is the heart of this article.

Then comes the step that separates an assessment from a liability: the human-review gate. A subject expert checks each item for factual accuracy, distractor quality, alignment to the objective, cognitive level, and bias, then accepts, edits, or rejects it. This is the same review gate every AI learning feature needs, and for assessment it is non-negotiable because a wrong answer key teaches the wrong thing to everyone who takes the quiz.

Only then does the result become a published, trackable item: stored in a portable format, rendered in the player, scored on submission, and emitted as a statement your analytics can count. We come back to exactly how below.

The Hard Part: Distractor Quality

Here is the misunderstanding that costs the most. Writing the question is easy; writing the wrong answers is the craft, and it is where automatic generation most often falls short.

A good distractor is wrong but plausible — it reflects a real misconception a learner might hold, so choosing it tells you something diagnostic. A bad distractor is wrong and obviously wrong, so a learner eliminates it without knowing the material, and the question stops measuring anything. Decades of testing research (the standard reference is Haladyna's guidelines for writing multiple-choice items) make the same point: the difficulty and the diagnostic value of a multiple-choice question live almost entirely in its distractors.

Left alone, an LLM tends to produce one strong distractor and two weak ones, or distractors that are grammatically inconsistent with the stem (a classic giveaway — the grammatically matching option is the answer). It also likes "all of the above" and absolute words like "always" and "never," both of which test-wise learners exploit. So the highest-leverage instruction you can give the generator is about distractors: ask for distractors drawn from common misconceptions about the topic, matched in length and grammar to the key, with no "all/none of the above." Even then, a human confirms each distractor is plausible and genuinely wrong — the single most valuable thing the reviewer does.
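Some of these giveaways can be flagged mechanically before a reviewer sees the item. The sketch below is one hypothetical pre-check: the Item shape, the word lists, and the length-ratio threshold are all assumptions chosen for illustration, and the check cannot judge whether a distractor is plausible, which remains the reviewer's job.

```python
from dataclasses import dataclass

ABSOLUTES = {"always", "never"}
BANNED_OPTIONS = {"all of the above", "none of the above"}

@dataclass
class Item:
    stem: str
    key: str
    distractors: list[str]

def lint_distractors(item: Item, max_length_ratio: float = 2.0) -> list[str]:
    """Return warnings for mechanical giveaways; an empty list means no flags."""
    warnings = []
    key_length = len(item.key.split())
    for option in item.distractors:
        text = option.strip().lower().rstrip(".")
        words = text.split()
        if text in BANNED_OPTIONS:
            warnings.append(f"banned option: {option!r}")
        if ABSOLUTES & set(words):
            warnings.append(f"absolute wording: {option!r}")
        # Compare word counts; a large gap makes the option stand out.
        longer = max(len(words), key_length)
        shorter = min(len(words), key_length)
        if shorter == 0 or longer / shorter > max_length_ratio:
            warnings.append(f"length mismatch with key: {option!r}")
    return warnings

draft = Item(
    stem="What should you do before servicing the machine?",
    key="Apply a lockout device first",
    distractors=[
        "Unplug the machine after servicing",
        "All of the above",
        "Never",
    ],
)
flags = lint_distractors(draft)
```

For this draft the check raises three flags: one banned option, one absolute word, and one length mismatch. The first distractor passes the lint, but only a human can say whether it reflects a real misconception.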

[Image: Anatomy of a generated multiple-choice item: stem, correct key, and three distractors, with plausible versus giveaway distractors marked.]
Figure 2. Anatomy of a generated question. The model is reliable on the stem and the key; the distractors (plausible-but-wrong options) are the hard part and the focus of human review.

What It Gets Right, and Where It Misleads

Automatic generation is genuinely good at one band of questions and genuinely weak at another, and knowing the line keeps you from over-trusting it.

Educators sort questions by the kind of thinking they demand, using a long-standing ladder called Bloom's taxonomy (in its revised 2001 form): at the bottom, remember and understand (recall a fact, restate a definition); in the middle, apply and analyze (use a procedure, compare cases); at the top, evaluate and create (judge a trade-off, design something new). LLMs cluster their output at the bottom two rungs, because recall and definition questions can be lifted almost directly from the transcript. They can be pushed up to "apply" and "analyze" with careful prompting and good source material, but they rarely produce a strong "evaluate" or "create" item on their own — and those are the questions that measure real understanding.

The second failure mode is the one that defines all generative AI: hallucination — the model stating something fluent and false. In a quiz this is especially dangerous because it can land in the answer key, where it is presented to every learner as correct. Retrieval-augmented generation reduces it by anchoring the model to the actual transcript, but it does not eliminate it, which is why a human verifies every key against the source.

[Image: Bloom's taxonomy ladder showing automatic generation clustering at remember and understand, reaching apply and analyze with effort, and rarely producing evaluate or create items.]
Figure 3. Where automatic generation lands on Bloom's taxonomy. It is strong at recall and comprehension, reaches application and analysis with effort, and rarely produces sound higher-order items, which are exactly the ones that measure deep learning.

The Human-Review Gate

Because the model is strong on the easy half and weak on the hard half, the review gate is not optional polish — it is the step that produces the assessment. Make it a defined workflow, not a vibe check.

A reviewer asks five questions of every item. Is the key factually correct against the source? Are the distractors plausible and genuinely wrong, matched in length and grammar? Does the item align with the learning objective for that segment, rather than testing a trivial aside? Is it at the right cognitive level — and does the bank as a whole include some apply/analyze items, not only recall? Is it free of bias and culturally loaded examples that disadvantage some learners? An item passes only when all five hold; otherwise the reviewer edits or rejects it.
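The five checks can be recorded as a small data structure so the gate is auditable rather than informal. This is a sketch under assumptions: the field names are invented for illustration, and the code only reports which checks failed; whether a failing item is edited or rejected stays the reviewer's decision.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ReviewChecks:
    """One reviewer's answers to the five gate questions for a single item."""
    key_is_accurate: bool
    distractors_are_sound: bool
    aligned_to_objective: bool
    cognitive_level_ok: bool
    free_of_bias: bool

def failed_checks(checks: ReviewChecks) -> list[str]:
    """Names of the checks that did not hold, in declaration order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]

def passes_gate(checks: ReviewChecks) -> bool:
    """An item passes only when all five checks hold."""
    return not failed_checks(checks)

review = ReviewChecks(
    key_is_accurate=True,
    distractors_are_sound=False,
    aligned_to_objective=True,
    cognitive_level_ok=True,
    free_of_bias=True,
)
```

Here passes_gate(review) is False and failed_checks(review) names the distractor check, which tells the reviewer where the edit is needed.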

1EdTech (the standards body behind QTI and LTI) published AI-Generated Content Best Practices (v1.0, 2024) precisely because this workflow needs guardrails: it recommends human oversight of AI-generated assessment content, transparency about what was machine-generated, and attention to bias and accessibility. Treat that document as the policy spine for your review gate. The practical reframe to give stakeholders: AI does not remove the expert from assessment authoring; it moves the expert from writing to reviewing and editing, which is roughly three to five times faster per item and produces a larger, fresher bank.

[Image: Human-review gate for generated questions: five checks (factual accuracy, distractor quality, objective alignment, cognitive level, and bias) gating accept, edit, or reject.]
Figure 4. The human-review gate. Every generated item passes five checks before it reaches a learner; failing any one sends it back to edit or reject. This is the step that turns a draft into an assessment.

Making Generated Questions Count: Standards and Tracking

A question the player cannot score and the analytics cannot count is not finished; it is a paragraph of text. Two families of standards make a generated item a real, portable, trackable assessment, and both are worth naming.

The first is QTI — Question and Test Interoperability, the 1EdTech standard (current version 3.0, 2022) that packages questions and tests so any conformant system can present, score, and exchange them. Think of QTI as a shipping container for assessment: it carries the stem, the options, the scoring logic, and even item statistics like difficulty, so a question authored or generated in one tool can move into another LMS without being rebuilt. If you generate items and store them as QTI 3.0, you keep them portable and you inherit its accessibility features (QTI 3.0 aligns to WCAG 2.1 AA and Section 508) instead of hand-rolling them. Generating straight into QTI also forces the generator to commit to a real item structure rather than loose prose.
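To make "a real item structure" concrete, here is a rough sketch that serializes one single-answer multiple-choice item as QTI-style XML using Python's standard library. The element names, attribute names, namespace, and response-processing template URL are written from memory of the QTI 3.0 specification and are assumptions here; validate any real output against the official 1EdTech schema. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace and template URL as recalled from QTI 3.0; verify before use.
QTI_NS = "http://www.imsglobal.org/xsd/imsqtiasi_v3p0"
MATCH_CORRECT = (
    "https://purl.imsglobal.org/spec/qti/v3p0/rptemplates/match_correct.xml"
)

def build_choice_item(
    identifier: str, stem: str, options: dict[str, str], key: str
) -> str:
    """Serialize one single-answer multiple-choice item as QTI-style XML."""
    if key not in options:
        raise ValueError(f"key {key!r} is not one of the options")

    item = ET.Element("qti-assessment-item", {
        "xmlns": QTI_NS,
        "identifier": identifier,
        "title": identifier,
        "adaptive": "false",
        "time-dependent": "false",
    })

    # Declare the response variable and its correct value (the key).
    response = ET.SubElement(item, "qti-response-declaration", {
        "identifier": "RESPONSE",
        "cardinality": "single",
        "base-type": "identifier",
    })
    correct = ET.SubElement(response, "qti-correct-response")
    ET.SubElement(correct, "qti-value").text = key

    # Declare the score outcome that response processing sets.
    ET.SubElement(item, "qti-outcome-declaration", {
        "identifier": "SCORE", "cardinality": "single", "base-type": "float",
    })

    # The visible part: the stem as the prompt, then one choice per option.
    body = ET.SubElement(item, "qti-item-body")
    interaction = ET.SubElement(body, "qti-choice-interaction", {
        "response-identifier": "RESPONSE", "max-choices": "1",
    })
    ET.SubElement(interaction, "qti-prompt").text = stem
    for option_id, text in options.items():
        choice = ET.SubElement(
            interaction, "qti-simple-choice", {"identifier": option_id}
        )
        choice.text = text

    ET.SubElement(item, "qti-response-processing", {"template": MATCH_CORRECT})
    return ET.tostring(item, encoding="unicode")

qti_xml = build_choice_item(
    "safety-q4",
    "What should you do before servicing the machine?",
    {"a": "Unplug it afterwards", "b": "Apply a lockout device", "c": "Notify payroll"},
    key="b",
)
```

The useful property is that the generator cannot emit an item without committing to a key that is one of the declared options; loose prose has no such constraint.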

The second is xAPI — the Experience API, the standard that records learning as statements like "Maria answered Question 4 correctly" to a Learning Record Store, the database those statements live in. xAPI (version 1.0.3, ADL) defines a fixed set of interaction types your generated question must map to, so the result is machine-countable. The valid types are exactly: true-false, choice (multiple choice), fill-in, long-fill-in, matching, sequencing, likert, numeric, performance, and other. Generate to one of these types — not to a freeform shape your analytics cannot parse — and each attempt emits a structured statement carrying the question, the learner's response, and whether it was correct. For the statement design in depth, see tracking video with xAPI; for packaging inside a course, see SCORM explained.

A minimal xAPI statement for one answered question, with obviously fake data:

{
  "actor": { "name": "Test Learner", "mbox": "mailto:learner@example.org" },
  "verb": { "id": "http://adlnet.gov/expapi/verbs/answered",
            "display": { "en-US": "answered" } },
  "object": {
    "id": "https://example.org/course/safety/q4",
    "definition": {
      "type": "http://adlnet.gov/expapi/activities/cmi.interaction",
      "interactionType": "choice",
      "correctResponsesPattern": ["b"]
    }
  },
  "result": { "success": true, "response": "b" }
}

The point of the example is the shape, not the syntax: because the item declares its interactionType and correctResponsesPattern, the platform — not a human — knows whether the answer was right and can roll it into learning metrics. The discipline that makes this useful at scale is stable naming: give each generated item a durable id and a consistent verb, so "answered the safety quiz" is one metric across every learner and every language, not a scatter of one-off events.
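The same statement shape can be produced programmatically, with the interaction type checked against the list of valid types before anything is emitted. This is a sketch, not a full xAPI client: the function name is hypothetical, and the simple membership test for success only suits single-answer items, since multi-select patterns use a delimiter format not handled here.

```python
XAPI_INTERACTION_TYPES = frozenset({
    "true-false", "choice", "fill-in", "long-fill-in", "matching",
    "sequencing", "likert", "numeric", "performance", "other",
})

def build_answered_statement(
    learner_name: str,
    learner_mbox: str,
    item_id: str,
    interaction_type: str,
    correct_pattern: list[str],
    response: str,
) -> dict:
    """Build an 'answered' statement dict for one single-answer attempt."""
    if interaction_type not in XAPI_INTERACTION_TYPES:
        raise ValueError(f"not an xAPI interaction type: {interaction_type!r}")
    return {
        "actor": {"name": learner_name, "mbox": learner_mbox},
        "verb": {
            "id": "http://adlnet.gov/expapi/verbs/answered",
            "display": {"en-US": "answered"},
        },
        "object": {
            "id": item_id,  # keep this id stable across regenerations
            "definition": {
                "type": "http://adlnet.gov/expapi/activities/cmi.interaction",
                "interactionType": interaction_type,
                "correctResponsesPattern": correct_pattern,
            },
        },
        # Scored by the platform: correct if the response matches a pattern.
        "result": {"success": response in correct_pattern, "response": response},
    }

statement = build_answered_statement(
    "Test Learner",
    "mailto:learner@example.org",
    "https://example.org/course/safety/q4",
    "choice",
    ["b"],
    "b",
)
```

Rejecting an unknown interaction type at build time is the code-level version of the rule above: a freeform shape never reaches the Learning Record Store.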

Standard / format | What it carries | Best for | Standards support
QTI 3.0 | Stem, options, scoring logic, item statistics | Portable item banks, exchange between tools | QTI native; renders in any QTI-conformant LMS
xAPI 1.0.3 interaction | The attempt: question, response, correct/not | Tracking answers as analytics statements | xAPI; pairs with cmi5 for course-level pass/fail
SCORM interactions | Fixed completion/score/interactions set | Tracking inside a SCORM launch in an older LMS | SCORM 1.2 / 2004; limited vs xAPI
Raw player JSON | Whatever you define | Quick prototype only, not portable | None; locks you in, hard to report on

Are the Questions Any Good? A Word on Psychometrics

Generating an item and shipping it is not the end. A question only earns its place by behaving well when real learners answer it, and the field that measures this is psychometrics — the statistics of test items. Two numbers matter most. Item difficulty is the share of learners who get it right; an item that everyone or no one passes measures nothing. Item discrimination is whether stronger learners outperform weaker ones on that item; a question that good and poor learners answer identically is broken. A distractor analysis then checks that each wrong option actually attracts some learners — a distractor nobody ever picks is dead weight, exactly the weak-distractor problem generation creates.
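These three measures are simple to compute from pilot data. The sketch below uses the share correct for difficulty and an upper-minus-lower group difference for discrimination; comparing the top and bottom 27 percent of learners by total score is a common classical-test-theory convention and is an assumption here, as are the function names and the sample data.

```python
from collections import Counter

def item_difficulty(correct: list[bool]) -> float:
    """Share of learners who answered the item correctly."""
    return sum(correct) / len(correct)

def discrimination_index(
    correct: list[bool], total_scores: list[float], fraction: float = 0.27
) -> float:
    """Share correct in the top group minus share correct in the bottom group."""
    ranked = sorted(zip(total_scores, correct), key=lambda pair: pair[0])
    group_size = max(1, round(len(ranked) * fraction))
    lower = ranked[:group_size]
    upper = ranked[-group_size:]
    upper_share = sum(is_right for _, is_right in upper) / group_size
    lower_share = sum(is_right for _, is_right in lower) / group_size
    return upper_share - lower_share

def distractor_counts(
    responses: list[str], options: list[str], key: str
) -> dict[str, int]:
    """How many learners chose each wrong option; zero marks dead weight."""
    counts = Counter(responses)
    return {option: counts.get(option, 0) for option in options if option != key}

# Ten pilot learners: their answer to this item and their total quiz score.
responses = ["b", "b", "b", "b", "b", "b", "a", "a", "c", "a"]
total_scores = [95, 90, 88, 80, 75, 70, 60, 55, 50, 40]
correct = [answer == "b" for answer in responses]

difficulty = item_difficulty(correct)
discrimination = discrimination_index(correct, total_scores)
picks = distractor_counts(responses, ["a", "b", "c", "d"], key="b")
```

In this sample the item has a difficulty of 0.6, the top group outperforms the bottom group completely, and option "d" is never chosen, so it is the distractor to rewrite.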

The practical loop: generate, review, pilot the item with a cohort, read its difficulty and discrimination, and keep, fix, or retire it. This is also where the academic field of automatic item generation (AIG) — a discipline that predates LLMs and uses expert-built templates to mass-produce calibrated items — meets the new generative approach. Templates give statistical control but cost expert effort up front; LLMs give speed and breadth but need the review and the pilot to earn the same trust. Most serious programs will blend them. None of this is unique to e-learning, but it is the difference between a quiz that looks impressive in a demo and an assessment a certificate can stand on — which matters most when the result gates a credential, the territory of anti-cheating and assessment design.

A Common Mistake: Generate and Publish

The failure we see most is a team that runs the transcript through a model, gets a hundred questions, sees them render in the player, and marks the course "assessed." It demos beautifully and fails on contact with learners. A test-wise learner spots the giveaway distractors and scores well without knowing the material; a careful learner hits a hallucinated answer key and is taught something false; and the whole bank sits at the bottom of Bloom's taxonomy, testing recall while the course claims to build judgment. The second, quieter failure is structural: questions generated as freeform text with no interactionType and no stable id, so the player cannot auto-score them and the analytics cannot report on them. The result is a quiz that looks like assessment but produces no data. The fix for the first is the human-review gate with a distractor and accuracy pass; the fix for the second is generating straight into QTI 3.0 and xAPI interaction types from the start. Automatic generation is a draft. Shipping the draft is not assessment; it is the appearance of assessment, at the cost of the learners who trusted the score.

The Math: Building a 200-Question Bank

Lead with the business trade-off, because assessment authoring is the cost that scales with every course and every update.

Take a course that needs a 200-question bank. Author it the traditional way, with a subject expert writing items from scratch:

Expert authoring from scratch:
  ~4 solid items per hour
  200 items ÷ 4 = 50 hours of expert time
  50 hours × $80/hour ≈ $4,000 per bank

Now author it AI-assisted — the model drafts, the expert reviews and edits:

AI-assisted (generate, then review):
  Generation pass: minutes of compute ≈ negligible
  Expert review at ~15 reviewed/edited items per hour
  200 items ÷ 15 ≈ 13.3 hours of expert time
  13.3 hours × $80/hour ≈ $1,067 per bank

The review path costs roughly a quarter as much and finishes in a fraction of the time, and it scales: regenerating questions for an updated module is a cheap model pass plus a short review, not a fresh 50-hour authoring project. But read what the saving is not: it is not "free questions." The review and the pilot are the parts that buy validity, and cutting them is the false economy that ships a hundred broken items. The realistic claim is a three-to-five-times speed-up per item with the expert kept firmly in the loop, and a question bank that stays fresh as the course changes. For the broader build-and-run picture, see the learning-platform cost model and building vs buying AI features.
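The arithmetic above is easy to rerun with your own numbers. A small sketch, using the article's figures (200 items, 4 authored versus 15 reviewed items per hour, $80 per hour) as inputs; the function name is hypothetical and generation compute is treated as negligible, as in the worked example.

```python
def bank_cost(
    items: int, items_per_hour: float, hourly_rate: float
) -> tuple[float, float]:
    """Return (expert hours, expert cost) for a question bank."""
    hours = items / items_per_hour
    return hours, hours * hourly_rate

scratch_hours, scratch_cost = bank_cost(200, items_per_hour=4, hourly_rate=80)
assisted_hours, assisted_cost = bank_cost(200, items_per_hour=15, hourly_rate=80)

cost_ratio = assisted_cost / scratch_cost   # about 0.27, roughly a quarter
speed_up = scratch_hours / assisted_hours   # 3.75 times faster per item
```

Swapping in your own review rate shows how sensitive the saving is: at 8 reviewed items per hour instead of 15, the cost ratio rises to one half.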

[Image: Cost to build a 200-question bank: expert authoring from scratch versus AI-assisted generation with human review, in US dollars and expert hours.]
Figure 5. The authoring cost trade-off. AI-assisted generation with a human-review gate costs roughly a quarter of from-scratch authoring for a 200-question bank, and the review step is the part that buys validity.

Where Fora Soft Fits In

We build assessment generation into the learning product, not as a bolt-on that emits orphaned text. Fora Soft has shipped video conferencing, streaming, e-learning, and AI-driven video features since 2005, so when a client wants AI-generated quizzes we start from the build-vs-buy trade-off: an assessment-generation vendor is fastest to a working demo and bills per item or per seat, while a generation pipeline wired into your own authoring system costs engineering up front but keeps the question bank, the corrected items, and the learner-response data in-house. We wire together the cleaned-transcript input, the objective-anchored generation prompt, and the human-review gate, and we output straight into QTI 3.0 items and xAPI interaction statements, so each question enters your catalog as a portable, auto-scored, trackable asset rather than a paragraph the player cannot grade. We are candid that the review gate and an item pilot are the parts that buy validity, and that skipping them is the costliest shortcut in AI assessment.

References

  1. Question & Test Interoperability (QTI) 3.0 — Implementation Guide — 1EdTech Consortium, 2022. The standard for packaging, scoring, and exchanging assessment items and tests; carries scoring logic and item statistics; aligns to WCAG 2.1 AA and Section 508. Tier 1. https://www.imsglobal.org/spec/qti/v3p0/impl
  2. Experience API (xAPI) Specification, version 1.0.3, Part 2: Data — Interaction Activities — Advanced Distributed Learning (ADL) Initiative. Defines the valid interaction types (true-false, choice, fill-in, long-fill-in, matching, sequencing, likert, numeric, performance, other) and the correctResponsesPattern formats a generated question must map to. Tier 1. https://github.com/adlnet/xAPI-Spec/blob/master/xAPI-Data.md
  3. AI-Generated Content Best Practices, v1.0 — 1EdTech Consortium, 2024. Guidance for AI-generated learning and assessment content: human oversight, transparency about machine-generated material, bias and accessibility attention. Tier 1. https://www.imsglobal.org/resource/AI-Generated_Content_Best_Practices/v1p0
  4. cmi5 Specification — Advanced Distributed Learning (ADL). The xAPI profile that carries course-level launch and pass/fail, into which a generated quiz's interaction statements roll up. Tier 1. https://github.com/AICC/CMI-5_Spec_Current
  5. Web Content Accessibility Guidelines (WCAG) 2.1, Level AA — World Wide Web Consortium (W3C), 2018. The accessibility bar QTI 3.0 assessment presentation aligns to; generated items must meet it too. Tier 1. https://www.w3.org/TR/WCAG21/
  6. Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. — "A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment" — Applied Measurement in Education, 2002. The standard guidance that item difficulty and diagnostic value live in the distractors; the basis for the review gate's distractor checks. Tier 5. https://doi.org/10.1207/S15324818AME1503_5
  7. Anderson, L. W., & Krathwohl, D. R. (Eds.) — A Taxonomy for Learning, Teaching, and Assessing (the revised Bloom's taxonomy) — 2001. The cognitive-level ladder (remember → create) used to judge where generated items land. Tier 5. https://www.depauw.edu/files/resources/krathwohl.pdf
  8. Gierl, M. J., & Lai, H. — "Automatic Item Generation" (and item development for assessment) — Educational Measurement / Routledge handbooks, 2013 onward. The template-based AIG discipline that LLM generation now complements; the source for the calibration-versus-speed trade-off. Tier 5. https://doi.org/10.4324/9780203803912
  9. OpenAI / Anthropic model documentation on hallucination and retrieval-augmented generation — vendor engineering documentation, 2024–2026. The mechanism by which RAG anchors generation to source text and reduces, but does not eliminate, fabricated answer keys. Tier 4. https://platform.openai.com/docs/guides/retrieval

Where sources disagreed, the standards win: vendor claims that AI "instantly creates ready-to-use quizzes" were overridden by the 1EdTech AI-Generated Content Best Practices (ref 3) and the multiple-choice item-writing literature (ref 6), which together establish that generated items require human review of accuracy, distractors, and cognitive level before they are valid to deploy.