This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.
Why this matters
If you build or run a learning product, your engagement dashboard probably leads with watch-time and completion — and both can be high while learning is low. A learner can stream a lecture to the end in a background tab and absorb nothing. Interaction signals are the cheapest reliable evidence you have that thinking actually happened, and the research is consistent that active behaviours predict learning better than time-on-task alone. This article is the active-learning companion to Video Engagement: Watch-Time, Drop-Off, and Re-Watch and to Learning Analytics; read it before you decide which events your player and platform will emit, because retrofitting interaction tracking after launch means re-shipping the player.
Watch-time proves playback, not thinking
Start with the metric every learning product reports and the reason it is not enough. Watch-time — the amount of a video a learner plays — is covered in depth in its own article, Video Engagement, and it is genuinely useful for diagnosing content: where people drop off, what they re-watch. But as evidence that learning happened, it is weak, because it measures exposure, not effort. Playing a video is something that happens to a learner; learning is something a learner does.
The gap is easy to make concrete. Imagine two cohorts watch the same ten-minute lesson, and both reach about 95% average watch-time — by the watch-time number, identical engagement. Cohort A only watched. Cohort B answered three short questions embedded in the video and wrote a few notes in their own words. On the post-test, cohort A averages 68% and cohort B averages 90%. That is a 22-point gap (90 − 68 = 22) that the watch-time metric could not see, because watch-time was blind to the one thing that differed: what each cohort did while the video played. Those post-test figures are not invented — they track a well-known experiment on testing during lectures that we will return to below [2].
So the job of this article is to define the signals that do see that difference — the interaction signals — and to give them an order, because not all interactions are equal.
The active-learning ladder: a framework for ranking signals
To rank interaction signals you need a theory of why some interactions help more than others, and the most useful one in learning science is the ICAP framework, published by Micki Chi and Ruth Wylie in 2014 [1]. ICAP stands for the four modes it names (Interactive, Constructive, Active, and Passive), and its core idea is exactly what a product team needs: it classifies engagement by the learner's overt behaviour, the observable thing they do, not by an unmeasurable inner state [1]. Overt behaviour is precisely what software can capture as an event, which is why a framework from cognitive science turns out to be the right schema for an analytics layer.
The framework defines four modes, each a rung on a ladder, characterised in a single word [1]:
- Passive — receiving. The learner is oriented toward the material and taking it in, without doing anything else: watching a video, listening to a lecture. This is the watch-time rung.
- Active — manipulating. The learner performs some physical action that selects or manipulates the content: pausing, rewinding and re-watching a segment, highlighting a sentence, taking verbatim notes that copy what was said.
- Constructive — generating. The learner produces something that goes beyond the material as given: notes in their own words, a self-explanation, a concept map, a question about what they did not understand.
- Interactive — dialoguing. Two learners (or a learner and a capable tutor) take turns building on each other's contributions, so that ideas emerge that neither held alone — a substantive discussion, peer instruction, a debate.
The ICAP hypothesis is that learning rises as you climb: Interactive > Constructive > Active > Passive [1]. Chi and Wylie support it with a sweep of laboratory and classroom studies; the practical reading for us is the ranking itself. It tells you that a learner who only watches (Passive) sits on the lowest rung, that a re-watch or a highlight (Active) is a step up, that a note written in the learner's own words or a question asked (Constructive) is higher still, and that a genuine discussion (Interactive) is the top. Your interaction signals should be weighted the same way — a discussion reply is worth more as evidence of learning than a reflexive pause.
Figure 1. The active-learning ladder. Learning rises from Passive to Interactive; each rung names the learner behaviour, a concrete video-learning example, and the signal a platform can capture.
One nuance keeps teams honest. A mode is defined by what the learner does with the content, not by the feature's label. Note-taking can be Active (copying a slide verbatim) or Constructive (rephrasing the idea in your own words) — the same feature, two different rungs [1]. A quiz can be Active (recognising the right answer from four options) or push toward Constructive (a short written explanation). This is why the kind of interaction matters more than the count, a point the next section makes its central warning.
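One rough way to separate a verbatim note from an own-words note is to measure how much of the note's vocabulary already appears in the transcript segment it was taken against. The sketch below is a crude proxy under stated assumptions, not a method from the ICAP paper: the function names, the token-overlap heuristic, and the 0.8 threshold are all hypothetical and would need tuning against hand-labelled notes.

```python
import re


def token_set(text: str) -> set[str]:
    """Lowercase word tokens; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def classify_note(note: str, transcript_segment: str,
                  verbatim_threshold: float = 0.8) -> str:
    """Label a note 'Active' (near-verbatim) or 'Constructive' (own words).

    The threshold is an illustrative assumption, not a value from ICAP.
    """
    note_tokens = token_set(note)
    if not note_tokens:
        raise ValueError("note contains no words")
    # Share of the note's distinct words that also occur in the transcript.
    shared = note_tokens & token_set(transcript_segment)
    overlap = len(shared) / len(note_tokens)
    return "Active" if overlap >= verbatim_threshold else "Constructive"


transcript = "Retrieval practice strengthens memory more than re-reading the material."
copied = classify_note("retrieval practice strengthens memory", transcript)
rephrased = classify_note("Testing myself works better than just reading again", transcript)
```

Word overlap will misjudge short notes and close paraphrases, so treat the label as a hint for the dashboard rather than a verdict on the learner.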
The signals worth capturing, and what each predicts
With the ladder as a guide, here are the interaction signals a learning-video product can realistically capture, roughly from lower to higher on ICAP. Each is an overt behaviour, each maps to a mode, and each predicts something different.
In-video controls: pause, rewind, re-watch, speed. These are Active-mode signals — the learner is manipulating the content. A pause-and-rewind cluster on one segment is the re-watch hotspot covered in Video Engagement: it flags either a valuable moment or a confusing one. On its own a rewind is a weak learning signal, but as context next to a wrong answer it becomes a strong confusion signal [10].
Quiz and in-video question attempts. This is the highest-value signal most products can capture cheaply, because answering a question is retrieval practice — the act of pulling an answer from memory, which strengthens it. The number to remember comes from a 2013 experiment by Szpunar, Khan, and Schacter: students who got short tests interspersed through an online lecture scored 90% on a final test, against 76% for a group that simply re-studied the same material and 68% for a group that was not tested at all [2]. The tested group also reported less mind-wandering and took more notes [2]. Brame's review of educational video reaches the same conclusion: interpolated questions are one of the most reliable ways to convert passive watching into active learning [3]. Capture not just whether a quiz was attempted, but the response, whether it was correct, and how long it took.
Note-taking and annotation. Notes, highlights, and bookmarks are the classic Active-or-Constructive signal. The distinction is the payoff: a highlight or a verbatim note is Active; a note written in the learner's own words is Constructive and predicts deeper understanding [1]. A product can nudge learners up the ladder by prompting "in your own words" rather than offering only a highlighter. The annotation features themselves are covered in Notes, Bookmarks, and Learner Annotation.
Question-asking. A learner who types a question has generated something beyond the material — a Constructive act, and a particularly honest signal, because it marks the exact place understanding broke down. Counting questions per lesson, and clustering them by timestamp, points content teams at the segments that need a re-cut far more precisely than a drop-off curve alone.
Discussion and peer interaction. Replies in a discussion thread, peer reviews, and group activities are the Interactive rung — the top of the ladder when the exchange is substantive and turn-taking, not one-line "great post" replies [1]. These are the strongest learning signals you can collect, and the hardest, because quality matters more than count here than anywhere else.
Across large datasets these active behaviours show up as predictors: studies of learning-platform logs repeatedly find that voluntary formative-quiz attempts and discussion participation correlate with course success more reliably than raw time-on-platform [9]. The lesson is not that watch-time is useless; it is that watch-time belongs near the bottom of the ladder, and your dashboard should climb.
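The question-clustering idea described above can be sketched as a simple bucket count over the video timestamps at which learners typed a question. The function name, the 30-second bucket, and the minimum of three questions are illustrative assumptions; the sample timestamps are made up for the example.

```python
from collections import Counter


def question_hotspots(question_times_s: list[float], bucket_s: int = 30,
                      min_count: int = 3) -> list[tuple[int, int]]:
    """Return (bucket_start_seconds, question_count), busiest bucket first.

    Only buckets with at least `min_count` questions are reported.
    """
    buckets = Counter(int(t // bucket_s) * bucket_s for t in question_times_s)
    hot = [(start, n) for start, n in buckets.items() if n >= min_count]
    # Sort by count descending, then by position in the video.
    return sorted(hot, key=lambda item: (-item[1], item[0]))


# Made-up timestamps (seconds into the lesson) at which questions were asked.
sample_times = [12, 95, 100, 104, 118, 301, 305, 310, 590]
hotspots = question_hotspots(sample_times)
```

With this sample, the segments starting at 90 s and 300 s surface as the places to review first.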
Figure 2. The interaction signals worth capturing, each mapped to its ICAP mode, what it predicts about learning, and the standard event used to capture it.
Frequency versus quality: the vanity-metric trap
Here is the mistake that gives this article its title. Having learned that interactions predict learning, teams reach for the easiest version — interaction frequency, a raw count of clicks, pauses, and answers per learner — and treat a bigger number as better. It is the same trap as watch-time, one rung up. A high interaction count can mean deep engagement, or it can mean a learner clicking randomly, re-watching because they are lost, or gaming a participation score.
ICAP is the corrective, because it says depth, not count, is what predicts learning [1]. Walk a concrete comparison. Learner X pauses the video eight times and re-watches two segments: ten Active-mode events, an impressive-looking tally. Learner Y pauses twice, but writes two notes in their own words and answers three in-video questions: seven events, five of them Constructive or grounded in retrieval. By a frequency count, X still "interacted" more (10 events versus 7). By ICAP, Y did the work that predicts learning. A dashboard that ranks X above Y is measuring motion, not thought.
The fix is to weight signals by mode rather than summing them flat. A simple, defensible scheme assigns each captured event a weight — say Passive 0, Active 1, Constructive 3, Interactive 5 — and reports the weighted total alongside the raw count, so a learner who self-explains and discusses outranks one who only clicks. The exact weights are yours to tune; the principle is fixed. Report the composition of interaction, not just its volume: "how much of this learner's engagement was Constructive or above?" is the question that matches the science. And never let an interaction count become a grade input without a quality check — that is the surest way to teach learners to click for credit.
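A minimal sketch of that weighting, using the example weights from the paragraph above (Passive 0, Active 1, Constructive 3, Interactive 5). The event names and the event-to-mode mapping are hypothetical. In particular, it assumes the in-video questions ask for a short written answer and therefore count as Constructive; a recognise-one-of-four question would map to Active instead.

```python
from collections import Counter

# Example weights from the text; tune them for your own product.
MODE_WEIGHTS = {"Passive": 0, "Active": 1, "Constructive": 3, "Interactive": 5}

# Hypothetical event names and their assumed ICAP modes.
EVENT_MODES = {
    "played": "Passive",
    "paused": "Active",
    "rewatched": "Active",
    "highlighted": "Active",
    "answered_choice": "Active",
    "answered_written": "Constructive",
    "noted_own_words": "Constructive",
    "asked": "Constructive",
    "replied": "Interactive",
}


def summarise(events: list[str]) -> dict:
    """Report raw count, weighted total, and share of Constructive-or-above events."""
    modes = Counter(EVENT_MODES[name] for name in events)
    raw = len(events)
    weighted = sum(MODE_WEIGHTS[mode] * n for mode, n in modes.items())
    deep = modes["Constructive"] + modes["Interactive"]
    return {
        "raw_count": raw,
        "weighted_total": weighted,
        "constructive_or_above_share": deep / raw if raw else 0.0,
    }


learner_x = ["paused"] * 8 + ["rewatched"] * 2
learner_y = ["paused"] * 2 + ["noted_own_words"] * 2 + ["answered_written"] * 3
x_summary = summarise(learner_x)
y_summary = summarise(learner_y)
```

Under these assumptions X has the larger raw count while Y has the larger weighted total, and the composition figure shows why: none of X's events are Constructive or above, against five of Y's seven.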
How to capture interaction signals: the standards
You do not invent an interaction-tracking format; you choose a standard and emit to it. The choice determines what "interaction" can ever mean in your product, so make it before you build the player. Four standards matter here, and they are not interchangeable.
xAPI (the Experience API). This is the modern standard for recording learning experiences as simple sentences. An xAPI statement reads actor – verb – object: "Maria answered Question 3," "Maria commented on Lesson 4." Those sentences are written into a Learning Record Store (the LRS, the database that holds xAPI statements). xAPI is the right tool for interaction signals because it can express any verb — answered, commented, asked, noted, interacted — not a fixed list. A precise point most summaries get wrong: the xAPI specification itself reserves only one verb, voided (used to cancel a mistaken statement); every other verb, including answered and interacted, comes from controlled vocabularies maintained by ADL and communities of practice, not from the core spec [4]. That openness is the feature — it lets you model question-asking and discussion that no older standard anticipated.
Here is the whole mechanism in one statement: a learner answering an in-video question, with the result, the response, and the time taken to answer that make it a usable signal:
```json
{
  "actor": { "mbox": "mailto:learner@example.org", "name": "Sample Learner" },
  "verb": {
    "id": "http://adlnet.gov/expapi/verbs/answered",
    "display": { "en-US": "answered" }
  },
  "object": { "id": "https://courses.example.org/video/42#q3" },
  "result": { "success": true, "response": "B", "duration": "PT8S" },
  "context": { "registration": "f1e2d3c4-0000-4a5b-8c9d-000000000001" }
}
```
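On the platform side, a statement like this is usually assembled by a small helper rather than written by hand. The sketch below builds the same shape in Python and adds a statement id and a timestamp, both standard xAPI statement properties. The function name and parameters are hypothetical, and sending the statement to an LRS is deliberately left out.

```python
import json
import uuid
from datetime import datetime, timezone

ANSWERED_VERB = "http://adlnet.gov/expapi/verbs/answered"


def answered_statement(email: str, question_iri: str, response: str,
                       success: bool, seconds: int, registration: str) -> dict:
    """Build an xAPI 'answered' statement as a plain dict (hypothetical helper)."""
    return {
        "id": str(uuid.uuid4()),
        "actor": {"objectType": "Agent", "mbox": f"mailto:{email}"},
        "verb": {"id": ANSWERED_VERB, "display": {"en-US": "answered"}},
        "object": {"objectType": "Activity", "id": question_iri},
        # duration is an ISO 8601 duration, e.g. PT8S for eight seconds.
        "result": {"success": success, "response": response,
                   "duration": f"PT{seconds}S"},
        "context": {"registration": registration},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


statement = answered_statement(
    email="learner@example.org",
    question_iri="https://courses.example.org/video/42#q3",
    response="B",
    success=True,
    seconds=8,
    registration="f1e2d3c4-0000-4a5b-8c9d-000000000001",
)
payload = json.dumps(statement)  # what a client would send to the LRS
```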
SCORM. The older packaging-and-tracking standard, covered in SCORM Explained, can track interactions — its run-time data model has a cmi.interactions collection that records quiz items with a type (choice, fill-in, performance, likert, and a handful more), the learner's response, whether it was correct, and the latency before answering [5]. So SCORM handles the quiz-attempt signal adequately inside an LMS launch. What it cannot do is model the open-ended signals — a free-form note, a learner's typed question, a discussion reply — because its data model is a fixed, bounded set defined for course-internal interactions, not arbitrary learning events [5]. If quiz answers are all you need, SCORM is enough; the moment you want notes, questions, or discussion as first-class signals, you need xAPI.
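For comparison, the same quiz answer in SCORM terms is a handful of data-model elements under one cmi.interactions index. The sketch below only generates the element names and values using SCORM 2004 naming. In a real SCO these pairs would be passed to the LMS run-time API's SetValue call from JavaScript, and SCORM 1.2 uses different names and formats for some fields. The function name is hypothetical.

```python
def scorm_interaction_values(index: int, item_id: str, item_type: str,
                             learner_response: str, correct: bool,
                             latency_seconds: int) -> list[tuple[str, str]]:
    """Return (element, value) pairs for one quiz item, SCORM 2004 naming.

    The id element comes first because the other elements of an
    interaction depend on it being set.
    """
    prefix = f"cmi.interactions.{index}"
    return [
        (f"{prefix}.id", item_id),
        (f"{prefix}.type", item_type),
        (f"{prefix}.learner_response", learner_response),
        (f"{prefix}.result", "correct" if correct else "incorrect"),
        # Latency is a time interval, e.g. PT8S for eight seconds.
        (f"{prefix}.latency", f"PT{latency_seconds}S"),
    ]


pairs = scorm_interaction_values(0, "q3", "choice", "B", True, 8)
```

Every field here is one of the fixed slots the data model defines, which is the limitation the paragraph above describes: there is no slot for a discussion reply or a learner's own question.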
IMS Caliper Analytics. A standard from 1EdTech (formerly IMS Global) built specifically for streams of learning events, and a strong fit for interaction signals because it defines them as named event types. Its Annotation Profile models exactly the annotation behaviours we care about: its AnnotationEvent records bookmarking, highlighting, and tagging as distinct actions [6], while its media events cover video controls. Caliper 1.2 is the current version [6]. Many platforms run xAPI and Caliper side by side — Caliper for the high-volume clickstream, xAPI for the pedagogically meaningful statements.
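A Caliper annotation event has a similar actor, action, object shape, expressed as JSON-LD. The sketch below assembles a bookmark, highlight, or tag event as a plain dict. The property names follow our reading of the Caliper 1.2 Annotation Profile and should be checked against the specification before use. The helper name and all IRIs are hypothetical, and the envelope a Caliper sensor wraps around events is omitted.

```python
import uuid
from datetime import datetime, timezone

CALIPER_CONTEXT = "http://purl.imsglobal.org/ctx/caliper/v1p2"

# Action name -> type of the annotation entity the event generates.
ANNOTATION_TYPES = {
    "Bookmarked": "BookmarkAnnotation",
    "Highlighted": "HighlightAnnotation",
    "Tagged": "TagAnnotation",
}


def annotation_event(actor_iri: str, action: str, video_iri: str,
                     annotation_iri: str) -> dict:
    """Build a Caliper-style AnnotationEvent dict (hypothetical helper)."""
    if action not in ANNOTATION_TYPES:
        raise ValueError(f"unsupported annotation action: {action}")
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "@context": CALIPER_CONTEXT,
        "id": f"urn:uuid:{uuid.uuid4()}",
        "type": "AnnotationEvent",
        "actor": {"id": actor_iri, "type": "Person"},
        "action": action,
        "object": {"id": video_iri, "type": "VideoObject"},
        "generated": {"id": annotation_iri, "type": ANNOTATION_TYPES[action]},
        # UTC timestamp with millisecond precision and a trailing Z.
        "eventTime": now.replace("+00:00", "Z"),
    }


event = annotation_event(
    "https://example.edu/users/1",
    "Bookmarked",
    "https://example.edu/video/42",
    "https://example.edu/video/42/bookmarks/1",
)
```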
The xAPI Video Profile and cmi5. For interactions that happen inside the video timeline, the xAPI Video Profile standardises the player events, and its played-segments data is the primitive behind re-watch detection — the full treatment is in Tracking Video with xAPI. cmi5 wraps xAPI inside a launchable unit so an LMS can start the activity and still receive the rich statements; it is the bridge when you need both LMS launch control and xAPI's open interaction vocabulary.
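To show why played-segments is the primitive behind re-watch detection, the sketch below parses a played-segments string and totals the seconds that were played more than once. It assumes the Video Profile's encoding, in which each segment is written start[.]end and segments are joined with [,]. The function names and the sample value are illustrative.

```python
def parse_played_segments(value: str) -> list[tuple[float, float]]:
    """Parse 'start[.]end[,]start[.]end' into (start, end) pairs in seconds."""
    segments = []
    for part in value.split("[,]"):
        if not part:
            continue
        start, end = part.split("[.]")
        segments.append((float(start), float(end)))
    return segments


def rewatched_seconds(segments: list[tuple[float, float]]) -> float:
    """Total seconds covered by at least two played segments (sweep line)."""
    boundaries = []
    for start, end in segments:
        boundaries.append((start, 1))   # a segment opens
        boundaries.append((end, -1))    # a segment closes
    # At equal times, closes (-1) sort before opens (+1), so segments that
    # merely touch end-to-start are not counted as overlapping.
    boundaries.sort()
    depth = 0
    total = 0.0
    previous = None
    for time, delta in boundaries:
        if previous is not None and depth >= 2:
            total += time - previous
        depth += delta
        previous = time
    return total


segments = parse_played_segments("0[.]120[,]60[.]90[,]120[.]200")
rewatched = rewatched_seconds(segments)
```

In the sample the learner played 0 to 120 s, jumped back to replay 60 to 90 s, then continued from 120 s, so 30 seconds count as re-watched.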
Figure 3. From interaction to signal. Learner interactions become xAPI statements, SCORM interactions, or Caliper events, land in the LRS or LMS, and aggregate into the active-learning signals a dashboard reports.
What each standard can prove about interaction
Before you promise a stakeholder an "active learning" report, check that the standard your player emits to can carry the signal. The richness of what you can report is capped by what you capture.
| Interaction signal | Raw player / VOD | SCORM 1.2 / 2004 | xAPI (+ Video Profile) | IMS Caliper 1.2 |
|---|---|---|---|---|
| Watch-time / re-watch | Yes (anonymous) | No (session time only) | Yes (played-segments) | Yes (media events) |
| Quiz / question attempt + result | No | Yes (cmi.interactions) | Yes (answered) | Yes (assessment events) |
| Note / highlight / bookmark | No | No | Yes (noted / custom verb) | Yes (AnnotationEvent) |
| Learner-asked question | No | No | Yes (asked / custom verb) | Partial (custom) |
| Discussion / peer reply | No | No | Yes (commented / replied) | Yes (thread events) |
| Attributed to a named learner | No | Yes (LMS session) | Yes (actor) | Yes (actor) |
| Joinable to completion and score | No | Partial (one SCO) | Yes (any activity) | Yes (any resource) |
The reading is blunt. A raw player gives you watch-time and nothing attributed. SCORM adds quiz interactions tied to a learner but stops there. Only xAPI and Caliper carry the full ladder — notes, questions, and discussion — attributed to a named learner and joinable to their outcomes. Choosing the standard is choosing how high up the ICAP ladder your analytics can ever reach. The full standards comparison lives in SCORM vs xAPI vs cmi5 vs LTI.
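The comparison can also be kept as data, so a planned report can be checked against what each option is able to carry. The sketch below encodes the table as sets. The key names are hypothetical, and the entries marked Partial in the table are deliberately treated as unsupported, which is a conservative assumption.

```python
# Signals each option can carry, mirroring the table ("Partial" counted as no).
CAPABILITIES = {
    "raw_player": {"watch_time"},
    "scorm": {"quiz_attempt", "named_learner"},
    "xapi": {"watch_time", "quiz_attempt", "annotation", "learner_question",
             "discussion", "named_learner", "outcome_join"},
    "caliper": {"watch_time", "quiz_attempt", "annotation", "discussion",
                "named_learner", "outcome_join"},
}


def standards_for(required: set[str]) -> list[str]:
    """Return the options that can carry every required signal, sorted by name."""
    return sorted(name for name, caps in CAPABILITIES.items() if required <= caps)


quiz_only = standards_for({"quiz_attempt", "named_learner"})
full_ladder = standards_for({"annotation", "discussion", "outcome_join"})
asked_questions = standards_for({"learner_question"})
```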
Turning signals into action
Signals earn their keep when they change something. Map each pattern to a decision and the analytics become a product instrument, not a wall of charts.
High watch-time but near-zero Constructive signals means learners are watching passively. The fix is to add the interaction that lifts them up the ladder: an in-video question every few minutes, the technique in In-Player Quizzes and Polls, justified directly by the 90%-versus-68% testing result [2]. A cluster of questions or wrong answers at one timestamp is a content defect — open the video there and re-cut, the same editing instruction a drop-off curve gives. Re-watch plus a wrong answer on the same segment is a much louder confusion signal than either alone — prioritise that segment [10]. Lots of verbatim notes but few in-their-own-words notes says learners are copying, not constructing — prompt for rephrasing or a one-line summary. No discussion activity on a cohort course means the top rung is empty; add a prompt, a peer-review step, or a cohort thread.
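The pattern-to-decision mapping above can be sketched as a small rule function over per-lesson aggregates. Every field name and threshold here is a hypothetical placeholder (for example, 0.8 average watch fraction as "high", five events as "a cluster"); the point is the shape of the rules, not the numbers.

```python
from dataclasses import dataclass


@dataclass
class LessonStats:
    avg_watch_fraction: float   # 0.0 to 1.0
    constructive_events: int    # own-words notes, questions asked, written answers
    segment_rewatches: int      # re-watches of the most replayed segment
    segment_wrong_answers: int  # wrong answers on that same segment
    verbatim_notes: int
    own_words_notes: int
    discussion_replies: int


def recommend(stats: LessonStats) -> list[str]:
    """Map signal patterns to the content or product actions described above."""
    actions = []
    if stats.avg_watch_fraction >= 0.8 and stats.constructive_events == 0:
        actions.append("add in-video questions")
    if stats.segment_rewatches >= 5 and stats.segment_wrong_answers >= 5:
        actions.append("prioritise re-cutting the confusing segment")
    if stats.verbatim_notes > 3 * stats.own_words_notes:
        actions.append("prompt for notes in the learner's own words")
    if stats.discussion_replies == 0:
        actions.append("add a discussion prompt or peer-review step")
    return actions


passive_lesson = LessonStats(0.95, 0, 12, 9, 20, 2, 0)
healthy_lesson = LessonStats(0.50, 4, 1, 0, 2, 5, 7)
passive_actions = recommend(passive_lesson)
healthy_actions = recommend(healthy_lesson)
```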
One lever cuts across all of these: design for the signal you want. If you only build a play button, you can only measure watching. If you build a question, a notes field, and a discussion space, you can measure — and therefore encourage — the behaviours that the ICAP ladder says actually teach [1]. The pedagogy behind chunking video so these interactions fit is in The Pedagogy of Video.
Figure 4. Why an interaction beats passive review. In Szpunar et al. (2013), learners tested during an online lecture scored 90% on the final test, versus 76% for re-studying and 68% for no testing.
Common mistakes
Reporting watch-time as proof of learning. It proves playback. Pair it with at least one Constructive signal — a question answered, a note written — before claiming engagement [1][2].
Summing interaction frequency flat. Ten reflex clicks are not three self-explanations. Weight signals by ICAP mode and report the composition, not just the count [1].
Treating every quiz the same. Recognising one of four options is Active; writing a short explanation pushes toward Constructive. A "harder" question is not pedantry — it moves the learner up the ladder [1].
Trying to get open-ended signals out of SCORM. SCORM's cmi.interactions handles quiz items, not free-form notes, questions, or discussion [5]. If those are on the roadmap, instrument xAPI or Caliper from day one — retrofitting means re-shipping the player.
Turning interaction counts into grades without a quality check. The fastest way to teach learners to click for credit. Keep raw interaction frequency out of the gradebook unless it is quality-weighted.
Forgetting that interaction data is personal data. Every note, question, and answer attributed to a named learner is personal data under regimes such as the EU's GDPR, and a student record under US FERPA where it applies. Collect what you will use, disclose it, and set retention — the same consent discipline raised for attention tracking in Engagement Heatmaps and Attention Analytics.
Where Fora Soft fits in
Fora Soft has built video streaming, real-time WebRTC, and interactive-player software since 2005, and in e-learning the active-learning work is almost always an instrumentation problem before it is a dashboard problem. The build-vs-buy trade-off is usually this: a hosted video platform hands you watch-time for free, but the moment you want to answer "are learners actually doing the work?" you are building an interactive player that emits xAPI or Caliper events (in-video answers, notes, learner-asked questions, discussion replies) attributed to a named learner and joined to their outcomes. We help teams decide which signals genuinely justify that custom layer, weight them honestly by depth rather than count, and wire the events so an active-learning report rests on evidence. The same real-time and interactive-video foundations show up across our conferencing, OTT, and telemedicine work.
What to read next
- Video Engagement: Watch-Time, Drop-Off, and Re-Watch
- Learning Analytics: The Metrics That Matter and the Ones That Mislead
- Tracking Video with xAPI: The Video Profile and What to Capture
Call to action
- Talk to an e-learning engineer: book a 30-minute scoping call to talk through your interaction-signal instrumentation plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Active-Learning Signals Instrumentation Checklist: a one-page worksheet that runs from mapping interaction signals to the ICAP ladder, through instrumenting the events with xAPI/SCORM/Caliper, to measuring quality over raw frequency and turning signals into action.
References
- Chi, M. T. H., & Wylie, R. (2014). The ICAP Framework: Linking Cognitive Engagement to Active Learning Outcomes. Educational Psychologist 49(4), 219–243. https://www.tandfonline.com/doi/abs/10.1080/00461520.2014.965823 — Tier 5 (peer-reviewed). Defines four modes of engagement by overt behaviour — Passive (receiving), Active (manipulating), Constructive (generating), Interactive (dialoguing) — and the ICAP hypothesis that learning rises I > C > A > P.
- Szpunar, K. K., Khan, N. Y., & Schacter, D. L. (2013). Interpolated memory tests reduce mind wandering and improve learning of online lectures. PNAS 110(16), 6313–6317. https://www.pnas.org/doi/10.1073/pnas.1221764110 — Tier 5 (peer-reviewed). Learners tested during an online lecture scored 90% on the final test vs 76% (restudy) and 68% (not tested); interpolated testing also reduced mind-wandering and increased note-taking.
- Brame, C. J. (2016). Effective Educational Videos: Principles and Guidelines for Maximizing Student Learning from Video Content. CBE—Life Sciences Education 15(4):es6. https://www.lifescied.org/doi/10.1187/cbe.16-03-0125 — Tier 5 (peer-reviewed). Interpolated questions and other active-learning prompts convert passive watching into engagement and manage cognitive load.
- ADL Initiative. Experience API (xAPI) Specification v1.0.3 — Part 2: Statements (actor-verb-object, the reserved voided verb, result, context.registration). https://github.com/adlnet/xAPI-Spec — Tier 1 (primary standard). xAPI reserves only the voided verb; all other verbs (e.g., answered, interacted) come from controlled vocabularies, letting a player record any interaction as a statement.
- ADL Initiative. SCORM 2004 4th Edition — Run-Time Environment (cmi.interactions: type, learner_response, result, latency). https://adlnet.gov/projects/scorm/ — Tier 1 (primary standard). SCORM records a fixed, bounded interactions set (quiz item types, response, result, latency) inside an LMS launch; it has no model for free-form notes, learner questions, or discussion.
- IMS Global / 1EdTech. Caliper Analytics 1.2 — Annotation Profile (AnnotationEvent: bookmarked, highlighted, tagged) and media events. https://www.imsglobal.org/spec/caliper/v1p2 — Tier 1 (primary standard). Caliper models learning interactions as named event types; the Annotation Profile captures bookmarking, highlighting, and tagging as distinct learner actions.
- ADL / xAPI Video Community Profile. xAPI Video Profile v1.0 — Statement Data Model (played-segments, played/paused/seeked verbs). https://github.com/adlnet/xapi-authored-profiles/tree/master/video — Tier 1 (primary profile). Standardises in-video player events; played-segments is the primitive behind re-watch detection that contextualises in-video interaction.
- ADL Initiative. cmi5 Specification — xAPI inside a launchable Assignable Unit (AU) with masteryScore. https://github.com/AICC/CMI-5_Spec_Current — Tier 1 (primary standard). cmi5 wraps xAPI in an LMS-launchable unit, combining LMS launch control with xAPI's open interaction vocabulary.
- Cohen, A. et al. / ILTA. Using learning analytics to improve online formative quiz engagement. Irish Journal of Technology Enhanced Learning. https://journal.ilta.ie/index.php/telji/article/view/25 — Tier 5 (peer-reviewed). Voluntary formative-quiz attempts and retakes correlate with module engagement and student success more reliably than raw time-on-platform.
- Watershed Systems. How do I author video xAPI statements? https://support.watershedlrs.com/hc/en-us/articles/360022749332 — Tier 4 (LRS vendor / practitioner). Frequent pausing-and-rewinding of a segment, especially near a missed question, often indicates content that is hard to understand and needs rework.
- Guo, P. J., Kim, J., & Rubin, R. (2014). How Video Production Affects Student Engagement: An Empirical Study of MOOC Videos. ACM Learning @ Scale. https://up.csail.mit.edu/other-pubs/las2014-pguo-engagement.pdf — Tier 5 (peer-reviewed). Across 6.9M edX sessions, shorter, single-idea videos held attention better — the chunking that leaves room for interaction.
Where sources disagreed, the learning-science evidence and the official specifications were followed. Many vendor articles treat "engagement" as a single number dominated by watch-time and total interaction count; this article follows the ICAP framework [1] in ranking interactions by depth and the controlled-vocabulary design of the xAPI spec [4] in distinguishing what each standard can actually capture. The common claim that "xAPI has verbs like answered" is corrected to the precise position: the spec reserves only voided, and answered is ADL-vocabulary, not core [4].


