Building a Learning Analytics Pipeline: xAPI to Warehouse

This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.

Why this matters

If you are building a learning product or owning its roadmap, the analytics pipeline is the difference between "we think the course works" and "Module 3 loses 60% of learners at the seven-minute mark, here is the fix." Most teams buy a learning system, accept its canned reports, and discover a year later that the one number they need to prove training worked was never captured — and cannot be recovered. This article is the build guide: it shows the five layers, the schema and standards that carry the data, the sizing math that keeps storage sane, and the privacy controls that keep learner data lawful. It is written so a non-technical product owner can scope the work and brief engineers, and so the engineers reading over their shoulder find the spec citations accurate. It is the deep build companion to Learning Analytics: The Metrics That Matter, which covers which metrics to chase; this one covers how to make them exist.

The pipeline in one picture

Before any code, hold the whole system in your head. A learning-analytics pipeline is the path a single fact — "a learner watched seconds 0 to 95 of the Module 3 video, then jumped back to second 70" — travels from the moment it happens to the moment it changes a decision. That path has five layers, and naming them is the first act of building well, because each layer has a different owner, a different failure mode, and a different cost.

Capture is the player emitting an event the instant something happens. Transport is getting that event off the device and into your systems reliably, in batches, without flooding the network. Store is the Learning Record Store — the system of record that receives and keeps every event. Warehouse is where raw events are reshaped and joined to the rest of the business so they can be queried fast. Serve is the dashboard, report, or API that turns a stored fact into something a human acts on.

[Image: a five-layer learning-analytics pipeline build diagram showing capture, transport, store in an LRS, warehouse, and serve, with the standard labelled on each link.]
Figure 1. The five layers of a learning-analytics pipeline. The standard that carries the data is labelled on each arrow; the store and warehouse layers (tinted) are where learner data lives and where privacy controls bite.

We sketched this map at a high level in Learning Analytics: The Metrics That Matter. The rest of this article builds each layer for real — what to put in it, what it costs, and where it breaks.

Decide before you build: do you even need a custom pipeline?

The honest first question is build-vs-buy, and most teams over-build. An off-the-shelf learning-management system — the platform that hosts courses and learners, abbreviated LMS — ships with reporting that is free and good enough for compliance: who completed what, and when. If your only job is to prove that 800 employees finished the safety onboarding course, you do not need a custom analytics pipeline; you need to configure the LMS you already have.

You need a custom pipeline when three things are true at once: you have video or interactive content whose per-second behaviour you must understand, the questions you will ask are not on any vendor's menu (custom mastery rules, drop-off heatmaps, learning data joined to a sales number), and the answers will change product or content decisions often enough to pay for the engineering. If any one of those is false, buy. If all three hold, the rest of this article is your blueprint. We work the dollar side of this trade-off in The Learning-Platform Cost Model; the rule of thumb is that the pipeline is justified by the decisions it unlocks, not by the dashboard it produces.

[Image: a build-versus-buy decision tree for a learning-analytics pipeline, routing from off-the-shelf LMS reporting to a custom xAPI pipeline based on video depth, custom questions, and decision frequency.]
Figure 2. When a custom pipeline pays for itself. Three conditions must hold together; if any fails, configure an off-the-shelf system instead.

Layer 1 — Capture: design the event schema first

Here is the rule that governs everything downstream: your analytics are capped at capture. If the player never emits "the learner jumped back ten seconds," no warehouse, model, or dashboard can ever show you re-watch behaviour. The data you do not capture this quarter is gone permanently. So the cheapest, highest-impact moment in the whole project is before the player is built, when you decide what it will say.

What it says is governed by a standard — a shared grammar so any system can read the events. For learning, the dominant grammar is xAPI, the Experience API (its old project name was "Tin Can API"; the standard is xAPI). An xAPI event is a statement shaped like a plain sentence: an actor, a verb, and an object — "Maria — completed — Module 3." That three-part core is required by the specification [1]. The standard is now ratified as IEEE 9274.1.1-2023, informally called xAPI 2.0, published in October 2023 [2]. We explain the standard end to end in xAPI (Tin Can) Explained; here we use it to build.

A statement carries more than the sentence. Two optional sections do the analytic heavy lifting. The result holds what came of the action — a score, a completion flag, a duration, and for video the precise extensions we need. The context holds the circumstances — which course, which attempt, which session. A minimal video statement, with obviously fake data, looks like this:

{
  "actor": {
    "objectType": "Agent",
    "account": { "homePage": "https://lms.example.edu", "name": "learner-8a3f" }
  },
  "verb": {
    "id": "https://w3id.org/xapi/video/verbs/seeked",
    "display": { "en-US": "seeked" }
  },
  "object": {
    "id": "https://courses.example.edu/modules/3/video",
    "objectType": "Activity"
  },
  "result": {
    "extensions": {
      "https://w3id.org/xapi/video/extensions/time-from": 95.0,
      "https://w3id.org/xapi/video/extensions/time-to": 70.0
    }
  },
  "context": {
    "registration": "8f14e545-...",
    "extensions": {
      "https://w3id.org/xapi/video/extensions/session-id": "sess-44b"
    }
  },
  "timestamp": "2026-06-21T10:15:32.000Z"
}

Notice three deliberate choices. First, the actor is identified by an account, not an email. xAPI lets you name a learner by their email address (mbox), a hashed email, an OpenID, or an opaque account on a system [1]. Using an opaque account name like learner-8a3f instead of maria@company.com is your first and cheapest privacy control — more on that in Layer 5. Second, the verb and the video extensions are URLs from the xAPI Video Profile, the community profile that defines the exact vocabulary a player should emit for video: the verbs initialized, played, paused, seeked, completed, terminated, and the result extensions that matter most, progress (how far through, 0 to 1) and played-segments (the list of intervals actually watched) [3]. Third, the registration ties every statement from one attempt together; the Video Profile aggregates progress, played-segments, and completion by that registration value [3]. Build the player to set it once per attempt and you get clean per-attempt analytics for free.
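The three choices above can be sketched as a small statement builder. This is a minimal illustration, not a player implementation: the function names `new_registration` and `seeked_statement` are hypothetical, the example values are the fake ones from the statement above, and the IRIs are the Video Profile ones already shown.

```python
import uuid
from datetime import datetime, timezone

VIDEO = "https://w3id.org/xapi/video"  # Video Profile IRI prefix used in the sample statement


def new_registration() -> str:
    """Create one UUID per attempt; every statement in that attempt reuses it."""
    return str(uuid.uuid4())


def seeked_statement(learner_id, home_page, activity_id, registration,
                     session_id, time_from, time_to):
    """Build a 'seeked' statement with an opaque account actor (no email)."""
    return {
        "id": str(uuid.uuid4()),  # assigned on the device, unique per statement
        "actor": {
            "objectType": "Agent",
            "account": {"homePage": home_page, "name": learner_id},
        },
        "verb": {"id": f"{VIDEO}/verbs/seeked", "display": {"en-US": "seeked"}},
        "object": {"id": activity_id, "objectType": "Activity"},
        "result": {
            "extensions": {
                f"{VIDEO}/extensions/time-from": float(time_from),
                f"{VIDEO}/extensions/time-to": float(time_to),
            }
        },
        "context": {
            "registration": registration,
            "extensions": {f"{VIDEO}/extensions/session-id": session_id},
        },
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Usage: set the registration once when the attempt starts, then reuse it.
registration = new_registration()
first = seeked_statement("learner-8a3f", "https://lms.example.edu",
                         "https://courses.example.edu/modules/3/video",
                         registration, "sess-44b", 95, 70)
second = seeked_statement("learner-8a3f", "https://lms.example.edu",
                          "https://courses.example.edu/modules/3/video",
                          registration, "sess-44b", 120, 60)
```

Both statements share one registration but carry different statement IDs, which is what lets per-attempt aggregation work without extra bookkeeping.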

[Image: anatomy of an xAPI video statement, breaking out the required actor, verb, and object and the optional result and context, with the Video Profile played-segments and progress extensions highlighted.]
Figure 3. The anatomy of an xAPI statement. The actor-verb-object core is required; the result and context carry the video-specific data (played-segments and progress) that drop-off and re-watch analytics depend on.

Two design disciplines separate a schema you can live with from one you regret. Use stable, resolvable identifiers. Every activity (object.id) and every verb is a URL (technically an IRI), and once statements are stored against an ID, that ID is effectively permanent. Decide your URL scheme — https://courses.example.edu/modules/3/video — and a verb list up front, write it down as a profile, and never let two teams mint two IDs for the same thing. And separate analytics events from resume state. Where the learner left off — the "resume at 4:32" bookmark — is not an analytics event; it is state. The LRS has a dedicated State Resource (the activities/state endpoint) for arbitrary key/value documents like resume position and player preferences [4]. Putting bookmarks there keeps your statement stream clean and your analytics queries fast. The video-specific capture details get their own deep dive in Tracking Video with xAPI.
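To make the split between events and state concrete, here is a hedged sketch of how a resume bookmark might be addressed on the State Resource. It only assembles the request; it does not send it, and it omits authentication. The LRS base URL, the `resume-position` state ID, and the function name `state_request` are assumptions for illustration. The query parameters (`activityId`, `agent`, `registration`, `stateId`) and the version header follow the xAPI 1.0.3 communication spec as we understand it.

```python
import json
from urllib.parse import urlencode


def state_request(lrs_base, activity_id, learner_id, home_page,
                  registration, state_id, document):
    """Describe a PUT to activities/state that stores one state document."""
    agent = {
        "objectType": "Agent",
        "account": {"homePage": home_page, "name": learner_id},
    }
    query = urlencode({
        "activityId": activity_id,
        "agent": json.dumps(agent, separators=(",", ":")),  # agent travels as JSON text
        "registration": registration,
        "stateId": state_id,
    })
    return {
        "method": "PUT",
        "url": f"{lrs_base.rstrip('/')}/activities/state?{query}",
        "headers": {
            "Content-Type": "application/json",
            "X-Experience-API-Version": "1.0.3",
        },
        "body": json.dumps(document),
    }


# Usage: the "resume at 4:32" bookmark is a document, not a statement.
request = state_request(
    "https://lrs.example.edu/xapi",                # hypothetical LRS endpoint
    "https://courses.example.edu/modules/3/video",
    "learner-8a3f",
    "https://lms.example.edu",
    "3f2b8c1e-5d4a-4e6f-9a7b-1c2d3e4f5a6b",        # made-up registration UUID
    "resume-position",                             # hypothetical state ID
    {"seconds": 272},
)
```

Nothing here touches the statement stream, so a learner who pauses and resumes fifty times adds nothing to the analytics tables.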

A word on the alternative. Caliper Analytics, from the standards body 1EdTech, is xAPI's sibling: it defines a MediaEvent for video interactions stamped with the playback time, wrapped in an Envelope that carries a sensor identifier, a send time, a data version, and the batch of events [5]. The standards community frames xAPI and Caliper as "horses for courses," not rivals; Caliper is more prescriptive and common in higher-education systems, xAPI more flexible and common in corporate L&D. The build that follows applies to both — only the JSON dialect changes.

Layer 2 — Transport: batch, sample, and survive the network

Once the player can speak, it must get its sentences to the store without two failures: flooding the network with too many tiny requests, and losing events when the connection drops. Two techniques solve both.

Batch. Do not send one HTTP request per event. The xAPI standard requires an LRS to accept a whole array of statements in a single POST [4]; collect events in a small in-memory queue and flush, say, every few seconds or every twenty events. One nuance to design around: a statement batch is atomic — the specification says the LRS must reject the entire batch if any single statement in it is invalid [4]. So validate before you flush, and do not mix throwaway and critical events in the same batch if a malformed throwaway one could take a completion down with it. Caliper batches the same way: multiple events ride in one Envelope's data array [5].
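A minimal sketch of that batching rule, assuming a hypothetical `send` callable that POSTs one array to the LRS. The validation shown is only a cheap presence check on the three required parts, not full spec validation, and the timer that would call `flush()` every few seconds is left out.

```python
from typing import Callable, List

REQUIRED_PARTS = ("actor", "verb", "object")


def is_valid(statement: dict) -> bool:
    """Cheap pre-flight check: the three required parts are present and non-empty."""
    return all(statement.get(part) for part in REQUIRED_PARTS)


class StatementQueue:
    """Collects statements in memory and flushes them as one batch."""

    def __init__(self, send: Callable[[List[dict]], None], max_size: int = 20):
        self.send = send          # hypothetical transport: one POST per batch
        self.max_size = max_size
        self.pending: List[dict] = []
        self.rejected: List[dict] = []

    def add(self, statement: dict) -> None:
        if not is_valid(statement):
            # Keep a malformed statement out so it cannot sink the whole batch.
            self.rejected.append(statement)
            return
        self.pending.append(statement)
        if len(self.pending) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        try:
            self.send(batch)
        except Exception:
            # Put the batch back at the front so a later flush can retry it.
            self.pending = batch + self.pending
            raise
```

A fuller version might keep two queues, one for critical events such as completions and one for heartbeat pings, so that the two never share a batch.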

Sample — this is where most pipelines are won or lost. A naive player fires a "time update" event several times a second. Capture all of them and the volume is staggering. Walk the arithmetic for one 14-minute module and 2,000 learners.

Naive capture, every time-update (~4 per second):
14 min × 60 = 840 seconds; 840 × 4 = 3,360 events per learner per module
3,360 × 2,000 learners = 6,720,000 statements for one module
At ~1.8 KB per statement [6]: 6,720,000 × 1.8 KB ≈ 12 GB for one module
A five-module course: ≈ 60 GB per cohort, of mostly worthless telemetry

Now the sampled design. Emit the action verbs only when the learner acts (played, paused, seeked), add a progress ping every 10 seconds, and send one played-segments summary on pause and on completion.

Sampled capture:
840 s ÷ 10 = 84 progress pings, plus ~15 action events ≈ 100 statements per learner per module
100 × 2,000 learners = 200,000 statements
200,000 × 1.8 KB ≈ 0.36 GB per module
A five-module course: ≈ 1.8 GB per cohort, about 33× smaller
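The same arithmetic as a reusable sketch, so you can plug in your own module length, cohort size, and heartbeat interval. The function names are hypothetical, the 1.8 KB figure is the vendor estimate from [6], and gigabytes are decimal (1 GB = 1,000,000 KB) to match the rounding above.

```python
KB_PER_STATEMENT = 1.8   # order-of-magnitude vendor estimate [6]
KB_PER_GB = 1_000_000    # decimal units, matching the article's rounding


def statements_per_learner(duration_s, updates_per_s=0.0,
                           heartbeat_s=None, action_events=0):
    """Count the statements one learner emits for one module."""
    count = duration_s * updates_per_s + action_events
    if heartbeat_s:
        count += duration_s // heartbeat_s  # one progress ping per interval
    return int(count)


def storage_gb(per_learner, learners, modules=1):
    """Approximate raw statement storage for a cohort, in GB."""
    return per_learner * learners * modules * KB_PER_STATEMENT / KB_PER_GB


naive = statements_per_learner(840, updates_per_s=4)
sampled = statements_per_learner(840, heartbeat_s=10, action_events=15)

naive_gb = storage_gb(naive, learners=2000, modules=5)
sampled_gb = storage_gb(sampled, learners=2000, modules=5)
```

With these inputs the sampled count comes out at 99 statements rather than the rounded 100 used above, so the ratio lands nearer 34× than 33×; the conclusion is unchanged.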

The sampled stream answers every real question — the drop-off curve comes from played-segments, re-watch from seeked, completion from progress — at a thirtieth of the storage and network cost. The lesson is not "capture less"; it is "capture the right resolution." Per-frame is noise; per-action-plus-a-heartbeat is signal.

[Image: a sampling diagram contrasting a dense raw stream of per-frame time-update events with a sampled stream of action verbs plus a periodic progress heartbeat, showing the storage reduction.]
Figure 4. Sampling at the transport layer. Raw per-frame telemetry (left) is 30-plus times larger than a sampled stream of action verbs plus a progress heartbeat (right), and answers no question better.

Survive the network. Learners go through tunnels and into elevators. Queue events locally (in the browser or app) and flush when the connection returns. This is safe because of a property the standard guarantees, provided the player assigns each statement's unique ID itself rather than leaving the LRS to generate one: once the LRS has stored a statement it must not modify it, and a later statement that reuses an existing ID must not change what is stored [4]. That immutability makes retries idempotent: replay your queue after a dropped connection and duplicates are ignored or refused rather than double-counted. Check how your chosen LRS answers a duplicate inside a batch, because a conflict response can reject the whole batch. For fully offline learning, statements are held and synced later; we cover that case in depth in the accessibility-and-scale block.
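The retry behaviour can be illustrated with an in-memory stand-in for the LRS. `FakeLRS` and `OfflineQueue` are hypothetical names, and the fake silently keeps the original when an ID repeats; a real LRS may answer a duplicate differently, so treat this as a sketch of the idea rather than of any product's responses. The key assumption is that the device assigns the statement ID, so a replay resends the same ID.

```python
import uuid


class FakeLRS:
    """In-memory stand-in for an LRS: stores by statement id, never overwrites."""

    def __init__(self):
        self.stored = {}

    def post(self, statements):
        for statement in statements:
            # An id that already exists keeps its original statement.
            self.stored.setdefault(statement["id"], statement)


def with_client_id(statement):
    """Assign the id on the device so that a retry resends the same id."""
    return {**statement, "id": statement.get("id") or str(uuid.uuid4())}


class OfflineQueue:
    """Holds statements locally until the connection returns."""

    def __init__(self, lrs):
        self.lrs = lrs
        self.pending = []

    def record(self, statement):
        self.pending.append(with_client_id(statement))

    def sync(self, online=True):
        """Send everything pending; return how many statements were sent."""
        if not online or not self.pending:
            return 0
        self.lrs.post(self.pending)
        sent, self.pending = len(self.pending), []
        return sent


# Usage: the first send reaches the LRS but the acknowledgement is lost,
# so the device replays the same queue when the connection returns.
lrs = FakeLRS()
queue = OfflineQueue(lrs)
queue.record({"verb": {"id": "https://w3id.org/xapi/video/verbs/played"}})
queue.record({"verb": {"id": "https://w3id.org/xapi/video/verbs/paused"}})
lrs.post(list(queue.pending))   # delivered, but the device never hears back
replayed = queue.sync()         # replay after reconnecting
```

After the replay the store still holds two statements, not four, because the IDs were fixed before the first send.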

Layer 3 — Store: choosing or building the LRS

The Learning Record Store (LRS) is the system of record — the notebook every xAPI sentence is written into. It is a defined system: a RESTful web service that receives, stores, and returns statements, plus the State, Agent Profile, and Activity Profile resources for documents [4]. That definition is exactly what IEEE 9274.1.1-2023 standardised [2]. An LRS is not your analytics database; it is the durable, append-only ledger that feeds it. Statements in an LRS are immutable by design [4] — which is a feature for trust and audit, and a constraint you must plan around for deletion (Layer 5).

Build-vs-buy applies hard here, and buying is usually right. Writing a conformant LRS means implementing the full statement, state, and document API, the query and pagination rules, and passing the conformance test suite — months of work to reproduce something mature products already do. Build your own only at unusual scale or with an integration no product supports. Otherwise pick one; the realistic options in 2026, with the standards each carries, are below. (A learning-tool comparison table should always show standards support — here it is the point of the table.)

LRS option	Model	xAPI	Caliper	SCORM data	Best fit
Learning Locker (Learning Pool)	Open source + commercial cloud	Yes	Limited	Via wrapper	Self-host, full data control, most-deployed open source
SQL LRS (Yet Analytics)	Open source	Yes	No	No	Teams pairing the LRS with a SQL warehouse
Veracity Learning	Commercial, free tier	Yes (incl. 2.0)	No	No	Small projects; xAPI attachments and signatures
Watershed	Commercial	Yes	Yes	Imports	Strong built-in analytics and visualisation
Build your own	Custom	Implement spec	Optional	n/a	Unusual scale or integration only

Open-source options (Learning Locker, SQL LRS) trade a hosting and maintenance burden for full control and no per-record fee; commercial options (Watershed, Veracity) trade a subscription for managed scale and, in Watershed's case, a strong analytics layer you might otherwise build in Layer 5. The SCORM column matters because legacy courses may still emit the older SCORM standard, which records only a fixed data model — completion, score, time — inside the LMS and treats the course as a black box [7]. SCORM cannot carry per-second video data; if your roadmap promises a heatmap, you need the xAPI Video Profile or Caliper, decided here at the store layer. We compare the standards head to head in SCORM vs xAPI vs cmi5 vs LTI.

Layer 4 — Warehouse: model the data so it answers fast

An LRS is excellent at one job — reliably storing and returning statements — and poor at another: answering "show me median drop-off by module, by cohort, for last quarter, joined to each learner's department." Statements are deeply nested JSON, one row per event; analytics wants flat, indexed tables. So a real pipeline copies data out of the LRS into a data warehouse, a database shaped for fast queries across large, joined datasets.

The modern pattern is ELT — extract, load, transform — not the older ETL. You extract statements from the LRS (most expose a query API or a periodic export), load the raw JSON into the warehouse untouched, then transform it inside the warehouse into clean tables [8]. Loading raw first means you never lose fidelity and can re-transform when a new question appears — which it always does.

The transform reshapes nested statements into a star schema: one central fact table of events surrounded by dimension tables you join to it. A learning warehouse typically has a fact_learning_event table (one row per statement: who, what verb, which activity, score, progress, timestamp, registration) joined to dim_learner, dim_activity, dim_course, and dim_cohort. This is the layer where business questions become answerable, because here — and only here — learning data sits next to HR, CRM, or operations data. Joining a completion to a sales figure or a safety-incident rate happens in the warehouse, not the LRS. That join is what lets training prove a business outcome rather than just an activity.
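As a rough illustration of that reshaping, the sketch below flattens one nested statement into one fact_learning_event row. In a real ELT setup this transform would usually be SQL running inside the warehouse; Python is used here only to show the mapping. The column names and the function `to_fact_row` are assumptions, and the progress IRI is the Video Profile extension named earlier.

```python
PROGRESS = "https://w3id.org/xapi/video/extensions/progress"


def to_fact_row(statement: dict) -> dict:
    """Flatten one xAPI statement into one row of fact_learning_event."""
    actor = statement.get("actor", {})
    result = statement.get("result", {})
    context = statement.get("context", {})
    verb_id = statement.get("verb", {}).get("id", "")
    return {
        "statement_id": statement.get("id"),
        "learner_key": actor.get("account", {}).get("name"),  # joins to dim_learner
        "verb": verb_id.rsplit("/", 1)[-1],                    # last path segment of the IRI
        "activity_id": statement.get("object", {}).get("id"),  # joins to dim_activity
        "registration": context.get("registration"),
        "score_scaled": result.get("score", {}).get("scaled"),
        "progress": result.get("extensions", {}).get(PROGRESS),
        "event_time": statement.get("timestamp"),
    }


# Usage with a cut-down version of the sample statement from Layer 1.
row = to_fact_row({
    "actor": {"account": {"homePage": "https://lms.example.edu",
                          "name": "learner-8a3f"}},
    "verb": {"id": "https://w3id.org/xapi/video/verbs/seeked"},
    "object": {"id": "https://courses.example.edu/modules/3/video"},
    "timestamp": "2026-06-21T10:15:32.000Z",
})
```

Missing optional parts become nulls rather than errors, which matters because most statements carry no score and many carry no progress.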

[Image: an ELT and star-schema diagram showing raw xAPI statements loaded from the LRS into a warehouse, then transformed into a central fact_learning_event table joined to learner, activity, course, and cohort dimensions.]
Figure 5. The warehouse layer. Raw statements are loaded as-is, then transformed into a star schema (a central event fact table joined to learner, activity, course, and cohort dimensions) and joined to business data for outcome analytics.

Two practical moves keep the warehouse fast and cheap. Pre-aggregate the heavy stuff. Per-second played-segments are huge; compute the drop-off curve once per video per cohort into a small rollup table, and dashboards read the rollup, not the raw segments. Keep raw and modelled separate. Raw loaded statements are your immutable source of truth; modelled tables are disposable and rebuildable. When you fix a transform bug, you rebuild the modelled layer from raw — you never patch numbers by hand.
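Here is a hedged sketch of such a rollup. It assumes played-segments arrives as a string like `0[.]95[,]70[.]120` (start and end joined by `[.]`, segments joined by `[,]`), which is our reading of the Video Profile; check your player's actual output. It also assumes one final played-segments value per learner, and it counts a bucket as watched if any part of it was played. All function names are hypothetical.

```python
from collections import Counter


def parse_segments(played: str):
    """Turn '0[.]95[,]70[.]120' into [(0.0, 95.0), (70.0, 120.0)]."""
    segments = []
    for part in played.split("[,]"):
        if part:
            start, end = part.split("[.]")
            segments.append((float(start), float(end)))
    return segments


def buckets_watched(played: str, bucket_s: int = 10):
    """Return the set of bucket indexes the learner touched at least once."""
    seen = set()
    for start, end in parse_segments(played):
        if end <= start:
            continue  # ignore empty or reversed segments
        bucket = int(start // bucket_s)
        while bucket * bucket_s < end:
            seen.add(bucket)
            bucket += 1
    return seen


def dropoff_rollup(played_by_learner: dict, duration_s: int, bucket_s: int = 10):
    """One small row per bucket: the share of learners who watched it."""
    counts = Counter()
    for played in played_by_learner.values():
        counts.update(buckets_watched(played, bucket_s))
    learners = len(played_by_learner)
    total_buckets = -(-duration_s // bucket_s)  # ceiling division
    return [
        {
            "bucket_start_s": bucket * bucket_s,
            "share_watched": counts[bucket] / learners if learners else 0.0,
        }
        for bucket in range(total_buckets)
    ]


# Usage: three learners on a 30-second clip.
rollup = dropoff_rollup(
    {"learner-a": "0[.]30", "learner-b": "0[.]12[,]20[.]25", "learner-c": "0[.]5"},
    duration_s=30,
)
```

The rollup for a 14-minute video at 10-second buckets is 84 rows per cohort, which a dashboard can read instantly regardless of how many raw segments sit underneath.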

Layer 5 — Serve, and the privacy that wraps the whole pipeline

The final layer turns a stored fact into a decision: a learner sees progress and a next step; an instructor sees cohort drop-off and item difficulty; the business sees completion-as-evidence and, where justified, outcome links. The discipline is to show each audience only the metric tied to their decision — a dashboard that visualises everything and recommends nothing is the most common way analytics dies. Designing reports people actually use is its own craft, covered in Reporting to Stakeholders.

Privacy is not a layer at the end; it is a constraint you wire into every layer, and because learner records are personal data, it carries legal weight. In the European Union, learner data falls under the General Data Protection Regulation (GDPR, Regulation (EU) 2016/679); United States student records fall under FERPA, the Family Educational Rights and Privacy Act; and biometric signals, if you ever capture them, invoke laws like Illinois's BIPA. Four engineering controls keep you on the right side of all three.

Minimise at capture. GDPR makes data minimisation an explicit, enforceable requirement: collect only what a defined purpose needs [9]. This is why Layer 1 used an opaque account ID, not an email: a pseudonymous actor is far lower risk. Caliper's own guidance echoes this, urging you to keep contextual data in each event minimal [5].

Pseudonymise, and keep the key separate. Under GDPR, pseudonymised data is still personal data but much lower risk [9]; store the map from learner-8a3f to a real identity in a separate, tightly controlled system, so the analytics warehouse holds no direct identifiers.

Plan deletion against immutability. The LRS will not let you edit a statement [4], so honouring a deletion request means voiding or purging by actor; design that operation on day one, not when the first erasure request arrives.

Set retention before you collect, not after. Decide how long each class of data lives and automate its expiry; "keep everything forever" is a liability, not an asset.
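The pseudonymisation control can be sketched as a small identity vault. This is an in-memory illustration only: `IdentityVault` and `pseudonymous_actor` are hypothetical names, and in practice the mapping would live in a separate, access-controlled service with its own storage, not in the same process as the analytics code.

```python
import secrets


class IdentityVault:
    """Stand-in for the separate system that maps real identities to pseudonyms."""

    def __init__(self):
        self._pseudonym_by_identity = {}
        self._identity_by_pseudonym = {}

    def pseudonym_for(self, real_identity: str) -> str:
        """Return a stable, random pseudonym for this identity."""
        key = real_identity.strip().lower()
        if key not in self._pseudonym_by_identity:
            pseudonym = f"learner-{secrets.token_hex(4)}"
            while pseudonym in self._identity_by_pseudonym:  # avoid rare collisions
                pseudonym = f"learner-{secrets.token_hex(4)}"
            self._pseudonym_by_identity[key] = pseudonym
            self._identity_by_pseudonym[pseudonym] = key
        return self._pseudonym_by_identity[key]

    def forget(self, real_identity: str):
        """Drop the link; return the pseudonym so its statements can be purged."""
        key = real_identity.strip().lower()
        pseudonym = self._pseudonym_by_identity.pop(key, None)
        if pseudonym is not None:
            del self._identity_by_pseudonym[pseudonym]
        return pseudonym


def pseudonymous_actor(vault: IdentityVault, real_identity: str, home_page: str) -> dict:
    """Build an xAPI account actor that carries no direct identifier."""
    return {
        "objectType": "Agent",
        "account": {"homePage": home_page, "name": vault.pseudonym_for(real_identity)},
    }
```

Because the pseudonym is random rather than derived from the email, nobody holding only the warehouse can work backwards to a person. `forget` returns the pseudonym so the same erasure request can drive the purge-by-actor operation in the LRS and warehouse.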

This article only sketches the legal landscape; the full treatment, including consent, lawful basis, and cross-border transfer, is in Proctoring Data, Privacy, and the Legal Landscape. Treat it as required reading before you wire learner data into a warehouse.

A common mistake that costs a re-build

The defining failure in this whole pipeline is instrumenting late and coarsely. A team ships the player emitting only "completed," launches, and three months in the product owner asks "where do learners drop off in Module 3?" The answer is gone, because the player never emitted the played-segments that would show it, and you cannot backfill events that were never recorded. The fix is a re-instrumentation and a wait for a fresh cohort, costing a quarter. The cure is cheap and must happen at the start: design the rich event schema in Layer 1, emit it from the first release even before a dashboard exists to read it, and store it. Storage is cheap; missing history cannot be bought back at any price. The mirror-image mistake, capturing every per-frame event and drowning in 60 GB of noise, is the same failure of planning from the other side, and Layer 2's sampling is its cure.

Where Fora Soft fits in

Fora Soft has built video streaming, real-time WebRTC, and interactive-player software since 2005, and in e-learning the analytics work is rarely the dashboard at the end — it is wiring the player at Layer 1 so the right events, in the right standard, at the right sampling, flow into a store and warehouse that can answer the questions you will actually ask. The build-vs-buy trade-off we help teams navigate is exactly the one in this article: an off-the-shelf LMS gives canned reporting for free, but custom drop-off heatmaps, custom mastery rules, and learning data joined to business outcomes require a custom capture layer plus an xAPI or Caliper pipeline and a modelled warehouse on top. We help decide which questions justify that layer, then build the instrumentation so the analytics become possible at all. The same capture-transport-store pattern underpins the real-time and streaming work we do in conferencing, telemedicine, OTT, and surveillance.

Call to action

Talk to an e-learning engineer: book a 30-minute scoping call to talk through your learning analytics pipeline plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Learning-Analytics Pipeline Build Checklist — A one-page build checklist covering event-schema design, sampling, the LRS choice, the warehouse model, and the privacy controls to wire in from day one.

References

[1] ADL Initiative. Experience API (xAPI) Specification v1.0.3, Part 2: Statements (Data). https://github.com/adlnet/xAPI-Spec/blob/master/xAPI-Data.md. Tier 1 (primary standard). The required actor-verb-object statement model; actor identification by mbox, mbox_sha1sum, openid, or account; result and context structure.
[2] IEEE. 9274.1.1-2023, Standard for Learning Technology: JSON Data Model Format and RESTful Web Service for Learner Experience Data Tracking and Access. https://standards.ieee.org/ieee/9274.1.1/7321/. Tier 1 (primary standard). The ratified IEEE form of xAPI, informally "xAPI 2.0," published 2023-10-10; defines the LRS as the system exposing the RESTful API.
[3] ADL / xAPI Video Community Profile. xAPI Video Profile. https://github.com/adlnet/xapi-authored-profiles/tree/master/video. Tier 1 (primary profile). Verbs initialized/played/paused/seeked/completed/terminated; result extensions progress and played-segments; aggregation by registration.
[4] ADL Initiative. Experience API (xAPI) Specification, Part 3: Communication (Statement Resource, State Resource, batch and immutability rules). https://github.com/adlnet/xAPI-Spec/blob/master/xAPI-Communication.md. Tier 1 (primary standard). LRS must accept statement batches (arrays); must reject a batch if any statement is rejected; must not modify a stored statement; State Resource for documents.
[5] 1EdTech (formerly IMS Global). Caliper Analytics 1.2 Specification and Implementation Guide (MediaEvent, Envelope, data-minimisation guidance). https://www.imsglobal.org/spec/caliper/v1p2. Tier 1 (primary standard). Envelope requires sensor, sendTime, dataVersion, data; data may batch multiple events; keep contextual entities minimal to limit payload size.
[6] Rustici Software / SCORM.com. Tracking a video using xAPI (statement-size guidance; ~1.8 KB per SCORM-profile statement). https://support.scorm.com/hc/en-us/articles/206165016-Tracking-a-video-using-xAPI. Tier 4 (first-party engineering source). Used only for the order-of-magnitude statement size in the sizing example.
[7] ADL Initiative. SCORM 2004 4th Edition, Run-Time Environment (RTE). https://adlnet.gov/projects/scorm/. Tier 1 (primary standard). Fixed data model (completion status, success status, score, time); course tracked as a black box inside an LMS launch; cannot carry per-second video data.
[8] Matillion. ETL Architecture and Design: patterns for modern data pipelines (ELT vs ETL; staging → modelled warehouse). https://www.matillion.com/blog/etl-architecture-design-patterns-modern-data-pipelines. Tier 6 (vendor engineering reference). Orientation for the load-raw-then-transform pattern; not a learning-standards source.
[9] Regulation (EU) 2016/679 (GDPR), Article 5(1)(c) data minimisation, Article 4(5) pseudonymisation, Article 17 right to erasure. https://eur-lex.europa.eu/eli/reg/2016/679/oj. Tier 1 (primary law). Minimisation is an enforceable principle; pseudonymised data remains personal data; data subjects have a right to erasure.

Where sources disagreed, the official specifications were followed. Many vendor articles claim "SCORM tracks everything"; this article follows SCORM 2004's fixed, black-box data model [7] and shows that per-second video analytics require the xAPI Video Profile [3] or Caliper [5]. The statement-size figure [6] is a vendor estimate used only for order-of-magnitude arithmetic, explicitly labelled as approximate, not a standards claim.

Related glossary terms

SCORM

xAPI Video Profile

Learning analytics

Re-watch

E-learning

Mastery

Offline learning

xAPI statement