Captions are time-stamped text renderings of the spoken audio track in a video, displayed synchronously inside or below the player frame so that learners can read what is being said. WCAG 2.1 Success Criterion 1.2.2 (Captions (Prerecorded), Level A) requires captions for all prerecorded audio content in synchronised media, except where the media is itself a clearly labelled alternative for text. Because Level A criteria are included in Level AA conformance, and many accessibility laws and procurement policies reference WCAG AA, captions are in effect a legal or contractual requirement on many corporate and public-sector learning platforms. Captions differ from subtitles in that they also convey non-speech audio information, such as "[alarm sounds]" or "[music]", that is relevant to understanding the content. In practice, captions are produced in two ways: automatically by an ASR (automatic speech recognition) engine such as Whisper or a vendor's speech-to-text API, followed by human review for accuracy; or manually from a transcript. Common delivery formats are WebVTT for web players and SRT for offline or upload workflows; both are plain-text files pairing timecodes with caption text. Beyond accessibility, captions support engagement: learners tend to watch longer, comprehension improves for non-native speakers, and the text can be indexed for in-video search. The main quality risk with auto-generated captions is misrecognised technical vocabulary, so a human review step is essential before auto-generated captions are treated as accessible. Caption styling matters as well: font size and placement affect readability, and the contrast between caption text and its background should meet the WCAG colour-contrast requirements.
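
To make the two delivery formats concrete, the sketch below converts SRT caption text to WebVTT. It is a simplified illustration and does not cover every case: the helper name `srt_to_vtt` and the sample cues are hypothetical, and it assumes well-formed SRT input with no styling tags or positioning data. The only changes it makes are adding the `WEBVTT` header and replacing the comma before the milliseconds with a full stop; the SRT cue numbers are kept, since WebVTT permits a cue identifier line before each timing line.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
_TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3})\s*-->\s*(\d{2,}:\d{2}:\d{2}),(\d{3})\s*$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT caption text to WebVTT (hypothetical helper, simplified)."""
    # Strip a UTF-8 byte-order mark if present and normalise line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    out = ["WEBVTT", ""]  # WebVTT files start with this header and a blank line
    for line in text.split("\n"):
        match = _TIMING.match(line)
        if match:
            # WebVTT uses "." rather than "," before the milliseconds.
            start_hms, start_ms, end_hms, end_ms = match.groups()
            line = f"{start_hms}.{start_ms} --> {end_hms}.{end_ms}"
        out.append(line)
    # End the file with exactly one newline.
    return "\n".join(out).rstrip("\n") + "\n"


# Illustrative input only: two cues, one of them a non-speech caption.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,000
Welcome to the fire safety module.

2
00:00:04,500 --> 00:00:06,000
[alarm sounds]
"""

vtt = srt_to_vtt(SAMPLE_SRT)
print(vtt)
```

Caption text lines pass through unchanged, so any corrections made during human review of the SRT file carry over to the WebVTT output.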

