ASR (Automatic Speech Recognition) converts spoken audio into text. In the learning-video context it is the engine behind auto-captions, searchable transcripts, lecture summarization, and quiz generation from recorded content. Modern transformer-based ASR models such as OpenAI Whisper reach word error rates competitive with human transcription on clear speech, though accuracy drops with accents, domain jargon, and noisy audio. The output is a time-stamped transcript in which each word (or, at coarser granularity, each segment) is anchored to a position in the video, enabling in-video search, chaptering suggestions, and the engagement analytics that depend on knowing exactly what was said when. For accessibility, WCAG requires synchronized captions for prerecorded video with audio (Success Criterion 1.2.2, Level A); ASR makes producing a first draft fast, but human review remains necessary to reach the accuracy that compliance demands. ASR feeds other AI features in the pipeline: the transcript goes to a summarizer, a quiz generator, or a RAG index for an AI tutor. Language-model-based ASR systems can also produce punctuated, paragraph-structured output rather than a raw token stream, which reduces the editing burden downstream. A practical gotcha is speaker diarization, the task of attributing words to individual speakers in a multi-person recording. Most general-purpose ASR models handle it imperfectly, and it matters for virtual classroom recordings where instructor and learner turns must be distinguished.
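
To make the time-stamped transcript concrete, here is a minimal Python sketch of how word-level timestamps might be turned into WebVTT caption cues and used for in-video search. It assumes the ASR step has already produced a list of words with start and end times in seconds; the `Word` type, the function names, and the cue-splitting thresholds (`max_words`, `max_gap`) are illustrative assumptions, not the output format or API of any particular ASR toolkit.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its position in the video (hypothetical shape)."""
    text: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def format_timestamp(seconds: float) -> str:
    """Render seconds as a WebVTT timestamp of the form HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def group_into_cues(words, max_words=7, max_gap=0.8):
    """Split words into cues when a cue is full or the speaker pauses.

    max_words and max_gap (seconds) are assumed values chosen for illustration.
    """
    cues, current = [], []
    for word in words:
        cue_full = len(current) >= max_words
        long_pause = bool(current) and word.start - current[-1].end > max_gap
        if current and (cue_full or long_pause):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def to_webvtt(words, max_words=7, max_gap=0.8) -> str:
    """Build a WebVTT document with one cue per group of words."""
    lines = ["WEBVTT", ""]
    for cue in group_into_cues(words, max_words, max_gap):
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        lines.append(f"{start} --> {end}")
        lines.append(" ".join(w.text for w in cue))
        lines.append("")
    return "\n".join(lines)


def find_phrase(words, phrase):
    """Return the start time of every occurrence of phrase.

    Matching ignores case and surrounding punctuation.
    """
    def normalize(token):
        return token.lower().strip(".,;:!?\"'")

    target = [normalize(t) for t in phrase.split()]
    if not target:
        return []
    tokens = [normalize(w.text) for w in words]
    n = len(target)
    return [
        words[i].start
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == target
    ]


if __name__ == "__main__":
    # Made-up sample data standing in for real ASR output.
    sample = [
        Word("Gradient", 0.00, 0.48), Word("descent", 0.52, 0.95),
        Word("updates", 1.00, 1.40), Word("the", 1.42, 1.50),
        Word("weights.", 1.52, 2.00), Word("Now", 3.20, 3.40),
        Word("consider", 3.45, 3.90), Word("gradient", 3.95, 4.40),
        Word("descent", 4.45, 4.90), Word("again.", 4.95, 5.40),
    ]
    print(to_webvtt(sample))
    print(find_phrase(sample, "gradient descent"))
```

A video player could seek to any of the times returned by `find_phrase`, and the WebVTT text could serve as the first-draft caption file that a human reviewer then corrects. Splitting cues on pauses is only one possible heuristic; a production captioning workflow would likely also consider punctuation, line length, and reading speed.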

