A transcript is the complete written representation of the spoken audio in a learning video, with each word or sentence paired with start and end timestamps so that the text can be synchronised with playback. Transcripts differ from captions in purpose and granularity: captions are an accessibility aid displayed a line or two at a time, in sync with playback, while a transcript is a document-like artefact that a learner can read in full, search, copy from, or annotate independently of the video. The dominant interchange format is WebVTT (Web Video Text Tracks), a plain-text format for timestamps and text that browsers support natively through the HTML track element; SRT (SubRip Text) is an older but widely supported alternative. Transcripts are most commonly generated by ASR (Automatic Speech Recognition) and then reviewed by a human editor, who corrects proper nouns, technical terms, and punctuation before publication. A word-level timestamped transcript is the foundation for in-video search: when a learner types a keyword, the search engine looks up matching words in the transcript index and returns the video position of each hit, letting the player jump directly to that moment. Transcripts also support learner annotations anchored to text passages, reduce cognitive load by letting learners read ahead, and feed localization pipelines in which machine translation produces a draft in another language. The practical trade-off is accuracy: ASR error rates rise significantly with accented speech, technical vocabulary, or low-quality audio, and uncorrected transcripts degrade both search quality and learner trust.
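The in-video search described above can be illustrated with a minimal sketch in Python. It assumes the transcript is already available as word-level entries with start and end times in seconds. The names Word, build_index, search, and format_vtt_timestamp are hypothetical and chosen only for illustration, and the sample words and timings are invented. A production search engine would typically add stemming, phrase matching, and ranking, none of which is shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcript word with its timing (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def normalise(token: str) -> str:
    """Lower-case a token and strip surrounding punctuation."""
    return token.strip(".,;:!?\"'()").lower()


def build_index(words):
    """Map each normalised word to the start times at which it is spoken."""
    index = defaultdict(list)
    for word in words:
        key = normalise(word.text)
        if key:
            index[key].append(word.start)
    return dict(index)


def search(index, keyword):
    """Return the video positions (in seconds) of every hit for keyword."""
    return index.get(normalise(keyword), [])


def format_vtt_timestamp(seconds: float) -> str:
    """Render seconds in the WebVTT timestamp layout hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


# Invented sample data: six words from two moments in a lecture.
transcript = [
    Word("Gradient", 12.40, 12.91),
    Word("descent", 12.91, 13.38),
    Word("minimises", 13.38, 14.02),
    Word("loss.", 14.02, 14.50),
    Word("The", 75.00, 75.12),
    Word("gradient", 75.12, 75.60),
]

index = build_index(transcript)
hits = search(index, "gradient")
print(hits)                                     # [12.4, 75.12]
print([format_vtt_timestamp(t) for t in hits])  # ['00:00:12.400', '00:01:15.120']
```

Each returned position is a value the player could seek to. The index stores only start times because that is all a jump requires; keeping the end times as well would allow the matching word to be highlighted in the transcript during playback.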