Localization in learning video is the process of adapting course content and platform UI so that learners in a target locale can engage with it in their own language and cultural context. For video, localization has three tiers of increasing depth: caption-only localization (translate and synchronise the caption file, leaving the original audio track unchanged), dubbing (replace or overlay the narration audio with a target-language voice), and AI-avatar re-generation (re-synthesise the on-screen presenter speaking the target-language script). Caption-only localization is the fastest and cheapest tier and suits self-paced courses where learners are comfortable with foreign-language audio; dubbing gives a more natural experience but requires voice talent or text-to-speech for each language; re-generated AI avatars are the most seamless but require a synthesis platform and quality review. Machine translation (MT) with a service such as DeepL or a large language model such as GPT can produce a working first draft of captions quickly, but post-editing by a human who knows the subject matter is essential for accurate technical vocabulary. Localization also encompasses UI strings, date and number formats, right-to-left layout support for languages such as Arabic and Hebrew, and locale-appropriate imagery. At scale (deploying a course to ten or twenty locales), per-locale video storage and CDN egress multiply accordingly, so bitrate ladder optimisation and shared asset caching become more important. Localization also interacts with accessibility: each localized caption file must itself meet WCAG caption requirements in the target language.
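
As an illustration of the caption-only tier, the sketch below translates the text of each caption cue while copying the timing lines verbatim, so the captions stay synchronised with the unchanged audio. It assumes the captions are stored as WebVTT, which the text above does not specify, and it handles only simple files (cue identifiers and NOTE/STYLE blocks are dropped). The names `Cue`, `parse_vtt`, `localize_captions`, and the `translate` callback are hypothetical; `translate` stands in for whatever MT service produces the first draft, and its output would still need human post-editing.

```python
import re
from typing import Callable, List, NamedTuple

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.000"
# (the hours field is optional in WebVTT).
TIMING = re.compile(r"^(\d{2}:)?\d{2}:\d{2}\.\d{3} --> (\d{2}:)?\d{2}:\d{2}\.\d{3}")


class Cue(NamedTuple):
    timing: str  # timing line, kept verbatim so synchronisation is preserved
    text: str    # cue payload; may span several lines


def parse_vtt(vtt: str) -> List[Cue]:
    """Extract cues from a simple WebVTT document.

    Blocks without a timing line (the WEBVTT header, NOTE blocks) are skipped,
    and any cue identifier line before the timing line is discarded.
    """
    cues = []
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if TIMING.match(line):
                cues.append(Cue(line, "\n".join(lines[i + 1:])))
                break
    return cues


def localize_captions(vtt: str, translate: Callable[[str], str]) -> str:
    """Return a new WebVTT document with translated text and unchanged timings.

    `translate` is a hypothetical callback (assumption): in practice it would
    wrap an MT service call and return a draft for human post-editing.
    """
    out = ["WEBVTT", ""]
    for cue in parse_vtt(vtt):
        out.append(cue.timing)           # timing copied as-is
        out.append(translate(cue.text))  # only the payload changes
        out.append("")                   # blank line terminates the cue
    return "\n".join(out)
```

Calling `localize_captions(source_vtt, my_mt_function)` once per target locale yields one caption file per language. Passing the translator in as a callback keeps the timing logic independent of any particular MT provider and makes it easy to substitute a fixed lookup table when testing.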