Multimodal AI fuses signals such as video, audio, and text that used to need separate systems, so a single model can watch a clip and answer questions about it in words. This fusion is the basis of video understanding, visual search, and assistants that 'see' as well as read.