Multimodal AI fuses signals that used to need separate systems, so a single model can watch a clip and answer questions about it in words. It is the basis of video understanding, visual search, and assistants that 'see' as well as read.
Definition
AI that handles more than one kind of input together — images or video plus text or audio — and reasons across them in one model.
Also known as
multimodal models, multi-modal