A vision-language model joins a visual encoder to a language model so you can ask plain-language questions about a picture or clip and get a written answer. VLMs increasingly replace custom detectors for tasks where flexibility beats peak speed.
Definition
A model that takes images or video plus text and produces text — describing, answering questions, or reasoning about what it sees.
A vision-language model joins a visual encoder to a language model so you can ask plain-language questions about a picture or clip and get a written answer. VLMs increasingly replace custom detectors for tasks where flexibility beats peak speed.
Also known as
vision-language model, VLMs