A vision-language model joins a visual encoder to a language model so you can ask plain-language questions about a picture or clip and get a written answer. VLMs increasingly replace custom detectors for tasks where flexibility beats peak speed.