VLM (Vision-language model)

Definition

A model that takes images or video plus text and produces text that describes, answers questions about, or reasons about what it sees.

A vision-language model joins a visual encoder to a language model so you can ask plain-language questions about a picture or clip and get a written answer. VLMs increasingly replace custom detectors for tasks where flexibility beats peak speed.
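As a rough illustration of how a visual encoder can be joined to a language model, the sketch below follows one common design (the one used by LLaVA-style models): image patch features are passed through a learned projection into the language model's embedding space and then placed in front of the prompt's token embeddings. Everything here is a toy assumption for illustration only: the dimensions, the random numbers standing in for real encoder outputs and weights, and the helper names `make_matrix`, `project`, and `build_vlm_input` are hypothetical and do not come from any library.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of random floats (stand-in for real weights or features)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vectors, weight):
    """Multiply each input vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]


def build_vlm_input(patch_features, text_embeddings, projector):
    """Map visual features into the text embedding space, then prepend them to the prompt."""
    visual_tokens = project(patch_features, projector)
    return visual_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, TEXT_DIM = 8, 4  # toy sizes; real models use hundreds or thousands of dimensions

patch_features = make_matrix(6, VISION_DIM, rng)    # pretend encoder output for 6 image patches
text_embeddings = make_matrix(3, TEXT_DIM, rng)     # pretend embeddings for 3 prompt tokens
projector = make_matrix(VISION_DIM, TEXT_DIM, rng)  # the connector between the two models

sequence = build_vlm_input(patch_features, text_embeddings, projector)
print(len(sequence), len(sequence[0]))  # 9 positions, each of width TEXT_DIM
```

The language model then reads this combined sequence as if the image patches were extra tokens in the prompt. Real systems differ in the connector they use (a linear layer, a small MLP, or cross-attention), so treat this as one possible arrangement, not the definitive one.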

Also known as

vision-language model, VLMs

Related terms

Multimodal AI

LLaVA

Qwen-VL

CLIP