LLaVA pairs a visual encoder with an open language model so it can describe images and answer visual questions, and its open weights make it a frequent starting point for teams building or fine-tuning their own VLM features.
Definition
A widely used open vision-language model that connects an image encoder to an open LLM; it is a common base for self-hosted multimodal features.
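To make "connects an image encoder to an open LLM" concrete, here is a minimal sketch in plain Python. It assumes the connector is a single linear projection that maps image patch features into the LLM's embedding space, with the resulting image tokens placed ahead of the text tokens. Real checkpoints may use a different connector. The encoder and LLM are stood in for by random numbers, and every name and size below (`build_llm_input`, `VISION_DIM`, `LLM_DIM`, and so on) is hypothetical and chosen only for illustration.

```python
import random


def linear(vectors, weights):
    """Apply a bias-free linear map to each vector: out[j] = sum_i v[i] * weights[i][j]."""
    out_dim = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


def build_llm_input(patch_features, projector, text_embeddings):
    """Project image patch features into the LLM embedding space,
    then prepend them to the text token embeddings."""
    image_tokens = linear(patch_features, projector)
    return image_tokens + text_embeddings


# Toy sizes (assumptions for illustration, far smaller than any real model).
VISION_DIM, LLM_DIM = 4, 6
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 2

rng = random.Random(0)
# Stand-in for the image encoder output: one feature vector per image patch.
patch_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
# Stand-in for the learned projection weights (VISION_DIM x LLM_DIM).
projector = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
# Stand-in for the LLM's embeddings of the text prompt tokens.
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

sequence = build_llm_input(patch_features, projector, text_embeddings)
print(len(sequence), len(sequence[0]))  # 5 6
```

The point of the sketch is that, under this assumption, the language model never sees pixels: it receives one sequence of same-width vectors, some derived from the image and some from the text.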
Also known as
LLaVA-NeXT (a later release in the LLaVA family, also called LLaVA-1.6)