LLaVA pairs a visual encoder with an open language model so that it can describe images and answer questions about them. Because its weights are openly available, it is a frequent starting point for teams building or fine-tuning their own vision-language model (VLM) features.
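
The pairing can be pictured with a toy sketch. This is not LLaVA's actual code. It assumes a single learned projection maps the visual encoder's patch features into the language model's embedding space, and that the projected image tokens are simply placed ahead of the text tokens. All function names, sizes, and random values here are hypothetical and chosen only for illustration.

```python
import random

def linear_project(features, weight):
    """Map each patch feature (length d_vision) to length d_model.

    `weight` is stored as d_model rows, each of length d_vision.
    """
    return [
        [sum(f * w for f, w in zip(feat, row)) for row in weight]
        for feat in features
    ]

def build_input_sequence(image_features, text_embeddings, weight):
    """Project image patch features and place them before the text embeddings."""
    visual_tokens = linear_project(image_features, weight)
    return visual_tokens + text_embeddings

random.seed(0)
D_VISION, D_MODEL = 4, 3            # toy dimensions (hypothetical)
NUM_PATCHES, NUM_TEXT_TOKENS = 5, 2

# Stand-ins for a trained projection, encoder output, and text embeddings.
weight = [[random.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
image_features = [[random.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
text_embeddings = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(NUM_TEXT_TOKENS)]

sequence = build_input_sequence(image_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # prints: 7 3
```

The language model then sees one sequence of 7 vectors, 5 from the image and 2 from the text, all in the same 3-dimensional space. In a real model the dimensions are in the thousands and the projection is trained rather than random.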