Definition
The memory on a GPU. A model only runs if its weights plus working data fit in VRAM, so VRAM size caps which models a given GPU can serve.
VRAM is the fast memory attached to a GPU. A model's weights and its temporary working data (such as activations) must fit inside it; if they do not, the model either fails to load or runs out of memory while running. Quantization stores weights at lower precision, which is a common way to fit a larger model into a given amount of VRAM.
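The arithmetic behind this can be sketched in a few lines. The following is a rough estimate only, not a measurement: the function names are hypothetical, and the 20% allowance for working data is an assumed placeholder, since real overhead depends on the model, batch size, and context length.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(
    num_params: float,
    bits_per_weight: int,
    vram_gib: float,
    overhead_fraction: float = 0.2,  # assumed allowance for working data
) -> bool:
    """Rough check: do weights plus an assumed overhead fit in VRAM?"""
    needed = weight_memory_gib(num_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gib


if __name__ == "__main__":
    params = 7e9    # a hypothetical 7-billion-parameter model
    vram = 12.0     # a hypothetical GPU with 12 GiB of VRAM
    for bits in (16, 8, 4):
        size = weight_memory_gib(params, bits)
        verdict = "fits" if fits_in_vram(params, bits, vram) else "does not fit"
        print(f"{bits}-bit weights: {size:.1f} GiB, {verdict} in {vram:.0f} GiB")
```

Under these assumptions, the 7-billion-parameter example needs roughly 13 GiB for 16-bit weights, which exceeds a 12 GiB GPU, while 4-bit quantized weights take about 3.3 GiB and fit with room to spare.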
Also known as
GPU memory