Quantization replaces high-precision numbers with smaller ones — for example 16-bit to 8-bit or 4-bit — cutting VRAM and speeding inference. It is the main way to fit larger models onto smaller GPUs and edge devices; aggressive levels trade some accuracy.