Quantization

Definition

Shrinking a model by storing its weights at lower numeric precision, so it uses less memory and runs faster, usually with little accuracy loss.

Quantization replaces high-precision numbers with smaller ones — for example 16-bit to 8-bit or 4-bit — cutting VRAM and speeding inference. It is the main way to fit larger models onto smaller GPUs and edge devices; aggressive levels trade some accuracy.

Also known as

model quantization, INT8, 4-bit