One page on how to shrink AI models so they run on the edge for a video product. It covers the two levers (distillation trains a smaller student model; quantization stores the same model's numbers in fewer bits) and how they compound; the bit-width memory math for a 7-billion-parameter model (28 / 14 / 7 / 3.5 GB of weights at FP32 / FP16 / INT8 / INT4) against an 8 GB edge ceiling; the six quantization methods (INT8 with calibration, GPTQ, AWQ, GGUF Q4_K_M, FP8, NVFP4) and what each is for; the rule of trying post-training quantization (PTQ) first and moving to quantization-aware training (QAT) only if needed; the per-stage decision for a video pipeline (detection on INT8/TensorRT, speech on a distilled-then-quantized model, language/VLM on 4-bit AWQ/GGUF); and the questions to ask before you compress, including re-measuring accuracy on your own footage.
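The memory figures above follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch of that calculation, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only; activations, KV cache, and per-group quantization scales are not included, so real footprints will be somewhat higher. The names `weight_memory_gb` and `fits_on_edge` are illustrative, not from any library.

```python
# Weight-only memory estimate for a model at different bit widths.
# Assumptions (not from the source): 1 GB = 1e9 bytes, and no overhead
# for activations, KV cache, or quantization scales/zero-points.

PARAMS = 7_000_000_000          # 7-billion-parameter model
EDGE_CEILING_GB = 8.0           # edge memory ceiling stated in the text
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(params: int, bits: int) -> float:
    """Return the size of the weights in decimal gigabytes."""
    return params * bits / 8 / 1e9


def fits_on_edge(params: int, bits: int, ceiling_gb: float = EDGE_CEILING_GB) -> bool:
    """True if the weights alone fit under the memory ceiling."""
    return weight_memory_gb(params, bits) <= ceiling_gb


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        gb = weight_memory_gb(PARAMS, bits)
        verdict = "fits" if fits_on_edge(PARAMS, bits) else "does not fit"
        print(f"{name}: {gb:.1f} GB of weights, {verdict} in {EDGE_CEILING_GB:.0f} GB")
```

Under these assumptions only INT8 and INT4 come in under the 8 GB ceiling, and INT8 leaves about 1 GB of headroom for everything else, which is why the 4-bit formats matter for the language/VLM stage.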