Edge Compression — Distillation + Quantization Decision Checklist

A one-page checklist for shrinking AI models so they run on edge devices in a video product. It covers the two levers and how they compound: distillation trains a smaller student model, while quantization stores the same model's numbers in fewer bits. It works through the bit-width memory math for a 7-billion-parameter model (28 / 14 / 7 / 3.5 GB at FP32 / FP16 / INT8 / INT4) against an 8 GB edge ceiling; the six quantization methods (INT8 with calibration, GPTQ, AWQ, GGUF Q4_K_M, FP8, NVFP4) and what each is for; the rule of trying post-training quantization (PTQ) first and moving to quantization-aware training (QAT) only if needed; the per-stage decision for a video pipeline (detection on INT8/TensorRT, speech on a distilled-then-quantized model, language/VLM on 4-bit AWQ/GGUF); and the questions to ask before you compress, including re-measuring accuracy on your own footage.
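The bit-width memory math above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the checklist itself: it assumes decimal gigabytes (1 GB = 10^9 bytes), counts weights only, and uses hypothetical helper names (`weight_memory_gb`, `fits_on_device`) chosen for illustration.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return weight storage in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_device(num_params: float, precision: str, ceiling_gb: float = 8.0) -> bool:
    """True if the weights alone fit under the memory ceiling.

    This ignores activations, KV cache, and runtime overhead, so a
    result of True is necessary but not sufficient.
    """
    return weight_memory_gb(num_params, precision) <= ceiling_gb


if __name__ == "__main__":
    params = 7e9  # the 7-billion-parameter model from the text
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(params, precision)
        verdict = "fits" if fits_on_device(params, precision) else "too large"
        print(f"{precision}: {size:.1f} GB ({verdict} under 8 GB)")
```

Under these assumptions only INT8 and INT4 clear the 8 GB ceiling, and INT8 leaves about 1 GB for everything else. Real 4-bit formats also store scaling metadata alongside the weights, so actual files tend to be somewhat larger than the idealized 3.5 GB.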


Specialist software house for video, real-time and AI products. Founded 2005. 50 in-house engineers.

© Fora Soft, 2005–2026