Published 2026-06-03 · 25 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
Every AI feature in a video product costs money every time it runs, and on video it runs constantly — a detector looks at thirty frames a second, a transcriber processes every minute of every recording, a moderation model screens every uploaded clip. When those models are full-size and run in the cloud, the bill scales with usage and never stops; when they are small enough to run on the device that captured the video, the marginal cost can fall to near zero, latency drops because nothing travels to a server, and sensitive footage never leaves the building. This lesson is for the product manager, founder, or engineering lead who has to decide whether an AI feature runs in the cloud or on the edge, approve a hardware budget for cameras or phones, or understand why a vendor says a model "won't fit" — and who needs to know what distillation and quantization actually do, what they cost in accuracy, and which one to reach for when. It is the compression companion to the inference-serving lesson: serving makes a model fast on a big GPU, compression makes it small enough to leave the big GPU behind.
Two Ways To Make A Model Smaller
A trained AI model is, underneath, an enormous list of numbers called weights — the values the model learned during training, the things it multiplies an input by to produce an output. A modern model has billions of them. Those numbers are the model's size: store them, load them onto a chip, and multiply by them fast enough, and you have a working model. Make them fewer, or make each one cheaper to store, and the model shrinks. That is the whole subject of this lesson, and there are exactly two levers.
The first lever is quantization: keep every weight, but write each one down using fewer digits. Imagine recording every price in a shop to the exact cent versus rounding each to the nearest dollar — you keep all the prices, the list is just smaller and faster to add up, and you lose a little precision. Quantization does that to a model's numbers, and the precision you lose usually costs less accuracy than you would guess.
The second lever is distillation: throw away the big model's numbers entirely and train a new, smaller model to imitate it. Imagine a master chef who cannot be in every kitchen, so they train an apprentice by having the apprentice watch and copy every dish until the apprentice produces nearly the same food with far less experience. The apprentice is a different, smaller person — not a compressed copy of the master — but trained to behave like them.
Those are the two ideas, and the rest of this lesson is what they cost, how they differ, and why a serious edge plan uses both. The short version, to hold onto from the start: quantization changes how a model's numbers are stored; distillation changes which model you have.
Figure 1. Two levers, not one. Quantization keeps the model and shrinks its numbers; distillation trains a new, smaller model to copy a bigger one. They stack.
Why "Edge," And Why Video Makes It Urgent
The word edge means running the model on or near the device that produced the data — the camera, the phone, the on-premises box in the server closet — rather than sending the data to a model in the cloud. The opposite of edge is the datacenter: a rack of expensive graphics cards you rent by the hour. The choice between them is the subject of the latency and deployment lesson; compression is what makes the edge option possible, because a datacenter-sized model simply does not fit on a camera.
Four pressures push video AI toward the edge specifically, and each one is sharper for video than for text. The first is volume: a video feature does not run once per question like a chatbot — it runs on every frame or every minute of footage, so any per-call cloud cost is multiplied by an enormous number. The second is bandwidth: shipping raw video to a cloud model is slow and expensive, and sometimes there is no connection at all — a body camera, a drone, a factory line. The third is latency: a background-blur or a safety alert that has to feel instant cannot afford a round-trip to a server. The fourth is privacy: medical, surveillance, and classroom footage is far easier to govern when the pixels never leave the device, a point that matters in telemedicine and any regulated vertical.
The catch is hardware. An edge device has a fraction of the memory and power of a datacenter GPU — a typical edge module might have eight gigabytes of memory against a datacenter card's eighty. A model that runs comfortably in the cloud will not even load on the edge. Compression is the bridge: it is how a model trained on big iron is made to run on small iron. And because a video feature is rarely one model but a pipeline of several — decode, detect, transcribe, describe, summarize, the chain described in the serving lesson — the compression job repeats once per stage, each with its own answer.
Quantization: Same Model, Cheaper Numbers
Start with quantization, because it is the lever you reach for first and the one with the cleanest arithmetic. To understand it you need one fact about how a computer stores a number. A weight is normally stored as a floating-point number — a format that can represent a wide range of values with a fractional part, the computer's version of scientific notation. The standard training format uses thirty-two bits per number (called FP32, for thirty-two-bit floating point); a bit is a single 1-or-0, so thirty-two bits is four bytes of storage per weight. Most models today are stored or run at sixteen bits (FP16, two bytes) without anyone noticing a difference.
Quantization pushes further, down to integers — whole numbers with no fractional part — that take far less space. INT8 uses eight bits, one byte, per weight. INT4 uses four bits, half a byte. The trick that makes this work is simple to picture: take the range of values a group of weights actually spans, slice that range into a fixed number of evenly spaced steps (256 steps for eight-bit, 16 steps for four-bit), and store each weight as the number of the nearest step rather than its exact value. Two small bookkeeping numbers — a scale (how big each step is) and a zero-point (which step means zero) — let the chip reconstruct an approximate original value when it needs to do the math. You have traded exactness for size, exactly like rounding prices to the nearest dollar and keeping a note of the rounding rule.
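The slicing-and-rounding step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular toolkit implements it: the function names `quantize_group` and `dequantize_group` and the six sample weights are made up for the example, and the range is widened to include zero so that zero itself lands exactly on a step.

```python
def quantize_group(weights, bits=8):
    """Affine (scale + zero-point) quantization of one group of weights."""
    levels = 2 ** bits                        # 256 steps for 8-bit, 16 for 4-bit
    lo, hi = min(weights), max(weights)
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # make sure 0.0 sits exactly on a step
    scale = (hi - lo) / (levels - 1) or 1.0   # how big each step is
    zero_point = round(-lo / scale)           # which step means zero
    codes = [
        min(levels - 1, max(0, round(w / scale) + zero_point))
        for w in weights
    ]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Reconstruct approximate weights from the stored step numbers."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [-0.62, -0.10, 0.0, 0.03, 0.41, 0.87]
for bits in (8, 4):
    codes, scale, zero_point = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, zero_point)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: codes={codes} scale={scale:.4f} "
          f"zero_point={zero_point} worst_error={worst:.4f}")
```

Running it shows the trade directly: the four-bit codes are much coarser than the eight-bit ones, and the worst reconstruction error grows with the step size.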
The payoff is direct, because memory scales with bits per weight. Take a model with seven billion weights — a common size for an on-device language or vision-language model. The memory just to hold its weights is the number of weights times the bytes per weight:
FP32 (4 bytes): 7,000,000,000 × 4 = 28 GB
FP16 (2 bytes): 7,000,000,000 × 2 = 14 GB
INT8 (1 byte): 7,000,000,000 × 1 = 7 GB
INT4 (0.5 byte): 7,000,000,000 × 0.5 = 3.5 GB
Read down that column and the edge story tells itself. On an edge module with eight gigabytes of memory, the FP16 version (14 GB) does not load at all; the INT8 version (7 GB) fits but leaves almost no room for the model's working memory; the INT4 version (3.5 GB) fits with space to spare. Quantization is the difference between "this model is impossible here" and "this model runs with headroom." The foundational result, from the 2018 paper that established integer-only inference, is that moving weights and activations to eight-bit integers gives close to a 4× memory reduction versus thirty-two-bit floating point — and on hardware with integer math units, it runs faster too, because moving and multiplying smaller numbers is cheaper.
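The same arithmetic is easy to turn into a quick fit check. The sketch below is only an illustration: the helper names and the 2 GB headroom figure for working memory are assumptions made for the example, not measured values, and sizes are in decimal gigabytes to match the figures above.

```python
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_weights, fmt):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_weights * BYTES_PER_WEIGHT[fmt] / 1e9


def fits(num_weights, fmt, device_gb, headroom_gb=0.0):
    """True if the weights plus a working-memory allowance fit on the device."""
    return weight_memory_gb(num_weights, fmt) + headroom_gb <= device_gb


NUM_WEIGHTS = 7_000_000_000   # the seven-billion-weight example
DEVICE_GB = 8.0               # the eight-gigabyte edge module
HEADROOM_GB = 2.0             # assumed allowance for working memory (illustrative)

for fmt in BYTES_PER_WEIGHT:
    size = weight_memory_gb(NUM_WEIGHTS, fmt)
    loads = fits(NUM_WEIGHTS, fmt, DEVICE_GB)
    comfortable = fits(NUM_WEIGHTS, fmt, DEVICE_GB, HEADROOM_GB)
    print(f"{fmt}: {size:.1f} GB, loads={loads}, fits with headroom={comfortable}")
```

With that assumed allowance, only the INT4 version clears the bar, which is the "fits but leaves almost no room" versus "fits with space to spare" distinction in numbers.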
The fear, reasonably, is accuracy: surely rounding every number degrades the model? It does, but far less than intuition suggests. The well-replicated rule of thumb is that INT8 quantization costs under one percent of accuracy for most models — a loss small enough that users do not notice. Four-bit is more aggressive and needs cleverer methods (below), but modern four-bit techniques land remarkably close to the original. The reason the damage is small is that neural networks are noisy and redundant by nature; they were never relying on the fortieth decimal place of any weight.
The Two Ways To Quantize: PTQ And QAT
There are two moments at which you can quantize a model, and the choice is a recurring decision point, so it is worth naming clearly.
Post-training quantization (PTQ) happens after the model is fully trained: you take the finished model and convert its numbers down, with no further training. To pick good scales, PTQ runs a small sample of representative data through the model once — a step called calibration — to see what range the values actually occupy, then sets the steps to match. PTQ is fast, cheap, and needs no training pipeline or labelled data, which is why it is the default first attempt. Its weakness is that the model never had a chance to adapt to the rounding, so on sensitive models the accuracy loss can be larger than you want.
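Calibration is simple enough to sketch. The example below is a toy under stated assumptions: the random numbers stand in for values recorded while running real sample data through a model, the helper names are invented, and the percentile clipping is a common refinement rather than something this lesson prescribes.

```python
import random


def calibrate_range(batches, clip_percentile=100.0):
    """Return the (lo, hi) range seen across calibration batches.

    clip_percentile=100.0 keeps the full min/max; a lower value drops the
    same small fraction of values from each tail before taking the range.
    """
    seen = sorted(v for batch in batches for v in batch)
    drop = int(len(seen) * (100.0 - clip_percentile) / 100.0)
    return seen[drop], seen[len(seen) - 1 - drop]


def scale_and_zero_point(lo, hi, bits=8):
    """Turn an observed range into the two bookkeeping numbers."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    return scale, round(-lo / scale)


random.seed(0)
# Stand-in for values observed on representative data (hypothetical).
batches = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(8)]
batches[0][0] = 40.0   # a single outlier, to show why the range choice matters

full_range = calibrate_range(batches)
clipped_range = calibrate_range(batches, clip_percentile=99.9)
print("full range:   ", full_range, scale_and_zero_point(*full_range))
print("clipped range:", clipped_range, scale_and_zero_point(*clipped_range))
```

The single outlier stretches the full range and makes every step coarser; clipping it gives the bulk of the values finer steps at the price of saturating the outlier. Which choice is right is exactly the kind of thing the accuracy re-measurement decides.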
Quantization-aware training (QAT) happens during training: you simulate the rounding while the model is still learning, so the model sees the quantization error and adjusts its weights to compensate. The result is a model that is more robust to being quantized and recovers most or all of the accuracy that PTQ would have lost. The cost is real, though: QAT needs a training pipeline, the original training data, and GPU time. The standard practice — and the right default — is to try PTQ first, measure the accuracy on your own data, and only reach for QAT if PTQ's loss is too large to ship. Most teams never need QAT for eight-bit; they often do for four-bit on an accuracy-critical model.
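The "simulate the rounding while learning" idea can be shown on a toy with a single weight. This is a sketch of the mechanism only, assuming a one-weight model, a squared-error loss, and the common straight-through trick of passing the gradient through the rounding step unchanged; the names `fake_quant` and `qat_step` and all the numbers are made up for illustration.

```python
def fake_quant(w, scale, bits=8):
    """Round w to the nearest representable step, but keep it as a float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def qat_step(w, x, y, scale, bits=8, lr=0.01):
    """One gradient step on the loss (fake_quant(w) * x - y) ** 2."""
    w_q = fake_quant(w, scale, bits)     # the forward pass sees the rounded weight
    error = w_q * x - y
    grad_wq = 2.0 * error * x            # gradient with respect to the rounded weight
    # Straight-through estimator: rounding has no useful gradient, so treat it
    # as the identity inside the representable range and as flat outside it.
    lo = -(2 ** (bits - 1)) * scale
    hi = (2 ** (bits - 1) - 1) * scale
    grad_w = grad_wq if lo <= w <= hi else 0.0
    return w - lr * grad_w, error ** 2   # the update goes to the full-precision weight


w = 0.30   # full-precision weight; rounding it directly to steps of 0.25 gives 0.25
for step in range(20):
    w, loss = qat_step(w, x=1.0, y=0.5, scale=0.25, bits=4)
print(f"latent weight={w:.3f}, quantized weight={fake_quant(w, 0.25, 4)}, last loss={loss}")
```

In this toy setup, rounding the trained weight after the fact would freeze it at 0.25, while training with the rounding in the loop nudges the full-precision weight across the boundary until its quantized value is the one that fits the target.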
Common mistake. Quantizing a model and shipping it without re-measuring accuracy on your data. The "under one percent loss" figure is an average across benchmarks, not a guarantee for your specific model on your specific footage — a detector that loses a fraction of a point of average accuracy might lose far more on the one rare class you actually care about (the weapon, the fall, the defect). Quantization is never "set and forget." Every time you compress, you re-run your evaluation set, the one built in the eval-rig lesson, and you compare the compressed model against the original on the metrics that matter for the job — not on a generic leaderboard. If you cannot measure it, you cannot ship it.
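A per-class comparison is the simplest guard against this mistake. The sketch below uses invented labels and predictions (196 "person" clips and 4 "fall" clips) and hypothetical helper names; in practice the lists would come from running both models over your own evaluation set.

```python
def per_class_recall(labels, predictions):
    """Fraction of each true class that the model got right."""
    hits, totals = {}, {}
    for truth, pred in zip(labels, predictions):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return {cls: hits[cls] / totals[cls] for cls in totals}


def regressions(labels, original_preds, compressed_preds, max_drop=0.01):
    """Classes whose recall fell by more than max_drop after compression."""
    before = per_class_recall(labels, original_preds)
    after = per_class_recall(labels, compressed_preds)
    return {
        cls: (before[cls], after[cls])
        for cls in before
        if before[cls] - after[cls] > max_drop
    }


# Hypothetical evaluation set: one rare class that matters.
labels = ["person"] * 196 + ["fall"] * 4
original_preds = list(labels)      # the original model gets everything right
compressed_preds = list(labels)
compressed_preds[-1] = "person"    # the compressed model misses one fall

overall = sum(t == p for t, p in zip(labels, compressed_preds)) / len(labels)
print(f"overall accuracy after compression: {overall:.3f}")
print("per-class regressions:", regressions(labels, original_preds, compressed_preds))
```

Overall accuracy moves by half a point, comfortably inside the rule of thumb, while recall on the rare class falls from 100% to 75%. The aggregate number would have passed; the per-class check would not.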
The Quantization Zoo: INT8, AWQ, GPTQ, GGUF, FP8, FP4
"Quantize the model" is not one button; it is a family of methods that differ in how few bits they reach and how cleverly they avoid damage. A video team does not need to implement these, but it needs to recognize the names on a vendor's spec sheet and know what each buys. Six matter in 2026.
INT8 with calibration is the workhorse and the safe default for computer-vision models on the edge. A detector or pose model converted to eight-bit integers through PTQ, then served through NVIDIA's TensorRT (the optimizer covered in the serving lesson), is the routine path to real-time detection on a camera or a Jetson module, with the sub-one-percent accuracy loss above.
GPTQ and AWQ are the two best-known four-bit methods for large language and vision-language models, and they are smarter than plain rounding because they protect the weights that matter most. AWQ — Activation-aware Weight Quantization, which won the MLSys 2024 best-paper award — is built on the observation that a small fraction of weight channels (around one percent) carry most of the model's quality, so it scales and protects those salient channels using statistics about which inputs are large, and rounds the rest hard. The published result is four-bit weights with near-sixteen-bit quality, and because AWQ uses no back-propagation it generalizes cleanly to instruction-tuned and multimodal models — the vision-language models a video team actually deploys. GPTQ reaches similar bit-widths by a different route, correcting the rounding error layer by layer. Either one turns a model that needs 14 GB at FP16 into one that needs around 4 GB at four-bit.
GGUF is the format-plus-method behind the popular llama.cpp engine, and it is how most teams run a quantized language model on a laptop, a Mac, or a small box. Its "k-quant" levels carry suffixes like Q4_K_M — four-bit, mixed precision, keeping the most important layers at higher precision and the rest at four bits, averaging about 4.5 bits per weight. The measured trade-off is excellent: going from eight-bit all the way down to Q4_K_M costs roughly two percent on quality metrics while saving around forty percent of memory, which is why Q4_K_M is the community's default recommendation. The format choices here connect directly to the model-artifact-formats lesson, where GGUF and its alternatives are the procurement decision.
FP8 and FP4 are the newest entries, and they are floating-point formats rather than integers — eight and four bits, but with a tiny exponent so they keep some of floating point's range. They matter because of hardware: NVIDIA's 2025 Blackwell generation added native four-bit tensor cores, and its NVFP4 format pairs four-bit values with a two-level scaling scheme so that a model quantized from eight-bit down to four-bit loses, by NVIDIA's published measurement, one percent or less of accuracy on language tasks while running with up to four times the throughput of eight-bit on the same chip. FP4 is a datacenter story for now — it needs Blackwell-class hardware — but it is where the precision frontier is heading, and it tells you that "four-bit" is becoming a first-class citizen rather than an act of desperation.
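To make "four-bit floating point with a shared scale" less abstract, here is a deliberately simplified sketch. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per block of 16 values, and it leaves out two parts of the real NVFP4 scheme: storing that block scale in eight-bit floating point, and the second, per-tensor scale. The function name and sample values are invented.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(block):
    """Quantize one block to E2M1 values with a single shared scale.

    Simplification: the scale is kept as an ordinary float here.
    """
    peak = max(abs(v) for v in block)
    scale = peak / 6.0 if peak > 0 else 1.0   # map the largest value onto 6.0
    restored = []
    for v in block:
        target = abs(v) / scale
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        restored.append(math.copysign(magnitude, v) * scale)
    return restored, scale


# Hypothetical block of 16 values.
block = [0.02, -0.31, 0.77, -1.20, 0.05, 0.48, -0.09, 1.50,
         -0.66, 0.11, 0.93, -0.27, 0.35, -1.05, 0.60, -0.01]
restored, scale = quantize_block_fp4(block)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"block scale={scale:.4f}, worst error={worst:.4f}")
```

The uneven grid is the point: steps are fine near zero, where most values sit, and coarse near the top of the range, which is how a floating-point format keeps more range than a four-bit integer would.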
Figure 2. The quantization zoo. Eight-bit INT8 is the safe edge default; four-bit AWQ, GPTQ, and GGUF run large models on small devices; FP8 and FP4 are the hardware-driven frontier.
Distillation: A Small Student Learns From A Big Teacher
Quantization shrinks the numbers but keeps the model's structure. Distillation does the opposite: it changes which model you have, training a smaller "student" network to reproduce the behaviour of a larger "teacher." The idea comes from a 2015 paper by Hinton, Vinyals, and Dean, and its insight is subtle and worth understanding because it explains why distillation works at all.
When a model classifies an input, it does not just output the single right answer — it outputs a full set of probabilities across every possible answer. Shown a photo of a husky, a good model says "ninety percent dog, eight percent wolf, one percent cat" — and that spread is information. It tells you the model thinks a husky is dog-like but wolf-adjacent, which is real knowledge about the world that the bare label "dog" throws away. Distillation trains the student not on the hard labels ("dog") but on the teacher's full soft probabilities ("ninety/eight/one"), and a temperature control softens those probabilities so the small near-zero values become visible and teachable. The student is learning the teacher's judgment, not just its final answer — which is why a small student trained this way beats a small model of the same size trained from scratch on labels alone.
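The soft-probability idea fits in a short sketch. The logits below are invented so that the teacher's output roughly matches the husky example, the student logits are arbitrary, and the loss is the temperature-softened KL divergence scaled by the square of the temperature, following the original paper's formulation; a real training setup would usually combine it with the ordinary hard-label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                          # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical teacher scores for (dog, wolf, cat) on a husky photo.
teacher_logits = [4.5, 2.1, 0.0]
student_logits = [3.0, 0.5, 0.2]   # an untrained student's guess (made up)

for t in (1.0, 4.0):
    probs = softmax(teacher_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
print("distillation loss:", distillation_loss(student_logits, teacher_logits))
```

At a temperature of 1 the "cat" probability is almost invisible; at 4 it becomes large enough for the student to learn from, which is the softening the paragraph describes.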
The concrete results are the reason distillation earns its place. The 2019 DistilBERT result took a standard language model and produced a student that was 40% smaller and 60% faster while keeping 97% of the original's language understanding. The result that matters most for video is Distil-Whisper, the distilled version of OpenAI's Whisper speech-recognition model: the distilled large-v3 student has 756 million parameters against the teacher's 1.55 billion — roughly half the size — runs about 6× faster, and stays within one percent word-error-rate of the teacher on long-form audio. For a product that transcribes archives of video, that is the difference between a model that needs a datacenter and one that runs on a workstation, covered in the streaming-ASR lesson.
Distillation's cost is the mirror image of quantization's. Quantization is cheap to apply (PTQ takes minutes) but limited in how far it can shrink a model before quality falls off. Distillation can produce a dramatically smaller model that is genuinely fast, but it is expensive to create: you need the teacher, a large amount of data to train the student on, and real training compute — the student is a new model, trained almost from scratch. That is why you do not distil your own model lightly; far more often you download a student someone else distilled (Distil-Whisper, DistilBERT, the small versions of open vision-language models) and treat distillation as a procurement choice rather than an in-house project.
Distillation vs Quantization — And Why You Use Both
Put the two side by side, because the most common confusion is treating them as competing options when they are complementary stages.
| | Quantization | Distillation |
|---|---|---|
| What it changes | How weights are stored (fewer bits) | Which model you have (a new, smaller one) |
| Same model? | Yes — same architecture, cheaper numbers | No — a different, smaller architecture |
| Typical size win | 2× (FP16→INT8) to 4× (FP16→INT4) | 2× and up (e.g. 1.55B→756M params) |
| Speed win | On integer/low-bit hardware; mostly memory | Yes — fewer layers, fewer operations |
| Accuracy cost | <1% at INT8; small at 4-bit with AWQ/GGUF | ~1–3% if the distillation was done well |
| Cost to apply | Low (PTQ: minutes) to medium (QAT) | High — needs teacher, data, training compute |
| You usually | Apply it yourself, per deployment | Download a student someone already trained |
The point the table cannot show in a single row is that the two compound. You distil first to get a smaller model, then quantize that smaller model to make its numbers cheaper, and the savings multiply rather than add. Distil-Whisper's large-v3 is roughly half the teacher's size; quantize that student to eight-bit integers and it is half again: a model a quarter the footprint of the original, running several times faster, still within striking distance of the teacher's accuracy. The same stacking applies to vision-language models: take an already-small open VLM, apply AWQ four-bit weights, and a model that needed a datacenter card runs on a single consumer GPU or a high-end edge module. Compression is not a single decision; it is a sequence, and the sequence is distil-then-quantize.
On The Hardware: TOPS, Tensor Cores, And What "Edge" Can Do
Compression only pays off if the device can actually run the low-precision math fast, so a word on what edge hardware looks like in 2026. The headline number on an edge AI chip is TOPS — trillions of operations per second — and it is almost always quoted at INT8, because eight-bit integer math is the precision real edge models run at. That alone tells you how central quantization is: the hardware vendors quote their performance in the quantized format, not the training format.
NVIDIA's Jetson line is the reference point for serious edge video work. The entry Jetson Orin Nano, after a 2024–2025 software boost, delivers about 67 TOPS of INT8 performance in a module that draws between seven and twenty-five watts and carries eight gigabytes of memory; the larger Jetson AGX Orin reaches around 275 TOPS. Those chips run NVIDIA tensor cores that accelerate INT8 and FP16, and the standard deployment path is to convert a model to a TensorRT engine at INT8, exactly as a detector would be served in the Jetson capstone lesson. On phones, the equivalent runtimes are Apple's Core ML and Android's neural-network APIs driving the phone's dedicated AI chip (its NPU), and the equivalent compression target is INT8 or smaller, the path behind on-device features like the small-model work in the Depth Anything and SmolVLM lesson.
The practical lesson is that the eight-gigabyte memory ceiling and the INT8 TOPS rating are the two numbers that decide what fits. Our seven-billion-parameter example needed 14 GB at FP16 and would not load on an eight-gigabyte Jetson; at INT4 it needs 3.5 GB and fits with room for the model's working memory. The hardware sets the budget; compression is how you spend it.
Per-Stage: What To Compress In A Video Pipeline
Because a video feature is a pipeline of several model types — the chain from the serving lesson — the compression decision is made once per stage, and the right answer differs by stage. Here is the shape most teams converge on.
For the detection and segmentation stage — YOLO, SAM, pose models, the kind from the YOLO lesson — the answer is INT8 via TensorRT, applied with PTQ first and QAT only if the accuracy on the rare-but-important classes slips. These models are small to begin with and run many times per second, so eight-bit is the sweet spot: real-time speed, negligible accuracy loss.
For the speech-recognition stage, the answer is a distilled model, then quantized — Distil-Whisper rather than full Whisper, served on an efficient engine and often at eight-bit. Here distillation does the heavy lifting because the architecture itself shrinks; quantization adds a second multiplier on top.
For the vision-language and language stages — the describer and the summarizer — the answer is four-bit weights via AWQ or GGUF, because these models are the largest in the pipeline and the four-bit methods are mature enough to hold quality. This is where the memory math bites hardest, and where four-bit is the difference between fitting on the device and not.
Figure 3. Compression, stage by stage. Detection wants INT8; speech wants a distilled-then-quantized model; the big language and vision-language stages want four-bit. Every stage re-measures accuracy.
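One way to keep the per-stage decisions explicit is to write them down as data next to the pipeline. The sketch below is only a suggestion: the `StagePlan` structure and its field names are invented for illustration, and the entries simply restate the defaults described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    stage: str
    distil_first: bool   # swap in a distilled student before quantizing?
    precision: str       # target weight precision
    method: str          # how the stage gets to that precision
    remeasure: bool = True   # every stage re-runs the evaluation set


# Hypothetical plan restating the per-stage defaults from the text.
PIPELINE_PLAN = [
    StagePlan("detection_and_segmentation", False, "INT8",
              "PTQ via TensorRT; QAT only if rare-class accuracy slips"),
    StagePlan("speech_recognition", True, "INT8",
              "distilled student such as Distil-Whisper, then quantized"),
    StagePlan("vision_language_and_language", False, "4-bit",
              "AWQ or GGUF weights"),
]

for plan in PIPELINE_PLAN:
    order = "distil, then quantize" if plan.distil_first else "quantize"
    print(f"{plan.stage}: {order} to {plan.precision} ({plan.method})")
```

Keeping the plan in one place makes it harder to compress one stage and forget that the others need their own answer and their own accuracy check.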
A Worked Example — Stacking The Two Levers
Make the savings concrete by stacking distillation and quantization on the speech stage, since that is where both apply cleanly. Start with the full Whisper large-v3 teacher: 1.55 billion parameters, stored at sixteen-bit, which is about 3.1 GB of weights and the throughput of the full model.
Step one, distil. Swap in Distil-Whisper large-v3, the student: 756 million parameters, roughly half the size, about 1.5 GB at sixteen-bit, and around 6× the speed, all within one percent of the teacher's word-error-rate on long-form audio. You have halved the memory and multiplied the speed before touching the numbers.
Step two, quantize. Take that 756-million-parameter student down to eight-bit integers. Memory roughly halves again:
distilled student at FP16: 756M × 2 bytes = ~1.5 GB
distilled student at INT8: 756M × 1 byte = ~0.76 GB
The combined result, relative to the original teacher, is a model about a quarter of the memory footprint (3.1 GB down to under 0.8 GB), running several times faster, while accuracy stays within roughly one percent of where it started. That model fits on the edge with ease, and on a Jetson-class module it transcribes faster than real time. Neither lever alone gets you there: quantization alone leaves the model twice as large as it needs to be; distillation alone leaves the numbers more expensive than they need to be. Used together, in order, they turn a datacenter model into a device model. This is the same compounding the cost-optimization lesson treats as a core lever, and why a compressed model and a good inference server multiply rather than substitute.
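The stacking arithmetic can be checked in a few lines. This sketch only reproduces the weight-memory figures from the worked example, using the parameter counts quoted above; the helper name is made up, and speed and accuracy are not modelled.

```python
def weights_gb(params, bytes_per_weight):
    """Weight memory in decimal gigabytes."""
    return params * bytes_per_weight / 1e9


TEACHER_PARAMS = 1_550_000_000   # Whisper large-v3
STUDENT_PARAMS = 756_000_000     # Distil-Whisper large-v3

teacher_fp16 = weights_gb(TEACHER_PARAMS, 2)   # starting point
student_fp16 = weights_gb(STUDENT_PARAMS, 2)   # after step one, distil
student_int8 = weights_gb(STUDENT_PARAMS, 1)   # after step two, quantize

print(f"teacher FP16: {teacher_fp16:.2f} GB")
print(f"student FP16: {student_fp16:.2f} GB ({student_fp16 / teacher_fp16:.0%} of teacher)")
print(f"student INT8: {student_int8:.2f} GB ({student_int8 / teacher_fp16:.0%} of teacher)")
```

The two ratios multiply: roughly half from distillation, then half again from eight-bit weights, which is where the "about a quarter" figure comes from.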
Where Fora Soft Fits In
We build video products across conferencing, streaming and OTT, e-learning, telemedicine, and surveillance, and many of their AI features have to run where the video is — on a camera in a store, a phone in a clinic, an on-premises box where footage cannot leave the building. Our compression practice follows the pipeline: detection and segmentation models exported to INT8 TensorRT engines for real-time work on Jetson-class hardware; speech handled with distilled-then-quantized models sized to the device; and the larger vision-language and language stages taken to four-bit weights with AWQ or GGUF so they fit on edge memory budgets. We treat distillation as a procurement choice — usually adopting a student someone has already trained and validated — and quantization as a per-deployment step that always ends with re-measuring accuracy on the client's own footage, because a number that holds on a public benchmark can move on a specific camera angle or a specific rare event. The verticals change the models and the accuracy targets; the method — distil to shrink the architecture, quantize to cheapen the numbers, measure after every step — does not.
What To Read Next
- vLLM + Triton + TensorRT — inference serving for video AI
- Model artifact formats and the open-vs-closed procurement decision
- Latency, deployment, and real-time vs batch — edge vs cloud
Talk To Us · See Our Work · Download
- Talk to a video engineer — plan which AI features in your video product can run on the edge, and how to distil and quantize the models to fit the device: /services/ai-software-development
- See our case studies — conferencing, streaming, surveillance, telemedicine, and AI work: /portfolio
- Download the edge-compression decision checklist — distillation vs quantization, the bit-width memory math, the six quantization methods, the per-stage decision for a video pipeline, and the questions to ask before you compress, on one page: Download the checklist
References
- Hinton, Vinyals, Dean — "Distilling the Knowledge in a Neural Network" (arXiv:1503.02531, 2015, accessed June 2026) — https://arxiv.org/abs/1503.02531 — tier 3 (foundational paper by the technique's authors). Source for knowledge distillation, the teacher-student framing, training the student on the teacher's soft probability distributions rather than hard labels, and the softmax-temperature control (values of roughly 1–10 work well). The origin of the method the article describes.
- Sanh, Debut, Chaumond, Wolf (Hugging Face) — "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (arXiv:1910.01108, 2019, accessed June 2026) — https://arxiv.org/abs/1910.01108 — tier 3 (peer-reviewed paper). Source for the DistilBERT result: a student 40% smaller and 60% faster that retains 97% of BERT's language understanding — the canonical concrete distillation number.
- Gandhi, von Platen, Rush (Hugging Face) — "Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling" (arXiv:2311.00430, 2023; distil-whisper/distil-large-v3 model card, accessed June 2026) — https://arxiv.org/abs/2311.00430 — tier 3 (paper) + tier 4 (model card). Source for the Distil-Whisper figures: distilled large-v3 at 756M parameters vs the 1.55B teacher (~49–51% smaller), ~6× faster, within 1% word-error-rate of the teacher on long-form audio.
- Jacob, Kligys, Chen, Zhu, Tang, Howard, Adam, Kalenichenko (Google) — "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (arXiv:1712.05877; CVPR 2018, accessed June 2026) — https://arxiv.org/abs/1712.05877 — tier 3 (foundational paper). Source for integer-only inference, the scale-and-zero-point quantization scheme, quantization-aware training, and the close-to-4× memory-footprint reduction of INT8 weights and activations versus FP32.
- Lin, Tang, Tang, Yang, Chen, Wang, Xiao, Dang, Gan, Han (MIT) — "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (arXiv:2306.00978; MLSys 2024 Best Paper, accessed June 2026) — https://arxiv.org/abs/2306.00978 — tier 3 (award-winning peer-reviewed paper). Source for AWQ: protecting the ~1% salient weight channels via activation-statistic-driven scaling, no back-propagation/reconstruction, near-FP16 quality at 4-bit, and generalization to instruction-tuned and multimodal LLMs.
- Frantar, Ashkboos, Hoefler, Alistarh — "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (arXiv:2210.17323, 2022/2023, accessed June 2026) — https://arxiv.org/abs/2210.17323 — tier 3 (peer-reviewed paper). Source for GPTQ as a one-shot post-training method that quantizes large language models to 3–4 bits with layer-by-layer error correction, the alternative four-bit route to AWQ cited in the article.
- NVIDIA — "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference" (NVIDIA Technical Blog, 2025, accessed June 2026) — https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ — tier 3 (vendor primary, hardware owner). Source for the NVFP4 four-bit format (E2M1 with two-level scaling — blocks of 16 sharing an FP8 E4M3 scale plus a per-tensor FP32 scale), the up-to-4× throughput over FP8 on Blackwell, the ≤1% accuracy degradation FP8→NVFP4 on DeepSeek-R1, and Blackwell-class hardware requirement.
- NVIDIA — "Jetson Orin Nano Super Developer Kit" product page and "Jetson Orin Nano Developer Kit Gets a Super Boost" (NVIDIA, 2024–2025, accessed June 2026) — https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/ — tier 4 (vendor product documentation). Source for the Jetson Orin Nano's ~67 INT8 TOPS (40 TOPS base, raised by the "Super" software), 8 GB LPDDR5, 7–25 W envelope, the AGX Orin ~275 TOPS figure, and INT8/FP16 tensor-core acceleration.
- ggml-org / llama.cpp — "quantize" documentation and k-quant definitions; community evaluations of GGUF quantization (2024–2026, accessed June 2026) — https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md — tier 4 (reference implementation) + tier 6 (community benchmarks). Source for GGUF k-quants, Q4_K_M as 4-bit mixed precision (~4.5 bits/weight) keeping critical layers higher-precision, and the measured ~2% quality cost for ~40% memory saving relative to 8-bit. Benchmark figures are version-sensitive.
- Ultralytics / TensorFlow Model Optimization — "Quantization-Aware Training (QAT)" glossary and "Quantization aware training" guide (2025–2026, accessed June 2026) — https://www.ultralytics.com/glossary/quantization-aware-training-qat — tier 4 (framework documentation). Source for the PTQ-vs-QAT distinction: PTQ is faster and needs no retraining; QAT simulates quantization during training so the model adapts and recovers accuracy, at the cost of a training pipeline and compute — and the "PTQ first, QAT if needed" practice.


