Inference is the act of running a trained model to get an answer: a video frame goes in and "person, 0.92 confidence, here" comes out. It is distinct from training, the expensive and comparatively infrequent process of teaching the model from data. In a surveillance system, training happens in a lab; inference happens continuously, on every analysed frame, wherever the analytics live: on the camera's NPU, on an on-prem GPU server, or in the cloud.
Inference is the workload that actually sizes a deployed system. Every analysed camera generates a steady stream of inferences, measured in inferences per second, and compute, latency, and cost all follow from how many inferences must run, how big the model is, and where it runs. Models are often optimised for inference by quantising them (reducing numeric precision, typically from 32-bit floating point to 8-bit integers) so they run faster on edge silicon. This typically costs only a few percentage points of accuracy in exchange for a large speed-up.
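The sizing arithmetic can be sketched in a few lines of Python. This is a rough illustration, not a vendor sizing tool: the function names, the 0.7 utilisation cap, and the throughput figure of 300 inferences per second are all assumptions chosen for the example, and real throughput has to be measured for the specific model and hardware.

```python
import math


def inferences_per_second(cameras: int, analysed_fps: float) -> float:
    """Total inference load: one inference per analysed frame per camera."""
    return cameras * analysed_fps


def devices_needed(
    cameras: int,
    analysed_fps: float,
    device_throughput: float,
    max_utilisation: float = 0.7,  # assumed safety margin, not a standard value
) -> int:
    """Number of inference devices needed to carry the load.

    device_throughput is the measured inferences per second of one device
    for the chosen model; max_utilisation keeps headroom for bursts.
    """
    if device_throughput <= 0:
        raise ValueError("device_throughput must be positive")
    if not 0 < max_utilisation <= 1:
        raise ValueError("max_utilisation must be in (0, 1]")
    load = inferences_per_second(cameras, analysed_fps)
    usable = device_throughput * max_utilisation
    return math.ceil(load / usable)


# Hypothetical site: 40 cameras analysed at 10 frames per second each,
# on devices assumed to sustain 300 inferences per second.
print(inferences_per_second(40, 10))   # 400
print(devices_needed(40, 10, 300))     # 2
```

Halving the analysed frame rate halves the load, which is why the analysed frame rate is usually the first lever to pull when a design does not fit the hardware budget.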
The pitfall is forgetting that inference quality is a range, not a guarantee. The same model delivers different accuracy depending on resolution, scene, lighting, and how aggressively it was compressed to fit the hardware, and it is never perfect. Plan around realistic precision/recall figures for the actual conditions, and treat the placement of inference (edge vs server vs cloud) as the core trade-off between speed, cost, and accuracy. Model internals are covered in the AI for Video Engineering section.
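To make "realistic precision/recall figures for the actual conditions" concrete, the sketch below computes both metrics from counted detections, separately per condition. The counts are invented purely for illustration and are not measurements of any real model; in practice they would come from reviewing the model's output against labelled footage from the site.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true positives, false positives, false negatives.

    precision = share of raised detections that were correct
    recall    = share of real events that were detected
    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for one model evaluated under two conditions.
conditions = {
    "daylight": {"tp": 90, "fp": 10, "fn": 10},
    "night": {"tp": 60, "fp": 20, "fn": 40},
}

for name, counts in conditions.items():
    p, r = precision_recall(**counts)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Reporting the figures per condition rather than as one blended number is the point: a single average can hide the scenes where the model performs worst.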

