Machine learning object recognition in camera systems for security and autonomous vehicles

Key takeaways

Integration is the product. A YOLO weights file has no commercial value until it ingests RTSP, emits ONVIF metadata, and lands a detection event in Milestone, Genetec, or your own VMS within 300 ms.

Inference location is a cost decision, not a technical one. On-camera inference at 2–5 W beats cloud at ~$0.10 per stream-minute above roughly 40–50 cameras; below that, Rekognition is often cheaper than buying and operating hardware.

The three-tier topology wins. Camera-side detection, gateway aggregation on Jetson or Hailo, cloud-side re-identification and search. Nothing in the real world runs purely in any one tier.

Privacy is a schema, not a policy. GDPR and CCPA compliance lives in how detection metadata is stored, blurred, and retained — not in an annual policy review.

Fora Soft delivers a custom integration for roughly $140K–$250K. Twelve weeks, tuned YOLOv9 or DETR models, ONVIF metadata emission, and a VMS-ready event pipeline. Agent Engineering trims roughly a third off legacy timelines.

Object recognition is the cheapest part of a camera analytics platform. Choosing where it runs, how it emits metadata, and which VMS receives the event is the expensive part — and the part vendors do not help with. This playbook is the architecture, hardware, and integration pattern we use to ship custom object-recognition camera solutions at Fora Soft in 2026.

Planning a camera analytics build that has to land in your existing VMS?

Bring the camera brand, VMS, and concurrent-stream count — we will come back with an inference topology, hardware shortlist, and a twelve-week integration plan.

Book a 30-min call →

Why Fora Soft wrote this playbook

Fora Soft has shipped video analytics products continuously since 2005, and object detection on IP camera feeds has been in our stack since MobileNet-SSD on NVIDIA Jetson TX1. What follows is the architecture we actually use: what we put on the camera, what we run on the gateway, what we push to the cloud, and how we make the result behave like a first-class citizen inside Milestone XProtect, Genetec Security Center, or a custom VMS.

We focus on the 2026 integration reality: YOLOv9, DETR, and a handful of purpose-built models running under TensorRT, Hailo HEF, Axis ACAP, or CoreML; events flowing through MQTT or Kafka; metadata emitted as ONVIF Profile T XML; embeddings stored in Milvus for re-identification. If your use case already has a packaged product (vehicle counting, license plate recognition), buy it. If it has a twist, build it — with the patterns below.

What changed in 2024–2026

Three shifts made custom object recognition both easier and harder between 2024 and 2026.

Easier: edge silicon exploded. Hailo-8 on a USB stick delivers 26 TOPS at 2.5 W. Ambarella CV7 lands 15 TOPS inside a camera SoC. Sony IMX500 puts a tiny classifier on the sensor itself. NVIDIA Jetson Orin Nano offers 40 INT8 TOPS for under $500. Inference that needed a GPU server in 2022 now sits behind a PoE port.

Easier: YOLOv9 and YOLO-NAS narrowed the accuracy gap. YOLOv9-E hits 56% mAP on COCO, a figure that would have required a two-stage detector four years ago. YOLOv8n runs inference in 1.47 ms on a T4 under TensorRT. The open-source model is almost always good enough; the weights file is not the moat.

Harder: GDPR and CCPA got teeth. CNIL issued over 100 surveillance-related decisions in 2025–2026 with sanctions above €200,000. California’s CCPA amendments expanded biometric retention rules. Custom privacy engineering — face blurring, license plate masking, role-based access to raw footage — is now a hard requirement, not a nice-to-have.

Three-tier architecture: camera, gateway, cloud

Every deployment larger than a handful of cameras ends up with the same three-tier pattern. Read it top-down.

Tier 1 — On-camera inference. Lightweight detectors (YOLOv8n, MobileNet-SSD, purpose-built occupancy classifiers) run inside the camera firmware via Axis ACAP, Ambarella CV7 SDK, or an on-sensor runtime like Sony IMX500. Output: bounding boxes and class labels emitted as ONVIF Profile T metadata XML alongside the RTSP stream. Latency: 30–80 ms. Power: 0.5–3 W above the baseline camera draw.

Tier 2 — Gateway aggregation. NVIDIA Jetson Orin, Hailo-8 M.2 accelerators, or an Intel OpenVINO gateway ingest multiple RTSP streams, run heavier models (YOLOv9-C, DETR, action recognition), and perform cross-camera reasoning: tracking a person across the factory floor, counting dwell time, correlating ANPR with access control. Latency: 80–200 ms end to end. Target density: 8–32 streams per Jetson Orin NX.

Tier 3 — Cloud analytics and search. The cloud is the system of record: detection metadata, embeddings, audit logs. It runs the expensive jobs (re-identification across days, forensic search by appearance, analytics dashboards) and nothing real-time. This is also where AWS Rekognition Video, Azure Video Indexer, or Google Cloud Vision slot in when you need a managed service for a specific capability.

The split forces a decision in week one: what stays local, what gets a round trip, and what never leaves the network. Get it wrong and you are either saturating WAN links with raw HD video (cloud-only mistake) or asking a camera to run a transformer it cannot fit in memory (edge-only mistake).

Reach for on-camera inference when: bandwidth is constrained, privacy demands the pixel never leaves the site, or the workload is a single-class classifier (motion, occupancy, forklift detection) that fits under 50 MB.

Edge silicon in 2026: Axis, Hailo, Ambarella, Jetson, Sony

Hardware choice follows three dimensions: where the chip sits, how many TOPS it delivers, and which runtime the team is willing to target.

| Target | TOPS | Runtime | Typical price | Best for |
| --- | --- | --- | --- | --- |
| Axis ARTPEC-8 + ACAP | ~6 TOPS | ACAP native | Camera MSRP | Axis-standardized fleets |
| Ambarella CV7 | ~15 TOPS | CVflow SDK | OEM camera | 4K + analytics in-camera |
| Sony IMX500 | ~1–4 TOPS | Sony AITRIOS | Sensor MSRP | On-sensor classifiers |
| Hailo-8 / Hailo-8L | 26 / 13 TOPS | HailoRT, HEF | $130–$250 per unit | M.2 retrofit on NVR / gateway |
| NVIDIA Jetson Orin Nano / NX | 40 / 100 TOPS | TensorRT + DeepStream | $249–$999 | Multi-stream gateway |
| Intel OpenVINO CPU/iGPU | ~4–8 TOPS effective | OpenVINO IR | Existing hardware | Low-density retrofit |

For greenfield deployments we default to Hailo-8 M.2 cards plugged into an off-the-shelf mini PC or a used Dell OptiPlex; they deliver 29.5 fps on YOLOv8n at 640×640 and pull 2.5 W under load. For existing Axis estates, ACAP on the camera is the lowest-friction path and keeps the gateway simple. Jetson Orin wins when one box must run heterogeneous models (detection + pose + ANPR) on 16–32 streams at once.

Reach for Jetson Orin when: your gateway must run four or more different model families simultaneously — most Hailo-8 deployments hit a DSP scheduling wall past two concurrent model graphs.

Model selection: YOLOv9, DETR, or something smaller

Three model families cover 90% of object-recognition workloads in 2026. Pick by latency budget and whether the box has a GPU.

YOLO family (YOLOv8, YOLOv9, YOLO-NAS). One-stage detectors, anchor-free. YOLOv8n for edge cameras (fits in 6 MB, 1.47 ms on T4), YOLOv8m for Jetson (30 fps at 1080p), YOLOv9-E for cloud or server-class gateways (56% mAP COCO). The Ultralytics toolchain exports cleanly to TensorRT, ONNX, OpenVINO, and CoreML; its ONNX output is what Hailo’s HEF compiler ingests.
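
A minimal fine-tune-and-export sketch with the Ultralytics API — the dataset config forklift.yaml is hypothetical, standing in for your own labeled frames:

```python
# Sketch: fine-tune YOLOv8n on a custom class set and export for the edge.
# Assumes ultralytics >= 8.x; "forklift.yaml" is a hypothetical dataset
# config pointing at your own labeled frames.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # COCO-pretrained starting point
model.train(data="forklift.yaml",               # transfer-learn on your footage
            epochs=100, imgsz=640, batch=16)
model.export(format="onnx", imgsz=640)          # ONNX is the common interchange:
                                                # TensorRT, OpenVINO, and Hailo's
                                                # HEF compiler all consume it
```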

DETR and variants (DETA, RT-DETR, Deformable DETR). Transformer detectors. Cleaner behavior on crowded scenes because NMS-style box suppression is replaced by set prediction. RT-DETR hits YOLOv9 speed with transformer semantics. Use when the scene has 30+ overlapping objects or when the downstream system wants global attention maps for explainability.

Purpose-built smaller models. MobileNet-SSD, EfficientDet-Lite, or a tiny custom classifier when the task is single-class (forklift, hard-hat, fire). A fire-detection classifier under 2 MB beats YOLO for false-positive rate because the training set is narrower and better curated. Never use a general-purpose model when a specific one exists.

| Model | COCO mAP | Latency (T4 FP16) | Where it runs |
| --- | --- | --- | --- |
| YOLOv8n | 37.3% | 1.5 ms | On-camera, Hailo, Jetson |
| YOLOv8m | 50.2% | 4.2 ms | Jetson Orin NX, Hailo-8 |
| YOLOv9-E | 56.0% | 12 ms | Server GPU, cloud |
| RT-DETR-L | 53.0% | 9 ms | Server GPU |
| MobileNet-SSD (300) | 24.0% | sub-ms on camera | IMX500, low-end ACAP |

TensorRT INT8 quantization cuts latency by a further 3–5× on NVIDIA hardware; Hailo’s HEF compiler and Intel’s OpenVINO Post-Training Quantization produce comparable results on their targets. The quantization gap between FP32 and INT8 is typically 1–2% mAP on YOLO family models; rarely worth fighting.
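
For the NVIDIA path, INT8 post-training quantization can be driven straight from the same toolchain — a sketch, assuming Ultralytics’ TensorRT export and a hypothetical calib.yaml calibration set drawn from deployment footage:

```python
# Sketch: INT8 TensorRT export with post-training quantization.
# "calib.yaml" is a hypothetical dataset config; a few hundred frames
# that look like the deployment scenes is usually enough calibration.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.export(
    format="engine",   # TensorRT .engine, built on the target GPU
    int8=True,         # PTQ to INT8 -- expect the ~1-2% mAP drop noted above
    data="calib.yaml", # calibration images
    imgsz=640,
)
```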

Reach for RT-DETR when: the scene holds 30+ overlapping objects, or the downstream system needs attention maps for explainability — YOLO’s NMS starts merging and dropping boxes under dense crowd or warehouse layouts.

ONVIF Profile T: the metadata contract

A detection event is worthless if it does not land inside the VMS the security team already uses. ONVIF Profile T, the analytics profile, defines the XML schema for detection metadata that Milestone, Genetec, Avigilon, Axis Camera Station, and most open-source VMS platforms consume out of the box.

The contract is simple: the camera or gateway emits an RTSP stream with a parallel metadata track. Each frame carries a MetadataStream element containing one or more Object entries — each with a bounding box (normalized 0.0–1.0), a class label, a confidence, and a stable tracker ID. Timestamps must be frame-synchronized with the video track; drift above 40 ms confuses the VMS’s event correlation engine.
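
A sketch of what one metadata frame looks like when built programmatically — element names are simplified from the ONVIF ver10 schema, so verify against Profile T and the target VMS before shipping:

```python
# Sketch: build one ONVIF-style metadata frame per detection batch.
# Simplified from the ONVIF ver10 schema; not a drop-in Profile T emitter.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

TT = "http://www.onvif.org/ver10/schema"
ET.register_namespace("tt", TT)

def metadata_frame(detections, utc=None):
    """detections: [(tracker_id, label, confidence, (l, t, r, b))], coords 0.0-1.0."""
    stream = ET.Element(f"{{{TT}}}MetadataStream")
    analytics = ET.SubElement(stream, f"{{{TT}}}VideoAnalytics")
    ts = (utc or datetime.now(timezone.utc)).isoformat()
    frame = ET.SubElement(analytics, f"{{{TT}}}Frame", UtcTime=ts)  # must track video PTS
    for obj_id, label, conf, (l, t, r, b) in detections:
        obj = ET.SubElement(frame, f"{{{TT}}}Object", ObjectId=str(obj_id))
        appearance = ET.SubElement(obj, f"{{{TT}}}Appearance")
        shape = ET.SubElement(appearance, f"{{{TT}}}Shape")
        ET.SubElement(shape, f"{{{TT}}}BoundingBox",
                      left=str(l), top=str(t), right=str(r), bottom=str(b))
        cls = ET.SubElement(appearance, f"{{{TT}}}Class")
        ET.SubElement(cls, f"{{{TT}}}Type", Likelihood=str(conf)).text = label
    return ET.tostring(stream, encoding="unicode")

print(metadata_frame([(42, "Human", 0.91, (0.12, 0.30, 0.25, 0.82))]))
```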

Axis ACAP apps emit ONVIF metadata natively. NVIDIA DeepStream has an onvif-metadata-broker plugin as of DeepStream 6.4. For bespoke pipelines, we usually ship a small Go or Rust service that wraps the inference output and speaks ONVIF to the VMS — it is roughly 600 lines of code and lives in the gateway.

For VMS-specific integration (Milestone MIP SDK, Genetec SDK, Avigilon Control Center SDK), we also emit a parallel webhook or SDK call because the VMS’s rule engine binds to proprietary events more predictably than to generic ONVIF. Two tracks, one source of truth.

The event pipeline: MQTT, Kafka, or webhooks

Detection events need a bus. Three options cover every realistic deployment.

MQTT. Default for low-count, edge-heavy deployments (under 200 cameras, under 500 events per second). Mosquitto or HiveMQ as the broker. QoS 1 for reliability. Fits cleanly on the same gateway that runs inference. NVIDIA DeepStream publishes to MQTT natively.

Kafka. Default above 500 events per second or when multiple independent consumers need the stream (VMS, analytics warehouse, SIEM, alerting). Confluent Cloud, MSK, or self-hosted Strimzi. Topics per camera-group let consumers subscribe without seeing every event. Retention at seven days is typical for replay and debugging.

Webhooks. Use when the consumer is a single SaaS (Splunk, PagerDuty, a ticketing system) and you do not want another broker in the stack. Sign every webhook with HMAC-SHA256; do not trust the source IP.
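
Signing is a few lines of stdlib — a sketch with a hypothetical header name and payload shape; only the shared secret matters:

```python
# Sketch: HMAC-SHA256-signed webhook. Header name and payload layout are
# our own convention (hypothetical), not a standard the receiver must match.
import hashlib, hmac, json
import urllib.request

SECRET = b"rotate-me-out-of-band"

def send_webhook(url: str, event: dict):
    body = json.dumps(event, separators=(",", ":")).encode()
    signature = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    req = urllib.request.Request(url, data=body, headers={
        "Content-Type": "application/json",
        "X-Signature-SHA256": signature,  # receiver recomputes and compares
    })
    return urllib.request.urlopen(req, timeout=5)
```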

We almost always combine two: MQTT from camera/gateway up to a local aggregator, Kafka from aggregator out to downstream consumers. That split survives a WAN outage (MQTT keeps buffering locally) and scales horizontally (Kafka consumer groups take the load).
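
The gateway side of that split is small — a sketch assuming paho-mqtt 1.x and a Mosquitto broker on the gateway host; the topic layout is our own convention:

```python
# Sketch: publish a detection event to the local MQTT aggregator at QoS 1.
# Assumes paho-mqtt 1.x (the 2.x Client constructor adds a required
# callback-API-version argument) and a broker on localhost:1883.
import json, time
import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="gateway-01")
client.connect("localhost", 1883)
client.loop_start()  # background thread handles acks and reconnects

event = {
    "camera": "cam-12", "class": "person", "confidence": 0.91,
    "bbox": [0.12, 0.30, 0.25, 0.82], "tracker_id": 42, "ts": time.time(),
}
client.publish("site-a/cam-12/detections", json.dumps(event), qos=1)
```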

Re-identification: embeddings, not detections

Detection is only half the problem. Once a box is drawn, the commercially interesting question is: “is this the same person we saw yesterday at camera 12?” That is a re-identification problem, and it runs on embeddings, not detections.

The pattern we ship most often: a 256- or 512-dimensional embedding per detected object, computed on the gateway by a lightweight embedding model (OSNet-x0_25 for persons, a color + texture model for vehicles) and written to Milvus, Weaviate, or Qdrant. Query latency at 100 million embeddings, well-indexed with IVF_PQ, is under 50 ms on a single modest VM.
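
A sketch of both halves with pymilvus (MilvusClient API, 2.4+ assumed) — collection schema creation is omitted and the field names are our own:

```python
# Sketch: write per-detection embeddings, then run a filtered nearest-
# neighbour search. Assumes a "person_embeddings" collection already
# created with a 512-dim vector field plus camera/zone/ts scalars.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

embedding = [0.0] * 512  # placeholder for a real OSNet-x0_25 output

# Gateway side: one row per detected object
client.insert(collection_name="person_embeddings", data=[{
    "vector": embedding, "camera": "cam-12", "zone": "3", "ts": 1767225600,
}])

# Cloud side: "same person as yesterday, anywhere in Zone 3?"
hits = client.search(
    collection_name="person_embeddings",
    data=[embedding], limit=10,
    filter='zone == "3" and ts > 1767139200',  # metadata facet
    output_fields=["camera", "ts"],
)
```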

Forensic search (“find everyone wearing a red jacket who entered Zone 3 yesterday”) becomes a vector query with metadata filters. For a client who needs it, we typically bolt in Elasticsearch for the metadata facet (timestamp, zone, camera) and Milvus for the vector nearest-neighbor — roughly the same split as modern retail search.

Embeddings raise privacy stakes. An embedding is a biometric identifier under most regulations; treat it as such. Expire embeddings on the same schedule as raw footage, not longer.

Privacy and compliance baked in

GDPR Article 6 lawful-basis analysis, CCPA biometric provisions, and the 2024 EU AI Act classify most public-space object recognition as high-risk processing. The engineering implications are concrete.

Pixel-level anonymization at the source. Face blurring and license plate masking run on the same gateway as detection. The un-anonymized frame never leaves the gateway unless an authorized investigator triggers an escrow unlock with a signed warrant-equivalent audit record. Libraries: OpenCV GaussianBlur for faces when throughput matters, or a dedicated face-parser for segmentation-quality masks.
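
The blur itself is a handful of OpenCV lines — a sketch, with box coordinates assumed to come from the detector’s face or plate class:

```python
# Sketch: pixel-level anonymization on the gateway. GaussianBlur with a
# kernel scaled to the box keeps throughput high; swap in a face parser
# when segmentation-quality masks are required.
import cv2

def blur_regions(frame, boxes):
    """boxes: [(x1, y1, x2, y2)] in pixel coordinates on this frame."""
    for x1, y1, x2, y2 in boxes:
        roi = frame[y1:y2, x1:x2]
        k = max(11, ((x2 - x1) // 3) | 1)  # odd kernel, scales with box width
        frame[y1:y2, x1:x2] = cv2.GaussianBlur(roi, (k, k), 0)
    return frame
```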

Role-based access to raw vs. anonymized feeds. The VMS integration should surface two parallel streams. Default UI shows the anonymized stream. Raw access requires an elevated role and writes an audit event to the SIEM. This is the single most-cited CNIL violation we see.

Retention windows per data class. Raw video: 7–30 days typical, max 90 days without a specific legal hold. Anonymized video: same or shorter. Detection metadata: 1–3 years for analytics use cases, expirable per subject request. Embeddings: same as raw video.
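
Retention works best expressed as code that a scheduled job enforces — a sketch with hypothetical table names; the point is one TTL per data class, not the SQL dialect:

```python
# Sketch: retention schema as code. Table names are hypothetical; a nightly
# job runs these statements and logs what it purged for the audit trail.
from datetime import timedelta

RETENTION = {
    "raw_video":        timedelta(days=30),
    "anonymized_video": timedelta(days=30),
    "detection_events": timedelta(days=365 * 3),
    "embeddings":       timedelta(days=30),  # same window as raw video
}

def purge_statements():
    return [
        f"DELETE FROM {table} WHERE created_at < now() - interval '{ttl.days} days'"
        for table, ttl in RETENTION.items()
    ]
```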

Subject access and erasure. A GDPR data-subject-access-request workflow needs to find every frame containing a given face or plate — the same embedding index used for forensic search. Budget this capability; retrofitting it once the DPO asks is a multi-sprint surprise.

Reach for on-site anonymization when: your cameras record public space or shared workplace areas in the EU, UK, or California — cloud-side blurring creates a custody gap regulators do not accept.

DPO asking pointed questions about your camera analytics pipeline?

We run a two-week privacy-by-design review: anonymization topology, retention schema, subject-access workflow, and GDPR Article 35 DPIA evidence. Leave with an audit pack the DPO can sign.

Book a 30-min privacy review →

Edge, gateway, or cloud: the cost tradeoff

Where inference runs is a unit-economics decision. Below is the rough breakeven math we use on every project kickoff.

Cloud-only (AWS Rekognition Video or Azure Video Indexer). Roughly $0.10 per stream-minute, or about $4,380 per camera per month at 24/7. Fine for 10–50 cameras with low duty cycles; catastrophic above 200.

Gateway-based (Jetson or Hailo on-prem). One $1,500 Jetson Orin NX handles 16–32 streams; amortized over five years, that is under $19 per camera per year in hardware, plus roughly 30 W of draw per box. Software licensing (Milestone XProtect, etc.) is a separate conversation.

Camera-native (Axis ACAP, Ambarella). Zero additional hardware. Model updates ship as signed ACAP packages or camera firmware. Works cleanly only when the camera already has inference silicon; retrofitting older cameras is not possible.

The crossover math: below 40–50 cameras cloud-only is often cheapest because you avoid capex. Between 50 and 200, a Jetson gateway usually wins. Above 200, camera-native becomes attractive because you are already specifying new hardware on a refresh cycle.
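
The unit arithmetic behind those breakpoints, as a sketch — it covers per-stream costs only; the capex avoidance and ops overhead that favor cloud at small counts are not in the formula:

```python
# Sketch of the breakeven math. $0.10/stream-minute and the $1,500 /
# 16-stream gateway come from the text; the power cost is an assumption.
STREAM_MIN_USD = 0.10
cloud_per_cam_month = STREAM_MIN_USD * 60 * 24 * 30.4       # ~$4,380 per camera-month

GATEWAY_USD, STREAMS, YEARS = 1_500, 16, 5
hw_per_cam_year = GATEWAY_USD / YEARS / STREAMS             # ~$18.75 per camera-year
power_per_cam_year = (30 / STREAMS) * 24 * 365 / 1000 * 0.20  # 30 W box, $0.20/kWh

print(f"cloud:   ${cloud_per_cam_month:,.0f} per camera-month (24/7)")
print(f"gateway: ${hw_per_cam_year + power_per_cam_year:,.2f} per camera-year")
```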

Build vs. buy: where custom earns its keep

Off-the-shelf analytics products (BriefCam, iOmniscient, Viakoo, Avigilon Unusual Activity) cover standard workloads — people counting, perimeter breach, abandoned object, loitering, license plate. Custom engineering earns its keep in three cases.

Industry-specific class taxonomies. A packaged model knows “person, car, truck.” Your operations need “forklift vs. pallet jack,” “hard-hat vs. helmet vs. bump-cap,” or “surgical mask vs. N95.” A custom YOLOv9 fine-tune trained on 3,000–10,000 of your own labeled frames will beat a generic model by 5–15 percentage points mAP on the classes that matter.

Cross-system workflows. A detection by itself does nothing. “Object detected AND door unlocked AND access badge not scanned” is a compound event that packaged products do not express. Custom rule engines (Drools, Go-based CEP, or a hand-rolled state machine) close the gap.
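
The hand-rolled option is smaller than it sounds — a sketch of one compound rule as a per-door state machine, with hypothetical event names and Python standing in for whatever language the gateway runs:

```python
# Sketch: "person detected AND door unlocked AND badge not scanned within
# 10 s" as a tiny state machine. One instance per door, fed by the event bus.
import time

class TailgateRule:
    WINDOW_S = 10

    def __init__(self):
        self.unlocked_at = None
        self.badge_at = None

    def on_event(self, ev: dict):
        now = ev.get("ts", time.time())
        if ev["type"] == "door_unlocked":
            self.unlocked_at = now
        elif ev["type"] == "badge_scanned":
            self.badge_at = now
        elif ev["type"] == "person_detected":
            unlocked = self.unlocked_at and now - self.unlocked_at < self.WINDOW_S
            badged = self.badge_at and now - self.badge_at < self.WINDOW_S
            if unlocked and not badged:
                return {"alert": "tailgate", "door": ev["door"], "ts": now}
        return None
```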

Data sovereignty constraints. Packaged SaaS sends frames to the vendor cloud. That is a non-starter for most healthcare, defense, finance, and critical infrastructure customers. A custom stack keeps frames on-site or in a specific region.

Cost model: a 12-week custom integration

Numbers below are Fora Soft 2026 estimates with Agent Engineering, for a custom object-recognition layer built on top of an existing VMS and camera fleet. They are conservative.

Phase 1 — Model and pipeline MVP (3–4 weeks). Label collection (outsourced or existing), YOLOv9 or DETR fine-tune, TensorRT or HailoRT export, basic MQTT event emission, docker-compose gateway. Budget: ~$25K–$45K.

Phase 2 — VMS and metadata integration (3–4 weeks). ONVIF Profile T metadata emission, Milestone MIP or Genetec SDK wiring, VMS event rules, operator UI overlay. Budget: ~$35K–$65K.

Phase 3 — Re-id, search, and privacy (4–6 weeks). Embedding pipeline, Milvus setup, face/plate anonymization, role-based access, audit log, retention jobs, DPIA artifacts. Budget: ~$55K–$95K.

Phase 4 — Hardening (2 weeks). Load testing at target camera count, failover drills, documentation, operator training. Budget: ~$25K–$45K.

Total. ~$140K–$250K for a 50–150 camera deployment; ~$200K–$350K for 150–500 cameras with multi-site. Running costs (Milvus, Kafka, egress, watermarking if any) typically $3K–$15K per month depending on volume.

Mini-case: 220 cameras, three countries, one forklift problem

A logistics operator with 220 cameras across warehouses in Germany, Poland, and the UK asked us to build a “forklift-crossing-pedestrian” alert that their existing Milestone XProtect installation could surface as a first-class event. Packaged analytics priced the work at roughly €18 per camera per month on top of their existing licensing; worse, the packaged model emitted a 30% false-positive rate on their mezzanine layout.

The fix was a 14-week custom build. We labeled 7,400 frames from their own footage, fine-tuned YOLOv9-C on “forklift / pedestrian / pallet-jack”, deployed on a Hailo-8 M.2 card plugged into one small-form-factor PC per warehouse, emitted ONVIF Profile T metadata plus a Milestone MIP SDK event for compound-rule evaluation, and shipped an operator overlay that surfaced the compound event in the existing XProtect UI. False-positive rate dropped to 4.1% on the validation set; detection latency end-to-end settled at 180 ms.

Running cost after go-live came in at roughly €3.40 per camera per month (Hailo hardware amortization plus a tiny Kafka cluster shared across sites). Want a similar assessment for your fleet? Book a 30-min camera analytics review — bring your VMS, camera brands, and a handful of false-positive videos.

Packaged analytics drowning your operators in false positives?

We benchmark your current false-positive rate, fine-tune a custom detector on your own footage, and ship it into your VMS. Four to six weeks to a measurable drop.

Book a 30-min analytics audit →

A decision framework in five questions

1. How many cameras, and where are they concentrated? Under 40, cloud-only is probably cheapest. Over 200, camera-native or gateway. 40–200 usually settles on Jetson or Hailo gateways.

2. What VMS is already in place? Milestone XProtect, Genetec Security Center, Avigilon, or custom. The VMS sets the metadata contract (ONVIF Profile T plus vendor SDK) and the operator UI path.

3. Do your classes exist in a stock model? If “person, vehicle, animal” covers it, start with a stock YOLO fine-tune or a packaged analytics product. If you need “forklift vs. pallet jack,” budget a labeling round.

4. What regulation applies? EU, UK, or California deployments need privacy-by-design from sprint one. APAC and LATAM vary by jurisdiction; legal should weigh in before any embedding leaves the device.

5. Does the problem require compound events? If yes, plan for a CEP engine or rule-based state machine from the start. Retrofitting compound logic into a single-detection pipeline always doubles the timeline.

Five pitfalls that derail camera-analytics projects

1. Training on stock COCO and calling it done. COCO labels are biased toward consumer imagery. Warehouse lighting, indoor industrial scenes, and low-light cameras sit outside the distribution. Budget a 3,000–10,000 frame custom label round; it determines the entire project’s mAP.

2. Picking Jetson because it is familiar, then needing Hailo anyway. Jetson is flexible but power-hungry; Hailo is efficient but limited to two or three model graphs. A one-hour hardware-choice workshop before the first PO saves weeks of rework.

3. Forgetting the ONVIF metadata timestamp. A 40 ms drift between video frame and metadata frame confuses the VMS tracker. We have seen deployments ship with a 300 ms drift because nobody tested the metadata pipe end-to-end in the VMS.

4. Embedding the wrong vector dimension. 128-d is too small for person re-id across months; 1024-d is wasteful and slow. 512-d OSNet or MobileFaceNet embeddings are the sweet spot for enterprise-scale Milvus indexes.

5. Treating privacy as a final-week checkbox. Anonymization, RBAC, and retention need to be designed into the data flow, not bolted on. A single CNIL fine above €100,000 has killed more custom analytics projects than any technical failure.

KPIs worth putting on the dashboard

Model KPIs. mAP on the customer’s own validation set (not COCO), false-positive rate per 24 h per camera, false-negative rate on the top three operationally critical classes. Re-evaluate after every deployment round.

Pipeline KPIs. End-to-end detection latency p95 (camera to VMS event), metadata timestamp drift, stream loss rate, inference queue depth per gateway. Page the on-call if p95 drifts past 400 ms.

Business KPIs. Alarms per operator shift, operator acknowledgment latency, incidents prevented (ties to customer’s own incident log), cost per detected event. These are what keep the project funded for year two.

When a custom build is the wrong answer

Three situations call for a packaged product instead. If your deployment is under 40 cameras and single-site, Rekognition Video plus a basic VMS like Milestone XProtect Essential+ will cost less over three years than the custom integration. If your use case is a solved commodity (license plate recognition, perimeter breach, mask detection circa 2021), the packaged analytics from Axis, BriefCam, or Avigilon are already trained on tens of thousands of hours of footage you cannot match. And if your security team has no AI operational expertise, a SaaS vendor’s managed service beats a custom pipeline running on hardware nobody on staff understands.

Custom earns its keep when the class taxonomy is proprietary, when the workflow is compound, or when data sovereignty is non-negotiable. Otherwise, buy.

FAQ

How many labeled frames do we need to fine-tune a detector?

For a single new class on a YOLO-family fine-tune, 3,000–5,000 labeled frames covering varied lighting and angles is usually enough to exceed a packaged model on your own footage. For 5–10 new classes or rare events, budget 10,000–30,000. Active-learning loops — label the model’s low-confidence frames — are more efficient than random sampling past the first thousand.
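
A sketch of that selection loop with Ultralytics — the 0.2–0.6 confidence band is a tunable heuristic, not a fixed rule, and the paths are hypothetical:

```python
# Sketch: active-learning frame selection -- queue only the frames where
# the current model is unsure, and send those to the labeling round.
from pathlib import Path
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # last fine-tune round

to_label = []
for frame in Path("unlabeled_frames").glob("*.jpg"):
    result = model.predict(str(frame), conf=0.1, verbose=False)[0]
    confs = result.boxes.conf.tolist()
    if any(0.2 < c < 0.6 for c in confs):  # hovering in the uncertain band
        to_label.append(frame)

print(f"{len(to_label)} frames queued for the next labeling round")
```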

Can we run YOLOv9 on an existing Axis camera?

Only on cameras with ARTPEC-7 or ARTPEC-8 chips that support ACAP-native ML inference. Full YOLOv9-E will not fit, but quantized YOLOv8n deploys cleanly. For older ARTPEC cameras without an ML accelerator, run inference on a Jetson or Hailo gateway adjacent to the camera and emit ONVIF metadata back to the VMS from there.

What is the realistic stream density on a Jetson Orin Nano?

Eight 1080p streams running YOLOv8m at 15 fps detection cadence, or sixteen streams at 10 fps. Orin NX doubles that. TensorRT INT8 plus DeepStream pipeline tuning is the difference between “works” and “falls over at 6 streams.”

Is AWS Rekognition Video cheaper than running our own Jetson?

Below roughly 40 cameras, usually yes — because you avoid gateway capex and the cameras rarely demand 24/7 analysis. At a full 24/7 duty cycle the math flips hard: Rekognition Video at $0.10 per stream-minute runs to about $4,380 per camera per month, while a $1,500 Jetson Orin NX running 16 streams is about $19 per camera per year in hardware plus power. The crossover depends heavily on how many hours per day the cameras are actually active.

Does ONVIF Profile T work with Milestone XProtect out of the box?

Partially. XProtect consumes Profile T metadata for on-screen overlays and basic event rules, but any compound rule (“object detected AND badge not scanned”) needs the Milestone MIP SDK. We routinely ship both: ONVIF for the overlay, MIP SDK events for the rule engine. Genetec Security Center has the same split with its own SDK.

How do we handle GDPR when the cameras record a public street?

Document a legitimate-interest basis under Article 6(1)(f), complete a DPIA under Article 35, apply pixel-level face blurring before any frame leaves the gateway, stage raw footage behind a role-based access control with audit logging, and set retention to the minimum that serves your documented purpose (typically 30 days). The DPO must sign off; do not skip this step.

What changes with the EU AI Act?

Real-time remote biometric identification in public spaces is largely prohibited with narrow exceptions. Emotion recognition in workplaces and schools is prohibited. High-risk systems — most enterprise video analytics with biometric features — require risk management, data governance, logging, human oversight, and a conformity assessment. Plan for a formal technical file and an EU representative if you are non-EU.

How long does a realistic 50-camera rollout take?

Twelve to sixteen weeks from kickoff to production with Agent Engineering, assuming an existing VMS and accessible camera footage for training. Week one to four is model and pipeline; five to eight is VMS integration; nine to twelve is privacy, search, and hardening. The labeling round usually runs in parallel with pipeline work.

Ready to ship object recognition your VMS actually understands?

Object recognition in 2026 is three decisions stacked: the detector (YOLOv9, RT-DETR, or a custom tiny classifier), where it runs (on-camera, Jetson or Hailo gateway, or a thin cloud layer), and how it speaks to the VMS (ONVIF Profile T plus the vendor SDK). Get those three right and the software half of the problem is mostly solved. Get any one wrong and the false-positive rate will grind your operators down until they stop acknowledging alerts.

Privacy and compliance are the invisible fourth decision. Anonymization, role-based access, retention, and audit trails are what separate a product that ships from a project that gets killed by legal. Build them in from sprint one.

Let’s size your object-recognition integration

Tell us your camera count, VMS, and the class taxonomy that matters — we will come back with a twelve-week plan, hardware shortlist, and a fixed-price estimate.

Book a 30-min call →
