
Key takeaways
• VALT proves the model. The intelligent video surveillance system Fora Soft built generates $9.7M ARR, supports 2,500 IP cameras, and serves 25,000 daily users across 650 US organizations.
• The market backs it. Global VMS spend jumps from $54.4B (2024) to $88.7B (2030) at 8.5% CAGR; AI video analytics alone grows 22.7% CAGR through 2031.
• Intelligence beats raw recording. Motion-triggered capture, spoken-word search, PTZ presets, transcription PDFs, and mobile-camera ingest are the features that close enterprise deals in 2026.
• Architecture decides scale. ONVIF Profile T cameras, H.265 encoding, Wowza ingest, and a Vue 3 / Node.js / Symfony stack let one cluster handle 2,500 cameras without bandwidth blow-ups.
• Build vs. buy is a camera-count question. Under 100 cameras — buy Verkada or Eagle Eye. Over 500 cameras with custom workflows — the build math wins, and Agent Engineering at Fora Soft compresses the timeline below industry averages.
Why Fora Soft wrote this playbook
We did not write this article from a marketing brief. We wrote it because we shipped VALT — the intelligent video surveillance system featured below — and now operate it for 650+ US organizations including police departments, child advocacy centers, and medical schools. Every architecture call, every codec choice, every PTZ workflow on this page was tested in production against real interrogation rooms, real training simulations, and real recorded evidence chains.
VALT is not a one-off. Our portfolio includes drone-based aerial surveillance for land control, custom anomaly-detection ML pipelines, ONVIF Profile M analytics integrations, and mobile-camera ingest for clients who needed surveillance beyond a fixed camera grid. Roughly 40% of our active engineering capacity sits in video, streaming, and computer vision — the disciplines a modern video surveillance development partner needs to ship a credible product.
If you are scoping a VMS, an intelligent monitoring add-on to an existing camera fleet, or a vertical surveillance product (police, healthcare, education, retail, smart city), this article walks the same trade-offs we walked with our clients — from streaming protocol selection down to dollar costs — and shows how Fora Soft delivers the build using Agent Engineering to compress timelines and price.
Scoping your own intelligent video surveillance system?
Get a 30-minute architecture review with the team that shipped VALT — bandwidth math, codec choices, ONVIF compatibility, ML feature roadmap.
The VALT case in numbers
VALT is an intelligent video surveillance system — or, in industry terms, a Video Management System (VMS) with native AI analytics. Below is what the business looks like today.
| Metric | Today | What it proves |
|---|---|---|
| Annual revenue | $9.7M | A vertical, intelligent VMS scales financially without competing with Verkada head-on. |
| IP cameras supported | 2,500 | Architecture handles enterprise camera density, not just SMB demos. |
| Daily active users | 25,000 | Concurrent stream playback & RBAC scale beyond pilot deployments. |
| Customer organizations | 650 (US) | Tenant isolation, audit logging, billing all proven at scale. |
| Vertical mix | Police, medical schools, child advocacy | Three regulated environments running on one product — CJIS, HIPAA, FERPA touchpoints all covered. |
| Headline AI features | Motion-triggered recording, spoken-word video search, mobile-camera ingest, transcription-to-PDF | Differentiation against generic NVR / cloud VMS competitors. |
The product video below shows the workflow operators actually use. It is the real interface, not a marketing render.
Watch the 2-minute VALT product walkthrough on YouTube →
What intelligent video monitoring actually means in 2026
The phrase is used loosely. To buyers, “intelligent” can mean a marketing-grade motion sensor or a full ML pipeline that classifies behaviour. Below is the working definition we use when scoping projects — built from the features modern enterprise procurement actually checks.
Tier 1 — Smart capture (table-stakes in 2026)
Motion-triggered recording. The camera or the server detects movement and only captures relevant clips, slashing storage cost by 60–90%. VALT’s motion-detection feature is what lets a 30-day retention policy fit on commodity disk.
PTZ presets and scheduled recording. Operators define camera positions for shift changes — one click sweeps the whole interrogation room or the whole lab bench. Combine with a scheduler and the camera covers a six-hour simulation without an operator.
Push-to-talk audio. Two-way voice over the same camera is the line between “CCTV” and “monitoring system.” Required for medical instructors and for child-advocacy interviewers.
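As a rough illustration of how motion-triggered capture keeps only the relevant footage, the sketch below turns per-frame motion scores into padded recording intervals. It is a minimal sketch and not VALT's implementation: the score scale, the threshold, and the pre-roll and post-roll values are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start_s: float
    end_s: float


def motion_clips(scores, fps=15.0, threshold=0.2, pre_roll_s=2.0, post_roll_s=5.0):
    """Turn per-frame motion scores (assumed 0..1) into padded recording intervals."""
    clips = []
    for frame_index, score in enumerate(scores):
        if score < threshold:
            continue  # no movement in this frame
        t = frame_index / fps
        start = max(0.0, t - pre_roll_s)
        end = t + post_roll_s
        if clips and start <= clips[-1].end_s:
            # Overlaps the previous clip: extend it instead of opening a new one.
            clips[-1].end_s = max(clips[-1].end_s, end)
        else:
            clips.append(Clip(start, end))
    return clips


def recorded_fraction(clips, total_s):
    """Share of the timeline that ends up on disk."""
    kept = sum(min(c.end_s, total_s) - c.start_s for c in clips)
    return kept / total_s


# Two minutes of video at 15 fps with one second of movement starting at t = 10 s.
scores = [0.0] * 150 + [0.9] * 15 + [0.0] * 1635
clips = motion_clips(scores)
print(clips, recorded_fraction(clips, total_s=120.0))
```

The pre-roll and post-roll padding is a design choice worth tuning with the customer: too short and reviewers lose context, too long and the storage savings shrink.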
Tier 2 — Searchable evidence
Spoken-word search. Type a word, and the system jumps to every moment it was said. VALT runs Amazon Transcribe on the audio track and indexes the transcripts. For police interrogations, this turns an 8-hour video into a one-keystroke evidence finder. We cover the underlying integration on our custom speech-to-text development service page.
Time-stamped notes. Reviewers tag specific frames; the tags become hyperlinks in an exported PDF. The PDF is the artefact that goes to court, into student-feedback files, or into a child-protection case.
Granular permissions. Per-camera, per-folder, per-user RBAC backed by an immutable audit log. Without it the product cannot pass CJIS or HIPAA review.
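The spoken-word search in this tier boils down to an index from words to timestamps. The sketch below builds an in-memory version from a transcript shaped like Amazon Transcribe's JSON output (a `results.items` list in which spoken words have `type` set to `pronunciation` and string-valued `start_time`). The function names are hypothetical, and a production system would push the same word-and-timestamp pairs into a search engine rather than a Python dict.

```python
from collections import defaultdict


def build_word_index(transcribe_json):
    """Map each lower-cased word to the start times (seconds) where it was spoken."""
    index = defaultdict(list)
    for item in transcribe_json["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation items carry no timestamps
        word = item["alternatives"][0]["content"].lower()
        index[word].append(float(item["start_time"]))
    return index


def find_word(index, word):
    """Return every offset at which the word occurs, earliest first."""
    return sorted(index.get(word.lower(), []))


# Hand-written sample in the assumed shape; not real transcript data.
sample = {
    "results": {
        "items": [
            {"type": "pronunciation", "start_time": "1.20", "end_time": "1.45",
             "alternatives": [{"content": "The"}]},
            {"type": "pronunciation", "start_time": "1.45", "end_time": "1.90",
             "alternatives": [{"content": "address"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
            {"type": "pronunciation", "start_time": "62.00", "end_time": "62.40",
             "alternatives": [{"content": "Address"}]},
        ]
    }
}

index = build_word_index(sample)
print(find_word(index, "address"))
```

Each returned offset can be handed to the player as a seek position, which is what makes the search feel like a single keystroke to the reviewer.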
Tier 3 — ML-driven understanding
Object & person detection. YOLOv8/YOLOv11 pipelines tag people, vehicles, faces, packages. Our deep dive on detection models is in Top 7 Anomaly-Detection Models for Video Surveillance.
Anomaly detection. Unsupervised models flag unusual movement patterns — loitering, after-hours presence, abnormal crowd density. See real-time anomaly detection in video surveillance for the inference patterns we use.
Behaviour analytics. Counting, dwell time, queue length, PPE compliance, fall detection. The technical playbook is documented in our guide on integrating video analytics with surveillance.
Reach for the full ML stack when: the buyer’s primary KPI is operator hours saved (false-alert reduction, automated incident review) rather than just compliance recording. Below ~50 cameras, table-stakes Tier 1+2 ships the same outcome at half the build cost.
VALT under the hood — the stack we shipped
Architecture is where most VMS builds break. We chose a stack that is open, fast to iterate, and cheap to scale.
| Layer | Tech | Why we picked it |
|---|---|---|
| Frontend | Vue 3 (Composition API) | Fast iteration on a dense operator UI; reactive refs map naturally to camera grid state. |
| API & auth | Symfony 5 (PHP) | Mature RBAC, easy audit-log middleware, strong test ecosystem — matters for CJIS/HIPAA scoping. |
| Realtime / signalling | Node.js + Socket.io | Browsers, mobile clients and the Symfony API speak the same event protocol with sub-100ms latency. |
| Streaming engine | Wowza Streaming Engine | RTSP ingest from cameras, transcode to HLS/WebRTC, scales horizontally on commodity Linux. We compared the alternatives in our P2P vs MCU vs SFU piece. |
| ASR / transcription | Amazon Transcribe | Pay-per-minute, custom vocabularies for legal/medical jargon, English & Spanish out of the box. |
| Storage | S3 + lifecycle tiering | Hot 7d → warm 30d → Glacier; cuts retention cost ~70% versus single-tier. |
| Camera ingest | ONVIF Profile T + RTSP, plus iOS/Android SDK for mobile-as-camera | Profile T covers H.265, HTTPS auth, modern PTZ. Mobile ingest unlocks pop-up sites without buying hardware. |
The deep dive on ONVIF Profile M and where it fits next to Profile T lives in our ONVIF Profile M explainer.
Reference architecture for an intelligent VMS
The table below is the canonical pipeline we deploy. Cameras push RTSP into a Wowza cluster, the cluster fans out HLS for many viewers and WebRTC for the live ops console, and parallel jobs feed the ML and ASR layers.
| Stage | Component | Output |
|---|---|---|
| 1. Capture | ONVIF Profile T cameras / mobile SDK | RTSP H.265 stream |
| 2. Ingest & transcode | Wowza cluster (Linux, autoscaling) | HLS (browsers), WebRTC (ops), MP4 (archive) |
| 3. Storage | S3-compatible object storage with lifecycle | Hot/warm/cold archive with TTL |
| 4. ASR | Amazon Transcribe / Whisper Large-v3 | JSON transcript with word-level timestamps |
| 5. ML inference | YOLOv8/v11 on Jetson Orin or T4 cloud GPU | Object/person/anomaly events |
| 6. Search index | OpenSearch / Elastic, time-aligned | Spoken-word + object search |
| 7. API | Symfony 5 + Node.js Socket.io | Auth, RBAC, audit log, realtime events |
| 8. Client | Vue 3 web + native iOS/Android | Operator console, mobile review, PDF export |
The same eight-stage layout works whether you have 50 cameras in one school or 2,500 across a national chain — you scale ingest and ML horizontally, keep the API and search single-tenant per customer.
Streaming protocol decision — RTSP, HLS, or WebRTC
A surveillance product almost always needs all three protocol families, with HLS in both its low-latency and classic forms. Buyers and engineers confuse them, then over-budget. The split below is what we run on VALT.
| Protocol | Latency | Concurrent viewers | Use it for |
|---|---|---|---|
| RTSP | 50–200 ms | 1 (server-only) | Camera-to-server ingest. Never expose to browsers. |
| WebRTC | ~300 ms | 100s per SFU | Live ops console, two-way audio, PTZ control. The latency floor for “real-time.” |
| LL-HLS | 2–5 s | Unlimited (CDN) | Mass dashboards, mobile review, low-end devices. |
| HLS (classic) | 8–12 s | Unlimited (CDN) | Recorded playback, evidence review, audit exports. |
Reach for WebRTC when: the operator must speak into the room (push-to-talk), control PTZ in real time, or react inside two seconds. Otherwise default to LL-HLS — cheaper at scale, fewer firewall headaches.
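The viewer-side half of this split can be written down as a small decision function. This is a sketch of the rule of thumb above, not a routing component from VALT; the function name and its parameters are hypothetical. RTSP is left out because it stays on the camera-to-server leg.

```python
def pick_viewer_protocol(needs_two_way_audio, needs_live_ptz, max_latency_s, recorded=False):
    """Choose a delivery protocol for one viewer session."""
    if recorded:
        return "HLS"  # recorded playback, evidence review, audit exports
    if needs_two_way_audio or needs_live_ptz or max_latency_s < 2.0:
        return "WebRTC"  # operator must talk, steer, or react inside two seconds
    return "LL-HLS"  # default for dashboards and mobile review


print(pick_viewer_protocol(True, False, 5.0))
print(pick_viewer_protocol(False, False, 5.0))
print(pick_viewer_protocol(False, False, 10.0, recorded=True))
```

Keeping the rule in one place makes it easy to audit how many sessions actually need the more expensive WebRTC path.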
Picking between Wowza, Ant Media, and Janus?
We’ve shipped production VMS deployments on all three. We’ll match the engine to your camera count, latency target, and budget in one call.
Bandwidth and storage — the numbers that kill bad budgets
VMS budgets blow up when buyers underestimate camera bitrate. The math below is the rule-of-thumb we use during scoping.
| Resolution | Codec | Bitrate | GB/day per camera | 100 cameras / 30 days |
|---|---|---|---|---|
| 720p @ 15 fps | H.265 | 0.3 Mbps | 3.2 GB | ~10 TB |
| 1080p @ 30 fps | H.265 | 0.5 Mbps | 5.4 GB | ~16 TB |
| 1080p @ 30 fps | H.264 | 1.0 Mbps | 10.8 GB | ~32 TB |
| 4K @ 30 fps | H.265 | 1.5–3 Mbps | 16–32 GB | ~50–100 TB |
| 1080p @ 30 fps + motion-only | H.265 | 0.5 Mbps (peak) | ~1 GB (typical) | ~3 TB |
Three rules fall out of this table. One: H.265 is non-negotiable; it halves storage at the same visual quality. Two: motion-triggered recording cuts another 70–80%. Three: use tiered storage — hot for 7 days, warm for 30, cold for the legal-hold tail — or your S3 bill becomes the line-item the CFO blocks.
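The arithmetic behind the table is simple enough to script. The sketch below reproduces it using decimal units (1 GB = 1,000 MB) and constant bitrate; the `duty_cycle` parameter is an assumption standing in for motion-only recording, where 0.2 roughly matches the ~1 GB/day row.

```python
def gb_per_day(bitrate_mbps):
    """Mbit/s -> MB/s (divide by 8) -> MB/day (x 86,400 s) -> GB/day (divide by 1,000)."""
    return bitrate_mbps / 8 * 86_400 / 1000


def fleet_tb(bitrate_mbps, cameras, retention_days, duty_cycle=1.0):
    """Total storage in TB for a fleet; duty_cycle is the share of time actually recorded."""
    return gb_per_day(bitrate_mbps) * duty_cycle * cameras * retention_days / 1000


# 1080p @ 30 fps, H.265 row: 0.5 Mbps, 100 cameras, 30 days.
print(gb_per_day(0.5), fleet_tb(0.5, cameras=100, retention_days=30))

# Same cameras with motion-only recording at an assumed 20% duty cycle.
print(fleet_tb(0.5, cameras=100, retention_days=30, duty_cycle=0.2))
```

Swapping in measured bitrates from a pilot site is the quickest way to turn this rule of thumb into a budget line.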
Build vs. buy — the camera-count rule
The honest answer is shaped by camera count, vertical, and how custom the workflows are. Below is the framework we walk through with prospects, including the cases where we explicitly recommend buying a vendor product instead of hiring us.
| Scenario | Recommendation | Why |
|---|---|---|
| < 100 cameras, generic use case | Buy Verkada / Eagle Eye / Rhombus | Hardware + cloud subscription cheaper than a custom build. |
| 100–500 cameras, custom analytics | Hybrid: Milestone or Genetec SDK + custom ML layer | Reuse battle-tested VMS core; differentiate on analytics. |
| > 500 cameras, vertical product | Build custom (this is VALT’s zone) | Margin and product-market fit only show up with full control of UX, RBAC, and pricing. |
| Regulated workflow (police, hospital, court) | Build custom or hybrid on-prem | CJIS/HIPAA audit logging and data residency are easier when you own the stack. |
| SaaS go-to-market (you sell the product) | Build custom | You cannot resell Verkada. You can resell what we build for you. |
Cost model — what an MVP and a production VMS actually run
We give ranges below, not single numbers, because cameras-per-tenant, ML scope, and compliance scope all dominate. The ranges reflect Fora Soft pricing using Agent Engineering — our internal AI-assisted delivery process — which lands consistently below the industry-average $132K, 13-month custom-software baseline analysts publish for projects of this size.
| Scope | What is included | Indicative range | Calendar |
|---|---|---|---|
| MVP — smart capture | ONVIF ingest, motion recording, web playback, RBAC, audit log | $60K–$110K | 3–4 months |
| Searchable evidence layer | ASR transcription, time-stamped notes, PDF export | +$25K–$45K | +1–2 months |
| ML analytics layer | Object/person detection, anomaly alerts, dashboards | +$30K–$70K | +2–3 months |
| Mobile-as-camera + iOS/Android client | Native apps, mobile RTSP encoder, push notifications | +$35K–$60K | +2 months |
| Compliance hardening (CJIS/HIPAA/SOC 2) | Encryption review, audit reports, vendor due-diligence pack | +$20K–$40K | +1 month |
For run-rate, plan on $4–$9 per camera per month for cloud bandwidth + storage at 1080p H.265 with motion recording, plus ML inference cost (Jetson Orin NX at the edge ~$350 hardware one-off; cloud GPU inference roughly $0.30 per camera per day at YOLOv8 quality).
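To turn the run-rate figures above into a monthly estimate for a given fleet, a short calculation helps. The sketch below uses only the numbers quoted in the paragraph ($4–$9 per camera per month, ~$0.30 per camera per day for cloud inference); the 30-day month and the function name are assumptions, and edge hardware is excluded because it is a one-off cost.

```python
def monthly_run_rate(cameras, per_camera_low=4.0, per_camera_high=9.0,
                     cloud_inference_per_day=0.30, days=30, cloud_inference=True):
    """Return a (low, high) monthly cost range in USD for bandwidth, storage, and inference."""
    inference = cameras * cloud_inference_per_day * days if cloud_inference else 0.0
    low = cameras * per_camera_low + inference
    high = cameras * per_camera_high + inference
    return low, high


# 500 cameras with cloud GPU inference, then the same fleet with inference at the edge.
print(monthly_run_rate(500))
print(monthly_run_rate(500, cloud_inference=False))
```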
Security & compliance — what regulated buyers will ask
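VALT runs in three regulated verticals, and each one tested a different compliance angle (items 1–3). Items 4 and 5 are cross-cutting requirements that enterprise and EU buyers raise regardless of vertical.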
1. CJIS for police use. Criminal-justice data demands tamper-evident audit logs, encryption in transit and at rest, and strict separation between agencies. We isolate per-tenant storage at the bucket level, sign every write to the audit log with a hash chain, and deny export without a justification field.
2. HIPAA for medical training. Patient-identifiable footage from simulation suites needs encryption keys controlled by the institution and break-glass auditing on every replay. We support per-tenant KMS, BAA-ready cloud regions, and one-click revocation.
3. FERPA for child advocacy and education. Footage of minors carries parental-consent requirements and short retention limits. We expose retention policies as per-folder TTLs that the institution, not an engineer, sets in the UI.
4. SOC 2 Type II. Enterprise procurement increasingly demands it as table-stakes. The audit cost typically lands around $25K–$40K for the readiness assessment plus annual auditor fees, so build the controls into sprint 1 rather than retrofitting them in year 2.
5. GDPR for EU footage. Right-to-erasure on individuals is a real engineering ask — you need cryptographic key wipe at the segment level, plus a data-processing-agreement template ready for prospects.
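The hash-chained audit log mentioned under CJIS is easier to reason about with a concrete example. The sketch below shows the general technique: each entry stores the hash of the previous one, so editing any past entry breaks verification. It is a minimal illustration under assumptions, not VALT's code; the field names are hypothetical, and a production log would also record timestamps and anchor or sign the chain head outside the database.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, actor, action, resource, justification=None):
    """Append a tamper-evident entry; exports are refused without a justification."""
    if action == "export" and not justification:
        raise ValueError("export requires a justification")
    body = {
        "actor": actor,
        "action": action,
        "resource": resource,
        "justification": justification,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify_chain(log):
    """Recompute every hash and check each link to the previous entry."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, "det.smith", "view", "cam-12/clip-7")
append_entry(log, "det.smith", "export", "cam-12/clip-7", justification="case file request")
print(verify_chain(log))
```

A plain hash chain is tamper-evident rather than tamper-proof: someone with write access could rewrite the whole chain, which is why the head hash is usually stored or signed somewhere the application cannot modify.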
Use cases that pay for the product
A common mistake in surveillance products is selling “security” when buyers are paying for “evidence,” “training,” or “process control.” The verticals below are the ones that actually fund custom VMS work.
Law enforcement and corrections
Police interrogations, body-camera evidence intake, court holding cells. PTZ presets cover the standard interrogation room layout; spoken-word search lets a detective find “he said the address” in seven seconds instead of seven hours; transcription PDF goes straight into the case file. CJIS audit log is the gating feature.
Medical and clinical training
Simulation centres at medical schools record OSCE exams, surgical residency drills, and standardized-patient encounters. Examiners zoom into a hand technique, time-stamp a feedback note, and export per-student PDF reports. The same workflow ports to nursing schools, paramedic academies, and dental simulation labs.
Child advocacy and forensic interviewing
Centres recording interviews with vulnerable minors run on extreme privacy and auditability. Two-camera coverage (face + room), tamper-evident audit, and short retention windows are non-negotiable. The PDF export becomes the artefact shared with prosecutors and child-protection officers.
Adjacent verticals worth scoping
Manufacturing safety: PPE-compliance object detection feeding alerts to floor managers. Retail loss prevention: dwell-time and exit-anomaly detection with LPR at the parking lot. Smart city: traffic counting, crowd density, vehicle classification. Drone-based aerial monitoring: covered in our DSI drones case study — a different camera class, identical pipeline downstream.
Mini case — what shipping VALT taught us
Situation. The client came to Fora Soft with a working but bandwidth-bound product: roughly 400 cameras, frequent stream stalls when more than 30 reviewers were live, and an evidence search that scanned filenames only. The roadmap demanded scaling to thousands of cameras, adding spoken-word search, and supporting iPads as cameras inside child-advocacy interview rooms — without tripling the cloud bill.
Plan. Twelve weeks. Sprint 1–2: replace the stream encoder with Wowza, switch cameras to H.265 ingest. Sprint 3–4: layer Amazon Transcribe + OpenSearch for word search. Sprint 5–6: ship native iOS/Android encoders so phones become RTSP cameras. Sprint 7–8: PTZ preset workflow, time-stamped notes, PDF export. Sprint 9–10: tenant-isolated audit logging, CJIS-grade encryption review, soak test at 2,500 cameras.
Outcome. Camera count rose from 400 to 2,500 (+525%); daily users crossed 25,000; bandwidth per camera dropped 48% thanks to H.265 + motion-only recording; revenue grew to the $9.7M/year run-rate the product holds today. The product now serves 650 organizations across police, medical, and child-advocacy verticals on the same multitenant codebase. Want a similar 12-week assessment for your VMS?
Edge AI vs. cloud inference for ML features
Once you add ML, you decide where it runs. The table below compares the main hardware options on throughput, power, and cost.
| Hardware | Throughput (YOLOv8) | Power | Cost | Best for |
|---|---|---|---|---|
| NVIDIA Jetson Orin NX | ~42 FPS | 15 W | $300–$400 | Multi-camera edge gateway, anomaly detection |
| NVIDIA Jetson Nano | ~12 FPS | 5–10 W | $150–$200 | Single-camera kiosk, prototype |
| Google Coral TPU | ~22 FPS | 2–4 W | $100–$150 | Battery-powered IoT, TFLite-only models |
| Cloud GPU pool (T4/L4) | 100s FPS per GPU | n/a | ~$0.30 per camera per day | 1,000+ cameras, frequent model retraining |
Reach for cloud GPU inference when: your model fleet changes monthly, customer-specific tuning is on the roadmap, or camera count exceeds ~1,000. Otherwise edge Jetson Orin is cheaper and avoids the bandwidth tax of pushing every frame to the cloud.
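A back-of-envelope break-even makes the edge-versus-cloud trade-off concrete. The sketch below uses the table's figures (~42 FPS and ~$350 per Jetson Orin NX, ~$0.30 per camera per day in the cloud). The analysis rate of 5 frames per second per camera is an assumption, and the calculation deliberately ignores power, installation, and operations cost.

```python
import math


def edge_gateways_needed(cameras, analysis_fps_per_camera=5.0, gateway_fps=42.0):
    """Gateways required if each camera is analysed at the given frame rate."""
    return math.ceil(cameras * analysis_fps_per_camera / gateway_fps)


def breakeven_days(cameras, gateway_cost=350.0, cloud_per_camera_day=0.30,
                   analysis_fps_per_camera=5.0, gateway_fps=42.0):
    """Days after which one-off edge hardware costs less than cloud inference fees."""
    gateways = edge_gateways_needed(cameras, analysis_fps_per_camera, gateway_fps)
    return gateways * gateway_cost / (cameras * cloud_per_camera_day)


# A 40-camera site analysed at an assumed 5 FPS per camera.
print(edge_gateways_needed(40), breakeven_days(40))
```

The break-even point moves quickly with the analysis frame rate, so it is worth fixing that number with the customer before comparing quotes.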
A decision framework — pick a VMS path in five questions
1. Are you reselling the product, or just using it? If you plan to sell it as SaaS, you must build — you cannot resell Verkada. If you only need it internally, COTS is almost always cheaper.
2. How many cameras in 24 months? Under 100 favors COTS. 100–500 favors hybrid (Milestone/Genetec SDK + custom UI). Over 500 with a vertical workflow favors a custom build like VALT.
3. What is the regulatory floor? CJIS, HIPAA, FERPA, GDPR, SOC 2 each pull architecture in slightly different directions. The earlier you fix the floor, the cheaper the build.
4. Latency target? Sub-second → WebRTC + edge inference. 5+ seconds OK → LL-HLS + cloud inference, half the cost.
5. ML model lifecycle? If customers’ analytics requests change every quarter, plan a feedback-loop pipeline (label-store, retraining cadence) into sprint 0. Bolting it on later is the most expensive shortcut in this space.
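The first three questions reduce to a few branches, sketched below as a hypothetical helper. It encodes the thresholds from this article (100 and 500 cameras) and nothing more. The final fallback, for large fleets without a vertical workflow, is an assumption, since the framework does not cover that case explicitly.

```python
def recommend_path(cameras_24mo, resell_as_saas=False, vertical_workflow=False, regulated=False):
    """Map the framework's first three answers to a build, buy, or hybrid recommendation."""
    if resell_as_saas:
        return "build"  # a vendor product cannot be resold
    if regulated:
        return "build or hybrid on-prem"  # owning the stack eases audit and residency
    if cameras_24mo < 100:
        return "buy"
    if cameras_24mo <= 500:
        return "hybrid"
    # Over 500 cameras: build when the workflow is vertical; otherwise hybrid (assumption).
    return "build" if vertical_workflow else "hybrid"


print(recommend_path(60))
print(recommend_path(300))
print(recommend_path(1200, vertical_workflow=True))
```

Latency target and model lifecycle (questions 4 and 5) shape the architecture rather than the build-or-buy verdict, so they are left out of this function.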
Pitfalls we have watched VMS teams fall into
1. Underestimating bandwidth at peak. Average bitrate is comforting; peak-hour bitrate is what saturates your uplink. Provision 1.5× theoretical peak, not 1× theoretical average.
2. Ignoring time sync. Without NTP discipline across cameras, evidence chronology fragments. The audit log becomes inadmissible the day the first lawyer notices.
3. Bolting compliance on at the end. CJIS-grade audit logging costs roughly 3× as much to retrofit as to design in. The same is true for HIPAA encryption.
4. Skipping false-positive tuning. A motion-detection feature that fires on a passing cloud at 4 AM destroys adoption faster than no feature at all. Tune thresholds with the customer in week 2, not month 6.
5. Naive single-tier storage. 1080p 24/7 on a single S3 tier costs roughly 4× what tiered storage costs at 30-day retention. The CFO finds this in month 3 and the project bleeds.
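Pitfall 1 is the easiest of these to check numerically. The sketch below sizes an uplink from per-camera peak bitrates with the 1.5× headroom recommended above and contrasts it with sizing from the average. The 2 Mbps peak is a hypothetical figure for illustration; measure real peaks on site.

```python
def required_uplink_mbps(peak_bitrates_mbps, headroom=1.5):
    """Uplink to provision: headroom times the sum of per-camera peak bitrates."""
    return headroom * sum(peak_bitrates_mbps)


def uplink_is_sufficient(link_mbps, peak_bitrates_mbps, headroom=1.5):
    """True when the link covers the fleet's peak load plus headroom."""
    return link_mbps >= required_uplink_mbps(peak_bitrates_mbps, headroom)


# 100 cameras averaging 0.5 Mbps but peaking at an assumed 2 Mbps each.
average_based = sum([0.5] * 100)
peak_based = required_uplink_mbps([2.0] * 100)
print(average_based, peak_based, uplink_is_sufficient(100.0, [2.0] * 100))
```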
KPIs — what to actually measure
Quality KPIs. Stream uptime per camera (target 99.5%+), end-to-end latency p95 (sub-second for ops, <5s for review), false-positive rate on anomaly alerts (under 5% by month 3), search recall on spoken-word queries (over 90% on clear audio).
Business KPIs. ARR per camera, gross margin per tenant, churn by vertical, expansion revenue from added features (transcription, ML upgrade tiers). VALT’s public number is $9.7M ARR across 2,500 cameras, roughly $3,900 per camera per year ($9.7M ÷ 2,500 = $3,880), which is the benchmark to beat.
Reliability KPIs. Mean time to detect a stuck stream (under 2 minutes); mean time to recover (under 10 minutes via auto-failover); audit-log completeness (100%, no gaps); RTO and RPO inside vendor SLA. Compliance auditors look at these first.
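Several of these KPIs are one-line computations once the raw samples exist. The sketch below shows p95 latency (using the nearest-rank method, one of several common percentile definitions), stream uptime, and ARR per camera. The function names are illustrative and the sample data is synthetic.

```python
def p95(samples):
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def uptime_pct(up_seconds, total_seconds):
    """Stream uptime as a percentage of the observation window."""
    return 100.0 * up_seconds / total_seconds


def arr_per_camera(arr_usd, cameras):
    """Annual recurring revenue divided by the camera count."""
    return arr_usd / cameras


latencies_ms = list(range(1, 101))  # synthetic samples: 1..100 ms
print(p95(latencies_ms))
print(uptime_pct(up_seconds=86_000, total_seconds=86_400))
print(arr_per_camera(9_700_000, 2_500))
```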
When NOT to build a custom VMS
We tell prospects to step back from a custom build when (a) the camera count is below 50 with no SaaS resale ambition; (b) the workflow has no vertical specialisation that COTS cannot handle; (c) the operations team is not ready to own a 24/7 streaming platform; or (d) the total budget is under $80K and cannot stretch beyond a hardened MVP.
In any of those cases, a Verkada or Eagle Eye Networks subscription combined with an integration project — we still help build the integration — gets you to value six months faster.
Want the build-vs-buy verdict in writing?
We’ll send a one-page recommendation after a short call — including a Wowza/Ant Media/Janus comparison and a realistic budget line for your camera count.
FAQ
What is the difference between a VMS and an intelligent video surveillance system?
A VMS (Video Management System) records, streams, and stores camera feeds. An intelligent video surveillance system layers AI on top — motion detection, object recognition, anomaly alerts, spoken-word search. VALT is a VMS with all four AI layers built in.
How long does it take to ship an MVP intelligent VMS with Fora Soft?
Three to four months for a smart-capture MVP (ONVIF ingest, motion recording, web playback, RBAC, audit log). Adding searchable evidence and ML analytics typically adds two to four months on top, depending on scope. Agent Engineering at Fora Soft compresses these timelines materially below the industry baseline of 13 months.
Can a smartphone really replace a fixed IP camera?
For pop-up sites, training rooms, and child-advocacy interviews — yes. VALT’s mobile SDK turns any iOS or Android device into an RTSP source feeding the same Wowza ingest. You lose PTZ and 24/7 mounting, but you gain deployable-in-minutes capture for less than $400 in hardware.
Why H.265 and ONVIF Profile T — not Profile S?
Profile S is the legacy ONVIF tier — H.264, weak authentication. Profile T is the modern tier — H.265 (50% storage savings), HTTPS, modern PTZ. Buyers running Profile S today should plan a refresh; firmware support is winding down across major camera vendors.
Should ML run on edge hardware or in the cloud?
Edge (Jetson Orin NX) when the camera fleet exceeds ~50 per site and bandwidth or privacy is the constraint. Cloud when the model fleet changes frequently, you want fewer moving parts, or per-customer model tuning is on the roadmap. We routinely ship hybrid — edge for hot-path detection, cloud for retraining and analytics.
Is VALT the only video-surveillance product Fora Soft has shipped?
No. We also built DSI Drones — aerial surveillance for land control — and contributed ML pipelines documented in our anomaly-detection guide. Around 40% of our active engineering work sits in video, streaming, and computer vision.
How is Fora Soft cheaper and faster than the industry baseline?
We use Agent Engineering — an AI-assisted internal delivery process that automates scaffolding, refactoring, and large parts of regression testing. The result is consistently shorter calendars and lower fully-loaded cost per feature. We will share concrete delivery examples on a scoping call.
What integrates with VALT-style systems out of the box?
Any ONVIF Profile S/T camera, any RTSP source, mobile devices via SDK, Amazon Transcribe and Whisper for ASR, OpenSearch/Elastic for full-text search, and standard IdP (SAML/OIDC) for SSO. Custom integrations such as Genetec, Milestone, and Avigilon are scoped per project.
What to Read Next
ML deep dive
Top 7 anomaly-detection models for video surveillance
Which detection architectures we benchmark when adding ML to a VMS — and which we ship to production.
Implementation
Integrating video analytics with surveillance systems
A step-by-step playbook for retrofitting AI analytics on top of an existing camera fleet.
Standards
ONVIF Profile M & object detection
Why Profile M is the standard glue between cameras and analytics — and how to use it correctly.
Best practices
Real-time video processing with AI — best practices
The patterns we use to keep latency, accuracy, and cost in equilibrium for live ML pipelines.
Ready to ship your own intelligent video surveillance system?
A modern intelligent video surveillance system has four jobs — capture cleanly, search instantly, understand automatically, and prove provenance. VALT shows the four jobs working together at $9.7M ARR; the architecture, the codec choices, and the ONVIF tier are the levers that decide whether your build hits the same numbers or burns out at the bandwidth bill.
If your camera count, your vertical, or your SaaS go-to-market puts you in the build column of the table above, the fastest next step is a 30-minute call with the team that already shipped a system serving 25,000 daily users. We will walk the architecture, the codec math, and the realistic cost in one session — and tell you honestly when buying instead would be cheaper.
Talk to the team behind VALT
Book a 30-minute call. We will scope your VMS — cameras, codecs, ML, compliance, calendar, and budget — in one session.


