This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.

Why this matters

If you are scoping an AI surveillance system, you have probably been told to choose edge or cloud — and the honest answer for almost every system larger than a few cameras is both, split deliberately. Get the split right and you hold millisecond reactions, a thin internet bill, and in-region privacy while still running cloud-scale forensic search; get it wrong and you either drown a business internet link streaming every camera up, or strand your heavy analytics on a camera chip that was never going to run them. This article gives you a plain-language model of the hybrid pattern — what runs where, what crosses the wire between the two, what happens when the link drops, and a decision path for drawing the line — so you can design a system that behaves well on its worst day, not just in the demo. It assumes you have met the three tiers already; if not, start with the comparison and come back.

The pattern in one picture

This article builds on a decision made in three earlier ones. The full tier-by-tier comparison — on-camera, edge-server, and cloud, and why the placement choice drives the whole system — lives in edge vs cloud analytics. The camera tier is detailed in on-camera edge AI, the middle tier in edge servers and on-prem AI appliances, and the cloud tier's cost in cloud video analytics cost. The conclusion all three reach is the same, and it is the starting point here: the wins are scattered across the tiers, so most serious systems combine them. This article is about how to combine them well.

Start with the two words. The edge is the computing that happens where the video is born — inside a smart camera, or on a server sitting on the same local network as the cameras. The cloud is a rented data center far away, reached over the public internet. The hybrid pattern simply refuses to pick one. It assigns each job to the tier built for it, and connects the two with a deliberately thin pipe.

Here is the shape almost every well-built system takes. The camera or a local box runs detection in real time and reacts on the spot — a line-cross fires an alert in tens of milliseconds without phoning anyone. The continuous recording stays local, on a recorder or the edge server's own disk, where storing video is cheap and fast. And the cloud receives only the distilled output — the metadata events, the short clips around each detection, and the occasional hard frame — which it uses for the work only a data center can do: searching months of footage, tracking a person across forty cameras, running a large model the camera could never host, and managing the whole fleet from one console. The heavy video never makes the trip; the meaning does.

[Figure: data-flow diagram of the hybrid pattern. Cameras and an edge box keep detection and continuous recording local, while only metadata, event clips, and hard frames cross the internet to a cloud doing search, cross-camera analysis, and fleet management.]

Figure 1. The hybrid split in one picture. Detection and continuous recording stay inside the building, where compute is fast and storage is cheap; only a thin stream of metadata, event clips, and selected hard frames crosses the internet to the cloud, which does fleet-wide search, cross-camera reasoning, and management. Watch what crosses the WAN line: that thin pipe is the whole point.

Who does what: the division of labor

The hybrid pattern is, at heart, a list of which tier owns which job. The split is not arbitrary; it follows from what each tier is physically good at. The edge owns anything that must be fast, private, or always-on; the cloud owns anything that must be heavy, fleet-wide, or elastic. Here is the division most systems settle into.

Job | Runs at the edge (camera or local box) | Runs in the cloud | Why it sits there
Real-time detection (person, vehicle, line-cross) | ✓ | - | Needs tens-of-ms reaction; no network round-trip
Continuous recording / retention | ✓ | - | Storing video locally is cheap; uplink can't carry it
First-pass filtering / motion gating | ✓ | - | Cuts the stream before anything crosses the wire
Face / plate blurring before upload | ✓ | - | Privacy by construction; minimizes what leaves
Short-term ring buffer (24–72 h) | ✓ | - | Survives an internet outage; instant local playback
Cross-camera tracking / re-identification | - | ✓ | Needs to reason over many cameras at once
Forensic search over months of footage | - | ✓ | Data-center compute and index, run occasionally
Large vision-language scene description | - | ✓ | Model far too heavy for a camera chip
Fleet management, model updates, dashboards | - | ✓ | One console over every site; update in one place
Long-term cold storage of alert clips | - | ✓ | Cheap per-gigabyte archive, kept off-site

Table 1. The hybrid division of labor. The edge keeps the fast, private, always-on jobs and the heavy raw video; the cloud takes the heavy, fleet-wide, occasional jobs and only the distilled output. The line between the two columns is the design decision this whole article is about.

Notice the pattern in the table. Everything in the edge column is either time-critical (a reaction that cannot wait for a round-trip), bandwidth-heavy (the raw video itself), or privacy-sensitive (recognizable faces). Everything in the cloud column is either compute-heavy (a model or a search no camera could run), fleet-wide (reasoning that spans cameras and sites), or occasional (work you do now and then, where the cloud's pay-as-you-go billing is a gift rather than a penalty). When you are unsure where a new job belongs, ask which of those six words describes it, and the column picks itself.

[Figure: two-column diagram showing edge jobs (real-time detection, recording, filtering, blurring, ring buffer) on the left and cloud jobs (cross-camera tracking, forensic search, large models, fleet management, cold storage) on the right, with the dividing principle labeled.]

Figure 2. The split as two columns. The edge takes anything time-critical, bandwidth-heavy, or privacy-sensitive; the cloud takes anything compute-heavy, fleet-wide, or occasional. The same six words decide where any new analytic belongs.

The thin pipe: what actually crosses the wire

The hybrid pattern lives or dies on keeping the connection between the two tiers thin. If you send full video up, you have not built a hybrid system — you have built a cloud system with extra steps, and you pay the cloud's full bandwidth and compute bill. So it is worth being precise about exactly what crosses, because each piece is small for a reason.

Three kinds of data make the trip, and all three are tiny next to raw video. The first is metadata — the structured result of a detection: an object type, a bounding-box location, a timestamp, a confidence score. A detection event is a few hundred bytes to a few kilobytes, not the one-to-four megabits per second of the video stream itself. The second is event clips — a short slice of actual video, perhaps ten seconds around a detection, sent up so an operator or a heavier model can look at the moment that mattered. The third is embeddings and hard frames — a few kilobytes of mathematical "fingerprint" per detection that lets the cloud search and match faces or vehicles across cameras, plus the occasional frame the edge model was unsure about, forwarded for a second opinion.
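To make the size gap concrete, here is a minimal sketch of one detection event serialized as JSON and compared with one second of a 2 Mbps video stream. The field names, the `gate-03` camera ID, and the JSON encoding are illustrative assumptions, not a format any particular vendor or standard prescribes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DetectionEvent:
    # Hypothetical field names, chosen to mirror the four items in the text.
    camera_id: str
    object_type: str
    bbox: tuple          # (x, y, width, height), normalised to 0..1
    timestamp: str       # ISO 8601, UTC
    confidence: float


event = DetectionEvent(
    camera_id="gate-03",
    object_type="person",
    bbox=(0.42, 0.31, 0.08, 0.22),
    timestamp="2025-01-01T03:14:07Z",
    confidence=0.91,
)

# Serialize the event the way it might cross the wire.
payload = json.dumps(asdict(event)).encode("utf-8")
payload_bytes = len(payload)

# One second of a 2 Mbps video stream, in bytes, for comparison.
video_bytes_per_second = 2_000_000 // 8

print(f"event: {payload_bytes} bytes; one second of video: {video_bytes_per_second} bytes")
```

An event like this stays well under a kilobyte, while a single second of the video it describes is a quarter of a megabyte.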

Walk the bandwidth math and the size of the win is obvious. A site sending the kind of thin telemetry that leading cloud-surveillance vendors use sits well under 100 kilobits per second per camera even during busy periods. Verkada, for instance, sends encrypted thumbnails and metadata roughly every twenty seconds at no more than about 20 kilobits per second per camera, letting over a hundred cameras share a single 2-megabit link. Compare that with the same cameras streaming full video for cloud analysis:

50 cameras × 2 Mbps (full video) = 100 Mbps of sustained upload
50 cameras × ~0.1 Mbps (hybrid telemetry) = ~5 Mbps, mostly in bursts

That is a 95% cut in the internet upload, for the same cameras and the same detections — purely by changing where the looking happens and what gets sent. The full continuous video still exists; it simply stays on the local recorder, where a gigabit network carries it for free. The arithmetic of how much that local recording costs to store, and for how long, is worked in full in the surveillance storage and retention math; the point here is that the bandwidth bill and the storage bill go to two different places, and the hybrid pattern is what lets you send each to the cheaper one.
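The same arithmetic as a small sketch you can rerun with your own camera count and bitrates. The function name is ours, and the 100 kbps telemetry figure is the article's rough ceiling rather than a measured value.

```python
def uplink_kbps(cameras: int, per_camera_kbps: int) -> int:
    """Sustained internet upload for a site, in kilobits per second."""
    return cameras * per_camera_kbps


full_video = uplink_kbps(cameras=50, per_camera_kbps=2000)  # every stream goes up
hybrid = uplink_kbps(cameras=50, per_camera_kbps=100)       # telemetry and clips only

reduction = 1 - hybrid / full_video

print(f"full video: {full_video / 1000:.0f} Mbps")
print(f"hybrid:     {hybrid / 1000:.0f} Mbps")
print(f"reduction:  {reduction:.0%}")
```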

The triage cascade: let the edge decide what the cloud sees

The single most important idea in hybrid analytics has a name in the research literature: the cascade, or triage. It is the reason a hybrid system's cloud compute bill is small, not just its bandwidth bill, and it works like a hospital emergency room. A quick, cheap first look sorts every arrival; only the cases that need a specialist get sent to one.

In a surveillance system, the quick first look is a light model running on the camera or edge box — a small object detector that runs on the modest chip the edge carries and asks one cheap question of every frame: is anything here worth a closer look? On the overwhelming majority of frames the answer is no — an empty corridor, a parking lot at 3 a.m., a quiet perimeter — and those frames never leave the building. Only when the light model sees something, or is unsure what it sees, does it escalate: the frame or clip is sent up to a heavy model in the cloud — a larger, more accurate detector, a cross-camera re-identification model, or a vision-language model that can describe a complex scene. The light model is a filter; the heavy model is the specialist it calls only when needed.
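The routing decision can be sketched as a three-way split on the light detector's confidence score. The two thresholds and the route names below are hypothetical; a real deployment would tune them per scene and per model.

```python
from enum import Enum


class Route(Enum):
    DROP = "drop"                        # nothing of interest; frame never leaves the site
    SEND_EVENT = "send_event"            # confident detection: metadata and clip go up
    SEND_HARD_FRAME = "send_hard_frame"  # unsure: forward the frame for a second opinion


# Hypothetical thresholds, for illustration only.
NOTHING_BELOW = 0.20
CONFIDENT_FROM = 0.80


def triage(confidence: float) -> Route:
    """Route one frame based on the light detector's top confidence score."""
    if confidence < NOTHING_BELOW:
        return Route.DROP
    if confidence >= CONFIDENT_FROM:
        return Route.SEND_EVENT
    return Route.SEND_HARD_FRAME


# Made-up scores for six frames: mostly empty scene, one clear hit, one doubtful.
scores = [0.02, 0.05, 0.91, 0.10, 0.47, 0.03]
routes = [triage(s) for s in scores]
escalated = [r for r in routes if r is not Route.DROP]

print(f"{len(escalated)} of {len(scores)} frames leave the building")
```

The design choice worth noticing is the middle band: a frame the light model is unsure about is treated as worth a second opinion rather than silently dropped.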

The economics of this are dramatic and they compound with the bandwidth win. Suppose the light edge detector finds something worth escalating on 5% of frames — a generous figure for most scenes. Then the expensive cloud model runs on 5% of the volume it would have processed if you streamed everything up, and the cloud compute bill — the per-minute meter or the rented-GPU hours detailed in cloud video analytics cost — falls by roughly the same 95%. You are no longer paying a data center to stare at empty corridors. The cascade is why the hybrid pattern is cheaper than cloud on both axes at once: the thin pipe cuts bandwidth, and the edge filter cuts compute.
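A quick sketch of what that 5% means in frames per day. The 15 frames-per-second rate is an assumption for illustration; the 50-camera count and 5% escalation rate are the article's own figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cloud_frames_per_day(cameras: int, fps: int, escalation_percent: int) -> int:
    """Frames the heavy cloud model must process each day."""
    total = cameras * fps * SECONDS_PER_DAY
    return total * escalation_percent // 100


# Assumed: 15 fps per camera. Stream everything versus escalate 5%.
stream_everything = cloud_frames_per_day(cameras=50, fps=15, escalation_percent=100)
with_cascade = cloud_frames_per_day(cameras=50, fps=15, escalation_percent=5)

print(f"stream everything: {stream_everything:,} frames/day")
print(f"with cascade:      {with_cascade:,} frames/day")
```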

[Figure: funnel diagram of the triage cascade. A light edge detector sees every frame, passes about 95 percent as nothing, and escalates only the few percent of events or low-confidence frames up to a heavy cloud model.]

Figure 3. The triage cascade. A cheap edge detector looks at every frame and lets the empty ones go; only events and low-confidence frames, a few percent, climb to the heavy cloud model. The filter cuts the cloud's bandwidth and compute bills at the same time.

One boundary to keep straight, because it is the line this whole section of Learn respects. How those models are built, trained, distilled to fit a camera, and tuned for accuracy is the model-engineering layer, covered in our AI for Video Engineering section under real-time edge vs cloud AI deployment. This article owns where each model runs and what passes between them — the deployment pattern — not the model internals. The cascade is an architecture; the detectors inside it are engineered elsewhere.

Keep recording local: the store-and-forward seam

Here is the question that separates a hybrid design that survives contact with reality from one that looks good on a slide: what happens when the internet connection drops? A pure-cloud system goes blind — no uplink, no recording, no analytics. A hybrid system barely notices, and the reason is the most important engineering detail in the pattern: the edge keeps working on its own, and catches the cloud up later. This is called store-and-forward, and it is why "keep recording local" is the load-bearing wall of the whole design.

It rests on two local capabilities. The first is the ring buffer: a fixed block of local storage, typically holding the last 24 to 72 hours, that overwrites its oldest footage to make room for new (the default behavior of every mainstream recorder). Because recording is local, an internet outage does not interrupt it for a second; the cameras keep filling the buffer regardless of the link. The second is buffered upload: when the connection to the cloud is down, the edge holds the metadata and event clips it would have sent and forwards them in order when the link returns, with nothing lost as long as the outage is shorter than the local storage can hold.
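Both capabilities fit in a few lines. The sketch below is a toy model under stated assumptions: `FakeLink` stands in for the real uplink, video segments are just strings, and the pending queue is unbounded, which a production system would cap and persist to disk.

```python
from collections import deque


class FakeLink:
    """Stand-in for the cloud uplink; `up` is toggled to simulate an outage."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, event) -> bool:
        if not self.up:
            return False
        self.received.append(event)
        return True


class StoreAndForward:
    def __init__(self, link, ring_segments: int):
        self.link = link
        # Ring buffer: once full, appending drops the oldest segment.
        self.ring = deque(maxlen=ring_segments)
        # Cloud-bound events waiting for the link, oldest first.
        self.pending = deque()

    def record(self, segment):
        """Local recording never depends on the link."""
        self.ring.append(segment)

    def publish(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        """Send queued events in order; stop at the first failure."""
        while self.pending and self.link.send(self.pending[0]):
            self.pending.popleft()


link = FakeLink()
edge = StoreAndForward(link, ring_segments=3)

edge.publish("event-1")           # link is up: delivered immediately
link.up = False                   # outage begins
for n in range(1, 6):
    edge.record(f"segment-{n}")   # recording continues; ring keeps the last 3
edge.publish("event-2")           # queued locally
edge.publish("event-3")           # queued locally
link.up = True                    # link returns
edge.flush()                      # backlog forwards in order

print(link.received)
print(list(edge.ring))
```

An event is removed from the queue only after the send succeeds, which is what preserves ordering and avoids dropping an event when the link fails mid-flush.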

This is not a theoretical nicety; it is exactly how the real "cloud" systems are built. An Eagle Eye Networks bridge — an on-premise box between the cameras and the cloud — records video to local storage first, specifically to buffer it and back up the latest files in case the internet connection fails, then handles the encryption, deduplication, bandwidth management, and intelligent upload. The AWS IoT Greengrass edge runtime, used to run analytics on local hardware, is built to "operate with intermittent connectivity": it runs its processing locally and caches the messages destined for the cloud until connectivity is restored, then synchronizes. The pattern is industry-standard precisely because a surveillance system that loses footage during an outage has failed at its one job.

[Figure: timeline diagram of store-and-forward. While the link is up, edge events stream to the cloud; when the link drops, the ring buffer keeps recording and events queue locally; when the link returns, the queue forwards in order with no footage lost.]

Figure 4. Store-and-forward across an outage. The local ring buffer keeps recording through the whole event; cloud-bound metadata and clips queue locally while the link is down and forward in order when it returns. The system never stops recording and never loses an event, the property a pure-cloud design cannot offer.

The design lesson is to size the local buffer for your worst realistic outage, not your average one. If a site's internet can be down for a day after a storm, a 24-hour ring buffer is cutting it close and a 72-hour one is prudent; if the link is a flaky cellular connection at a remote perimeter gate, local storage is not a backup but the primary, with the cloud as an eventual archive. Edge recording earns its keep most at exactly the sites where the link is worst — remote gates, detached buildings, anything on cellular — which is the opposite of where a cloud-first instinct would invest.
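Sizing the buffer is simple arithmetic. This sketch assumes a constant bitrate per camera and decimal terabytes, and uses the article's 50-camera, 2 Mbps figures; real streams vary with scene activity, so treat the result as a starting estimate.

```python
def ring_buffer_terabytes(cameras: int, mbps_per_camera: float, hours: float) -> float:
    """Local storage needed to hold `hours` of continuous recording."""
    bits = cameras * mbps_per_camera * 1_000_000 * hours * 3600
    return bits / 8 / 1e12  # bits -> bytes -> decimal terabytes


for hours in (24, 72):
    size = ring_buffer_terabytes(cameras=50, mbps_per_camera=2, hours=hours)
    print(f"{hours} h: {size:.2f} TB")
```

For this site, the jump from a 24-hour to a 72-hour buffer is roughly 1 TB to a little over 3 TB of local disk.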

A worked example: 50 cameras, the split

Numbers make the pattern concrete. Take a realistic site — 50 cameras, each a 4-megapixel unit at about 2 megabits per second in the efficient H.265 codec — and route it three ways. (We reuse the same 50-camera, 2-Mbps baseline as the cloud-cost article so the figures line up across the block.)

Pure cloud, for reference. Every camera streams full video up, continuously, and the cloud analyzes all of it:

50 cameras × 2 Mbps = 100 Mbps of sustained upload, 24 hours a day

plus a cloud compute bill that runs every minute of every camera — the meter that, on a per-minute managed API, reaches into the thousands of dollars per camera per month (see cloud video analytics cost). The 100 Mbps alone can saturate a business internet link before a single detection is paid for.

The hybrid split. Now run detection at the edge, keep continuous recording on a local recorder, and send the cloud only telemetry and event clips. The continuous video — the heavy part — never touches the internet:

Recording: 50 × 2 Mbps stays on the local network = 0 Mbps to the internet
Telemetry + clips: 50 × ~0.1 Mbps = ~5 Mbps to the cloud, in bursts

The internet upload drops from 100 Mbps to about 5 Mbps — a 95% cut — and the cloud now sees only the few percent of footage the edge flagged, so its compute bill falls by roughly the same proportion via the cascade. The cameras still record around the clock; the operations team still gets cloud search and cross-camera tracking; the heavy lifting that needs a data center still happens in one. What changed is that the raw video stopped commuting.

The point of the example is not the exact figures, which move with your cameras, scenes, and retention. It is the shape: the hybrid pattern routes the heavy, continuous, cheap-to-store work to the local network and the light, occasional, expensive-to-compute work to the cloud, and so pays the low price on both. The full cross-tier cost model — every line item per tier, the break-even points, and how retention and resolution multiply them — is the subject of the economics of analytics; here the lesson is just that the split is what makes the arithmetic kind.

Where to draw the line: a decision path

The hybrid pattern is a spectrum, not a single recipe — the line between edge and cloud sits in a different place for a perimeter gate than for a retail back-office. Drawing it is a short series of questions, taken in order, hardest constraints first.

The first gate is timing. Does the job drive an immediate action — a deterrent light, a siren, a safety stop, an instant operator alert? If a reaction in tens of milliseconds matters, that job runs at the edge, full stop, because a round-trip to a data center adds hundreds of milliseconds the moment cannot spare. If the job is reporting, search, or after-the-fact insight, the cloud's delay is harmless and its power is welcome.

The second gate is privacy and residency. Does the job touch recognizable faces, license plates, or other biometric data, or does a rule forbid that footage leaving the building or the country? If so, the recognizable video stays at the edge, and only de-identified metadata or blurred clips cross the wire — a posture we will see in a moment is also what the law leans toward. The third question is weight: a light, stable detector fits the camera or edge box; a model too heavy for that hardware, or one that must reason across many cameras at once, belongs in the cloud. The fourth is frequency: continuous work is cheapest at the edge, where you pay once for hardware; occasional or bursty work is cheapest in the cloud, where you pay only when you use it.
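The four questions, taken in order, can be sketched as a short function. The parameter names and return strings are ours, and a real design review would weigh more than four booleans; the point is only that the hard constraints are checked first.

```python
def place_job(
    *,
    drives_immediate_action: bool,
    biometric_or_residency_bound: bool,
    heavy_or_cross_camera: bool,
    continuous: bool,
) -> str:
    """Return the tier a job lands on, hardest constraints first."""
    # Gate 1: timing. A reaction in tens of milliseconds cannot wait on a round-trip.
    if drives_immediate_action:
        return "edge"
    # Gate 2: privacy and residency. Recognizable video stays local.
    if biometric_or_residency_bound:
        return "edge (only de-identified output crosses the wire)"
    # Question 3: weight. Too heavy for edge hardware, or spans many cameras.
    if heavy_or_cross_camera:
        return "cloud"
    # Question 4: frequency. Continuous work is cheapest on owned hardware.
    return "edge" if continuous else "cloud"


perimeter_alert = place_job(
    drives_immediate_action=True,
    biometric_or_residency_bound=False,
    heavy_or_cross_camera=False,
    continuous=True,
)
forensic_search = place_job(
    drives_immediate_action=False,
    biometric_or_residency_bound=False,
    heavy_or_cross_camera=True,
    continuous=False,
)

print(perimeter_alert)
print(forensic_search)
```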

[Figure: decision tree for drawing the edge-cloud line. Timing-critical jobs go to the edge, biometric or residency-bound video stays local, heavy or cross-camera models go to the cloud, and continuous work stays at the edge while occasional work goes to the cloud.]

Figure 5. Where to draw the line. Resolve timing and privacy first, since they can pin a job to the edge regardless of anything else, then weigh model weight and frequency. Most real systems exit this tree with a split, not a single tier.

Walk that path for a real campus and the line draws itself differently per camera. The perimeter cameras need edge speed and an on-site ring buffer; the lobby's face-matching, if it is lawful at all, stays on-premises for residency; the analytics team's monthly forensic search and cross-building person-tracking run in the cloud over the thin stream of clips and embeddings the edge sent up. No camera streams its full video to a data center, and nothing time-critical waits on the internet. That is the hybrid pattern working as designed.

The standard that keeps the split vendor-neutral

A fair worry about splitting work across tiers is lock-in: if the camera's detection, the edge box, and the cloud all have to speak one vendor's private format, the "hybrid" system is really a single-vendor system wearing a costume. The standards layer is what prevents that, and this section of Learn owns it. ONVIF is the common language that lets cameras and software from different makers understand each other, and one ONVIF profile is built for exactly the data the hybrid pattern moves. ONVIF Profile M standardizes the analytics metadata and events that detections produce, and — the detail that matters here — a Profile M conformant consumer of that metadata can be a camera, a server, or a cloud service, not only a device on the local network (ONVIF, Profile M Specification). In plain terms, the standard was designed so the same metadata interface works whether the detection happened on the camera, on the edge server, or in the cloud, which is precisely the boundary the hybrid pattern crosses.

The usual ONVIF caution applies, and it is worth repeating because it is widely misunderstood: conformance guarantees a baseline, not every feature. Two products that share Profile M will reliably exchange standard metadata; a vendor's special analytic or proprietary attribute may still need that maker's own software kit. Treat the profile as the floor both sides stand on, not the ceiling. The full standards treatment is in events, metadata, and the ONVIF analytics interface, and the commercial overview is in our blog on ONVIF profiles in security systems. A hybrid system built on Profile M can mix an Axis camera, a third-party edge box, and a cloud analytics service, and the detections still arrive in a shape every part understands.

Privacy by the split: minimize what leaves

The hybrid pattern has a quiet legal advantage that is worth making explicit, because it turns a compliance burden into a side effect of a good architecture. The principle is data minimisation, and it is not optional advice — it is written into the EU's General Data Protection Regulation (GDPR, Regulation (EU) 2016/679), which requires that personal data be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed" (GDPR Art. 5(1)(c)). Video of identifiable people is personal data; sending an entire fleet's continuous footage to a third-party cloud when a stream of metadata and a few clips would do is, almost by definition, sending more than is necessary. The hybrid pattern — keep the recognizable video local, send the cloud only the distilled output — is data minimisation expressed as an architecture.

The advantage sharpens when the video crosses a border or touches biometrics. Under GDPR Chapter V (Arts. 44–50), transferring personal data outside the European Economic Area needs a specific legal mechanism; keeping the recognizable footage in the building sidesteps the question for that footage entirely, because what crosses the border is de-identified metadata, not faces. And for biometric identification specifically, the legal gate is high: the EU AI Act (Regulation (EU) 2024/1689) prohibits real-time remote biometric identification in publicly accessible spaces for law-enforcement purposes, with narrow exceptions, and classes after-the-fact biometric identification as high-risk (its high-risk obligations apply from 2 August 2026), while in Illinois the Biometric Information Privacy Act (BIPA, 740 ILCS 14) restricts capturing faceprints and, unusually, lets individuals sue directly with statutory damages. None of these rules is escaped by sending the work to a cloud; the safest design keeps biometric processing on hardware you control and minimizes what leaves. The deeper treatments are in GDPR for video surveillance and BIPA and US biometric privacy law. This is engineering guidance, not legal advice; confirm specifics with qualified counsel.

A common mistake to avoid

The costliest hybrid mistake is the fake hybrid: a system marketed as edge-plus-cloud that quietly streams full video up anyway — for "cloud recording", for "backup", or because the integration was easier than building a real split — and so pays the cloud's full bandwidth and compute bill while claiming the edge's economy. The tell is the uplink: if a 50-camera site is pushing tens of megabits per second continuously, the video is going up whole, and you have a cloud system, not a hybrid one. The fix is to be ruthless about the thin pipe — metadata and event clips cross the wire, continuous video stays on the local recorder — and to verify it by measuring the sustained upload, not by trusting the label. The companion mistake is the opposite over-correction: forcing everything onto the edge, then being unable to run the cross-camera search the operations team actually needs because a wall-mounted chip was never going to host it. The discipline is the decision path above: split by timing, privacy, weight, and frequency, and let each job land where it is cheapest and fastest.
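The uplink check can be sketched as a one-line test on a measured figure. The 100 kbps per-camera ceiling is taken from the thin-pipe discussion earlier and is a rule of thumb, not a standard; the function name is ours, and the sustained upload is whatever your router or firewall reports over a representative window.

```python
def looks_like_full_video(
    sustained_upload_mbps: float,
    cameras: int,
    thin_pipe_kbps_per_camera: float = 100.0,  # rule-of-thumb ceiling, an assumption
) -> bool:
    """Flag a site whose per-camera sustained upload exceeds a thin-pipe budget."""
    per_camera_kbps = sustained_upload_mbps * 1000 / cameras
    return per_camera_kbps > thin_pipe_kbps_per_camera


# A 50-camera site pushing 40 Mbps continuously: video is going up whole.
suspect = looks_like_full_video(sustained_upload_mbps=40, cameras=50)

# The same site at 5 Mbps: consistent with metadata and event clips.
healthy = looks_like_full_video(sustained_upload_mbps=5, cameras=50)

print(suspect, healthy)
```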

Where Fora Soft fits in

Fora Soft has built real-time video, streaming, and computer-vision software since 2005, across 625+ shipped projects, and the edge-cloud split is the architecture we reach for most in surveillance work, because off-the-shelf platforms force their own line and it rarely matches a real site. Teams come to us when a "cloud" product is saturating a site's uplink and the bill has a comma in it, when biometric analytics must stay on-premises to satisfy a residency rule a cloud product cannot meet, or when a multi-site fleet needs edge detection feeding cloud search without streaming every feed up. We build the custom pipeline — a light detector and continuous recording at the edge, ONVIF Profile M metadata and event clips into the VMS, store-and-forward that survives an outage, and only the few percent of footage that matters sent to a cloud model — and the framing we lead with is always how the system behaves under real load first: the latency you can hold, the upload you actually consume, and the realistic precision and recall in your lighting, never a demo's perfect number. A split that survives the worst day beats a tidy diagram that does not.

References

  1. ONVIF — "Profile M — Metadata and events for analytics applications" (standardizes analytics metadata and events; a conformant consumer can be an edge device, a server, or a cloud service, and a client can be a VMS, NVR, or cloud service — the basis for a vendor-neutral hybrid split where the same metadata interface works across edge and cloud. Profile M Specification v1.1, 2024). Primary standard (tier 1). https://www.onvif.org/profiles/profile-m/
  2. European Union — "GDPR, Regulation (EU) 2016/679, Art. 5(1)(c) (data minimisation) and Art. 5(1)(e) (storage limitation)" (personal data must be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed", and kept no longer than necessary — the principle that makes sending only metadata and clips, not full video, the compliant default). Primary law (tier 1). https://eur-lex.europa.eu/eli/reg/2016/679/oj
  3. European Union — "GDPR, Regulation (EU) 2016/679, Chapter V (Arts. 44–50)" (restricts transfers of personal data outside the EEA absent a legal mechanism; keeping recognizable footage in-region means only de-identified metadata crosses the border). Primary law (tier 1). https://eur-lex.europa.eu/eli/reg/2016/679/oj
  4. European Union — "Artificial Intelligence Act, Regulation (EU) 2024/1689, Art. 5 and Annex III" (real-time remote biometric identification in public spaces prohibited with narrow exceptions; post-hoc biometric identification high-risk; high-risk obligations apply from 2 August 2026 — the biometric gate a hybrid design clears by keeping biometric processing local). Primary law (tier 1). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  5. Illinois General Assembly — "Biometric Information Privacy Act (BIPA), 740 ILCS 14" (restricts collection of biometric identifiers such as faceprints; provides a private right of action with statutory damages — the US biometric gate that motivates keeping face/plate processing at the edge). Primary law (tier 1). https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004
  6. AWS — "AWS IoT Greengrass FAQs and developer guide" (the edge runtime runs local processing and ML inference, operates with intermittent connectivity, and buffers/spools messages destined for the cloud while offline — synchronizing when the link returns; the first-party basis for the store-and-forward seam). First-party engineering (tier 3). https://aws.amazon.com/greengrass/faqs/
  7. NVIDIA — "DeepStream SDK / Metropolis — multi-stream video analytics" (a mainstream inference GPU runs roughly 16–40 simultaneous 1080p streams with a light detector at full frame rate; the basis for sizing the edge-server side of the split and the cascade's edge filter). First-party engineering (tier 3). https://developer.nvidia.com/deepstream-sdk
  8. Eagle Eye Networks — "Bridges and the Cloud VMS architecture" (the on-premise bridge records video to local storage first to buffer it and back up the latest files in case the internet fails, then handles encryption, deduplication, bandwidth management, and intelligent upload — real-world store-and-forward evidence that leading cloud systems are hybrid). Vendor engineering (tier 4). https://www.een.com/hardware/bridges/
  9. Verkada — "Reducing bandwidth consumption of a cloud camera to 20 kbps" (edge-analytic cameras send encrypted thumbnails and metadata roughly every 20 seconds at no more than ~20 kbps per camera, letting 100+ cameras share a ~2 Mbps link — the thin-pipe figure behind the <100 kbps/camera hybrid uplink). Vendor engineering (tier 4). https://www.verkada.com/blog/reducing-bandwidth-consumption-cloud-camera/
  10. arXiv — "Edge Video Analytics: A Survey on Applications, Systems and Enabling Techniques" and "Croesus: Multi-Stage Processing and Transactions for Video-Analytics in Edge-Cloud Systems" (the cascade/triage pattern: a light model on the edge filters frames and escalates only low-confidence or event frames to a heavy model in the cloud — the formal basis for the article's triage-cascade economics). Academic / educational (tier 6). https://arxiv.org/pdf/2211.15751
  11. asmag.com — "Hybrid cloud-edge deployments: a resource guide for security integrators in 2026" / Security Info Watch — "Video analytics: edge vs. cloud vs. on-prem" (industry reporting that hybrid edge-plus-cloud is the dominant 2026 deployment model for serious surveillance products — market-reality orientation, not a primary citation). Institutional/analyst (tier 5). https://www.asmag.com/showpost/35496.aspx