Custom intercom software architecture with video streaming, authentication, and visitor management

Key takeaways

The video intercom market is ~$3B in 2026 and growing 10–16% per year. Cloud-native and SIP/WebRTC architectures are eating proprietary systems — interoperability is now the buying criterion.

Cloud TCO beats on-prem by 25–35% over 5 years. But cloud-only breaks when the internet does — a hybrid that keeps local SIP fallback is the production default.

Face recognition is table stakes and a compliance landmine. 99.5%+ accuracy is achievable; GDPR Art. 22 + the EU AI Act (enforcement from Aug 2026) mean you ship a human-in-the-loop flow or you don’t ship in Europe.

Mobile push latency is the #1 UX complaint. Sub-2-second answer on residents’ phones is the bar; miss it and residents stop answering the door.

Fora Soft ships cloud video products for a living. Real-time video, WebRTC since 2012, 625+ projects shipped, and a track record on surveillance products like Netcam Studio that map directly onto modern video intercom architectures.

Why Fora Soft wrote this playbook

Fora Soft has shipped video-first software for twenty years. We’ve built Netcam Studio, a video surveillance product that traces its roots to WebcamXP — one of the earliest IP-camera apps on the market. We build WebRTC and SIP infrastructure for clients every week (see our WebRTC services and realtime voice integration guide). And modern video intercoms are where WebRTC, SIP, IP cameras, and access control all meet.

This article is the playbook we use when a PropTech founder, a building integrator, or a security-platform CTO asks: “What does a modern video intercom product look like, what does it cost to build, and where do teams get it wrong?” You’ll see the reference architecture, the protocol picks, a sanity check on AI features, compliance landmines, a cost model, and a five-question decision framework. Jump to “When NOT to build your own video intercom product” if you just need the buy-vs-build answer.

Related reading: IoT + video surveillance architecture, industrial video surveillance AI.

Planning a video intercom SaaS or smart-building integration?

Tell us about your deployment (residential, commercial, multi-tenant) and we’ll sketch architecture and cost in a 30-minute call.

What changed: from buzzer to cloud

Three shifts drove the last five years of video intercom evolution. First, PoE and IP cameras made the door station a general-purpose computer. Second, WebRTC and SIP trunking let calls hit any phone anywhere — not just the wired monitor inside the apartment. Third, edge AI — face recognition, package detection — dropped in price to the point where it’s the default feature-set residents expect.

The result: a modern video intercom is a video product, an access-control product, an IoT device, and a compliance artifact, all in one front panel. Building one as a “doorbell with a camera” is how projects die.

The 2026 market snapshot

| Metric | 2025 | 2026 (projection) | Source / note |
|---|---|---|---|
| Global video intercom market | $2.8B | $3.0–3.1B | 10–16% CAGR (Mordor, R&M) |
| IP intercom share | $2.35B | $2.55B+ | Wise Guy, IP segment 7.8% CAGR |
| Cloud-native deployments | ~45% | 55%+ | Swiftlane & IPVM trend data |
| Face recognition adoption | ~30% | 45%+ | Premium-segment standard; 99.5%+ precision |
| Typical call-answer latency target | < 2 s | < 1.5 s | WebRTC glass-to-glass, mobile app |

Reference architecture for a modern video intercom

Five logical tiers, with clean seams between them. If your build blurs two tiers you’ll pay for it at integration time.

| Tier | Responsibility | Typical tech |
|---|---|---|
| 1. Door station / front panel | Capture video + audio, authenticate, drive lock | IP cam (720p–2K), mic array, speaker, RFID/NFC, PoE+ power, SIP client, edge AI |
| 2. In-unit monitor (optional) | Wired answer point, guaranteed response | 7–10” Android/Linux panel, PoE, local SIP fallback |
| 3. Cloud backend | Signaling, directory, recording, analytics | SIP proxy, WebRTC SFU, REST/GraphQL API, S3/Wasabi, RBAC |
| 4. Mobile & web apps | Residents answer, guests, admin dashboard | iOS / Android / React web, WebRTC, PushKit / VoIP Push, FCM |
| 5. Integration adapters | Access control, PMS, smart home, surveillance | OSDP, Wiegand, Matter/Thread, HomeKit, Alexa, Yardi/AppFolio/Entrata, ONVIF |

Reach for a cloud-native architecture when: you’re a SaaS, an ISV, or a property-manager-facing product. Reach for a hybrid with a local NVR when you’re selling into buildings with strict uptime or privacy requirements (luxury residential, government, regulated healthcare).

Transports: SIP, WebRTC, and when to use each

Two call-delivery paths are in common use. Most serious products use both, for different endpoints.

| Dimension | SIP | WebRTC |
|---|---|---|
| Best for | Door station, in-unit monitor, landline phones | Mobile app, web dashboard, property-manager console |
| Latency | 300–800 ms | 150–500 ms |
| NAT traversal | STUN / TURN + SIP ALG headaches | Built-in ICE + TURN |
| Codec | G.711 / G.729 / Opus; H.264 | Opus + VP8/VP9/H.264 |
| When to bridge | Landlines, legacy PBX, some door stations | Anything browser-native |

Common pattern: door stations and in-unit monitors talk SIP to your backend; mobile and web clients talk WebRTC; the backend bridges. That gets you the reliability of SIP on the door and the latency of WebRTC on the phone.
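The bridging pattern above can be sketched as a simple routing rule. This is a minimal illustration, not a real SIP or WebRTC stack: the endpoint names and both helper functions are hypothetical, and it assumes the backend treats any call whose two legs use different transports as a bridged call.

```python
from enum import Enum


class Endpoint(Enum):
    DOOR_STATION = "door_station"
    IN_UNIT_MONITOR = "in_unit_monitor"
    LANDLINE = "landline"
    MOBILE_APP = "mobile_app"
    WEB_DASHBOARD = "web_dashboard"


# Endpoints that speak SIP in the pattern described above (assumption for this sketch).
SIP_ENDPOINTS = {Endpoint.DOOR_STATION, Endpoint.IN_UNIT_MONITOR, Endpoint.LANDLINE}


def transport_for(endpoint: Endpoint) -> str:
    """Return the transport the backend uses to reach this endpoint."""
    return "sip" if endpoint in SIP_ENDPOINTS else "webrtc"


def needs_bridge(caller: Endpoint, callee: Endpoint) -> bool:
    """A call needs the SIP/WebRTC bridge when its two legs use different transports."""
    return transport_for(caller) != transport_for(callee)
```

A door-station call to a resident's phone crosses the bridge, while a call to the wired in-unit monitor stays SIP end to end, which is also why the monitor path can keep working locally.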

AI features that residents actually use

Not every “AI” feature earns its place. These are the four that land in real deployments:

1. Face recognition with liveness / anti-spoof

Hands-free entry for authorized people. Modern models reach 99.5%+ accuracy with under 200 ms of on-edge inference (Akuvox E18C class hardware). Anti-spoofing via depth sensing or IR liveness is table stakes, because a printed photo trivially defeats a model that only detects and matches faces. Compliance: human-in-the-loop review for denied access, a DPIA on file, and an opt-out alternative.

2. Package / delivery detection

Solves the #1 pain point in multi-tenant buildings. A classifier that differentiates people, packages, vehicles, and pets cuts nuisance push notifications by 70% and unlocks a “porch pirate” alert path. Runs on-edge on the door station.

3. QR / one-time PIN entry

Best solution for delivery drivers and short-term guests. Generate a time-boxed code in the resident’s app, deliver via SMS / email, camera scans QR at the door. No shared PINs, no forgotten codes, full audit trail.

4. Video voicemail + missed-call clip

Nobody’s phone rings every time. When no resident answers, record 30 seconds pre/post-press and notify. This single feature is often the difference between “we love it” and “we turned it off.”
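The pre/post-press recording in item 4 is usually built on a rolling buffer. The sketch below is one way to do it, assuming frames arrive one at a time from the camera pipeline; the class name and its methods are hypothetical, and a real device would buffer encoded video rather than Python objects.

```python
from collections import deque


class PrePostRecorder:
    """Keeps a rolling pre-roll; after a press, collects a fixed post-roll."""

    def __init__(self, pre_frames: int, post_frames: int) -> None:
        if pre_frames < 0 or post_frames < 1:
            raise ValueError("pre_frames must be >= 0 and post_frames >= 1")
        self._pre = deque(maxlen=pre_frames)  # rolling window before the press
        self._post_frames = post_frames
        self._clip = None  # clip being assembled, or None when idle
        self._remaining = 0

    def press(self) -> None:
        """Doorbell pressed: freeze the pre-roll and start collecting post-roll."""
        self._clip = list(self._pre)
        self._remaining = self._post_frames

    def push(self, frame):
        """Feed one frame; returns the finished clip once the post-roll is complete."""
        if self._clip is None:
            self._pre.append(frame)
            return None
        self._clip.append(frame)
        self._remaining -= 1
        if self._remaining == 0:
            clip, self._clip = self._clip, None
            return clip
        return None
```

At an assumed 15 frames per second, 30 seconds on each side of the press would mean `PrePostRecorder(450, 450)`; the finished clip is what gets attached to the missed-call notification.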

The protocol stack — what you’ll touch

| Protocol / standard | Role | Watch-out |
|---|---|---|
| SIP (RFC 3261) | Call signaling | Disable SIP ALG on customer routers or prepare for one-way audio |
| WebRTC | Browser / mobile media | Run your own TURN server; don’t rely on public relays |
| ONVIF Profile S / T | Surveillance integration | Essential for multi-vendor, optional for single-vendor |
| PoE+ (802.3at) / PoE++ (802.3bt) | Power door station (30–90 W) | Budget IR LEDs + OLED screen; long cable runs derate |
| Matter / Thread | Smart-home interoperability | Intercom profile limited in Matter 1.5; partial today |
| OSDP / Wiegand | Access-control readers | OSDP v2 encrypts; Wiegand is plaintext and spoofable |
| APNs VoIP Push / FCM | Wake-up mobile on incoming call | iOS CallKit + PushKit only; never regular push for ringing |
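Running your own TURN server usually means issuing short-lived credentials to clients rather than a static password. The sketch below follows the shared-secret scheme that coturn supports through its `use-auth-secret` option (username is `expiry:user`, password is the base64 HMAC-SHA1 of that username); the function name, the user ID format, and the one-hour default are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional, Tuple


def turn_credentials(
    shared_secret: str,
    user_id: str,
    ttl_s: int = 3600,
    now: Optional[float] = None,
) -> Tuple[str, str]:
    """Return (username, password) that a TURN server sharing the secret will accept."""
    current = time.time() if now is None else now
    expiry = int(current + ttl_s)          # Unix time after which the server rejects it
    username = f"{expiry}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    password = base64.b64encode(digest).decode()
    return username, password
```

The backend hands the pair to the mobile or web client inside the ICE server configuration; the secret itself never leaves the backend and the TURN server.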

The vendor landscape in 2026

Short map of the players we see in live deployments and where each sits in the value chain.

| Vendor | Best for | Watch-outs |
|---|---|---|
| Aiphone (IX / IXG) | Commercial and residential; well-integrated to major VMS | Hardware cost, slower cloud roadmap |
| Akuvox | Face-recognition heavy deployments, mid-range price | Firmware quality varies by SKU |
| 2N | Strong SIP compatibility, PBX-friendly | Premium price |
| DoorBird | Small-building and single-family segment | Cloud-tied features, API depth varies |
| Comelit / Siedle | European builds, 2-wire retrofit paths | Slower cloud roadmap, higher integration cost |
| ButterflyMX / Swiftlane / Latch | Multi-tenant residential SaaS | Tightly coupled hardware + SaaS; margin pressure on ISVs building alternatives |

Integrations that actually close deals

For B2B multi-tenant products, five integrations consistently separate a demo from a signed contract:

1. Property management (PMS). Yardi, Entrata, AppFolio, Buildium. Resident directories, move-in / move-out automation, lease-status sync. If you haven’t integrated at least one, expect to lose the deal to a vendor that has.

2. Access control. Salto, Brivo, Kisi, LenelS2, HID. OSDP secure channel, not Wiegand. One shared credential set for door, elevator, and amenities.

3. Smart home / voice. HomeKit, Alexa, Google Home, Matter. Residents want the doorbell to ring on their kitchen speaker, not a single app.

4. Surveillance / VMS. ONVIF Profile S/T to Milestone, Genetec, Eagle Eye. Share camera feeds with the existing security platform; avoid the operator-tab-explosion that kills adoption.

5. Delivery platforms. Amazon Key, UPS, FedEx, Instacart. One-time QR codes or short-lived OAuth-style tokens for couriers.
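The one-time, time-boxed codes mentioned for couriers and guests can be sketched as a signed token that the door station or backend verifies before unlocking. This is an illustrative design, not a description of any vendor's format: the payload fields, the function names, and the in-memory `used` set (which stands in for a database of redeemed codes) are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional


def issue_code(key: bytes, unit: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Create a signed token for one unit that expires after ttl_s seconds."""
    current = time.time() if now is None else now
    payload = {"unit": unit, "exp": int(current + ttl_s), "jti": secrets.token_hex(8)}
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    sig = base64.urlsafe_b64encode(hmac.new(key, body, hashlib.sha256).digest())
    return (body + b"." + sig).decode()  # this string is what the QR code encodes


def redeem_code(key: bytes, token: str, used: set, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the token is authentic, unexpired, and unused; else None."""
    try:
        body, sig = token.encode().split(b".")
    except ValueError:
        return None
    expected = base64.urlsafe_b64encode(hmac.new(key, body, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # forged or corrupted
    payload = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    if payload["exp"] < current or payload["jti"] in used:
        return None  # expired or already redeemed
    used.add(payload["jti"])  # single use; each redemption is also an audit-trail entry
    return payload
```

Because every token carries its own expiry and unique ID, there is no shared PIN to leak and each entry can be tied back to the resident who issued it.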

What’s next: trends to watch

Voice-AI concierge at the door. OpenAI Realtime and similar models (see our realtime voice guide) are already shipping in pilot deployments as “AI receptionists.” Visitors state their purpose; the model handles delivery, guest, service worker and similar cases without bothering residents.

Matter intercom profile maturity. Matter 1.5 added partial intercom support; 2.0 is expected to formalize the profile. When it does, cross-brand answer-the-door-from-a-smart-speaker stops being a project and becomes a checkbox.

EU AI Act enforcement (August 2026). High-risk biometric systems must have documentation, human oversight, and transparency notices. Every face-recognition intercom sold in the EU will need a compliance dossier.

5G fixed-wireless as primary uplink. For buildings without fiber, 5G FWA + a second LTE failover is becoming the default WAN. Plan your backend for dual-path uplinks.

Mini case: cloud-first video intercom for a 200-unit residential

Situation. A PropTech operator managed 200 apartments with a decade-old 2-wire audio intercom. Residents ignored it, deliveries piled up, and the operator was getting complaint tickets weekly. They wanted a single system — web admin, resident mobile app, cloud directory, delivery access — without pulling the existing audio-only panels on day one.

10-week plan. Weeks 1–2: retire-and-replace audit, lobby door station (PoE), 2-wire bridge for in-unit panels. Weeks 3–5: cloud backend on Kubernetes, SIP proxy, resident mobile app in React Native, VoIP push. Weeks 6–7: face recognition opt-in enrollment, one-time QR for deliveries, Yardi directory sync. Weeks 8–10: pilot, UX tuning, phased rollout across two buildings.

Outcome. Delivery-related tickets dropped 78%. Median call-answer time on residents’ phones landed at 1.6 seconds (goal was < 2). Opt-in face recognition adoption hit 54% in the first month. Cost came in 32% below the incumbent on-prem bid on a 5-year TCO.

Cost model: hardware + software + integration

| Line item | Residential 50 units | Commercial 200 doors | Notes |
|---|---|---|---|
| Door stations (hw) | $1,200–4,000 | $40k–120k | $400–2,500 per unit depending on premium features |
| Cabling + labor | $3k–8k | $30k–80k | New PoE pulls at $50–200/drop |
| Cloud / SaaS | $400–1,200/mo | $2k–5k/mo | $8–25 per door/month for SaaS |
| Integration (PMS, access) | $2k–6k one-time | $6k–20k one-time | Yardi / Entrata / AppFolio / custom |
| Custom SaaS build (MVP) | n/a | $120k–280k | 20–32 weeks with Agent Engineering on the scaffolding |

For an ISV building the SaaS itself (not just deploying one), the meaningful number is the custom build row. Our Agent Engineering practice typically trims 20–30% off baseline timelines for the SIP proxy, push gateway, device provisioning, and directory sync scaffolding, all of which is heavily pattern-based.
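For a deployment (rather than a custom build), the table rows roll up into a five-year range with simple arithmetic. The sketch below is a rough calculator under stated assumptions: the monthly cloud line stays flat for 60 months, the custom-build row is excluded, and discounting, support contracts, and hardware refresh are ignored. The `Range` type and function name are hypothetical.

```python
from typing import NamedTuple


class Range(NamedTuple):
    low: float
    high: float


def five_year_tco(
    hardware: Range,
    cabling: Range,
    integration: Range,
    cloud_per_month: Range,
    months: int = 60,
) -> Range:
    """Sum the one-time rows and add the recurring cloud row over the period."""
    low = hardware.low + cabling.low + integration.low + cloud_per_month.low * months
    high = hardware.high + cabling.high + integration.high + cloud_per_month.high * months
    return Range(low, high)


# Figures taken from the cost table above.
commercial = five_year_tco(
    Range(40_000, 120_000), Range(30_000, 80_000), Range(6_000, 20_000), Range(2_000, 5_000)
)
residential = five_year_tco(
    Range(1_200, 4_000), Range(3_000, 8_000), Range(2_000, 6_000), Range(400, 1_200)
)
```

With the table's numbers this gives roughly $196k–520k for the 200-door commercial case and $30.2k–90k for the 50-unit residential case, and in both cases the recurring cloud line is the largest single component over five years.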

Want a realistic MVP scope for your intercom SaaS?

Send us your target deployment size and compliance region — we’ll come back with a 12-week plan and rough cost in a 30-minute call.

Book a 30-min scoping call → WhatsApp → Email us →

A decision framework in five questions

Q1. Is this a product or a deployment? Product (your SaaS, many buildings): cloud-native, multi-tenant, ISV-grade. Deployment (one building, internal): off-the-shelf Aiphone / Akuvox / DoorBird + integration.

Q2. What’s your uptime requirement? If tenants can tolerate “delivery-only” during internet outages, cloud-only is fine. If building access must always work, hybrid with local SIP fallback is mandatory.

Q3. What’s the compliance region? EU / UK: DPIA, GDPR Art. 22, EU AI Act posture. US: state-level (IL BIPA, CA CCPA). APAC: varies widely. Your legal posture dictates whether face recognition ships at all.

Q4. What’s the existing wiring? PoE Cat6 in every stack: greenfield is easy. 2-wire legacy: budget a bridge or plan a rip-and-replace phased rollout. Some verticals (luxury residential) accept the disruption; others (budget rental) don’t.

Q5. Where does the call go? In-unit monitors only: SIP-only is enough. Mobile apps: WebRTC + VoIP push. Landlines: SIP trunk to PSTN. Most modern deployments answer all three at once.
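The five questions can be written down as a small lookup so that a scoping call produces the same answer every time. This is a sketch of the framework as stated above, not a product configurator; the field names, the string labels, and the `recommend` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Answers:
    is_product: bool                     # Q1: product (True) or single deployment (False)
    access_must_survive_outage: bool     # Q2: must building access work offline?
    region: str                          # Q3: "EU", "UK", "US", or anything else
    wiring: str                          # Q4: "poe" or "2wire"
    call_targets: Set[str] = field(default_factory=set)  # Q5: "monitor", "mobile", "landline"


COMPLIANCE = {
    "EU": ["DPIA", "GDPR Art. 22", "EU AI Act posture"],
    "UK": ["DPIA", "GDPR Art. 22", "EU AI Act posture"],
    "US": ["IL BIPA", "CA CCPA"],
}
TRANSPORT = {
    "monitor": "SIP",
    "mobile": "WebRTC + VoIP push",
    "landline": "SIP trunk to PSTN",
}


def recommend(a: Answers) -> Dict[str, object]:
    """Map the five answers to the recommendations given in the framework."""
    return {
        "build": "cloud-native multi-tenant SaaS" if a.is_product
        else "off-the-shelf hardware + integration",
        "topology": "hybrid with local SIP fallback" if a.access_must_survive_outage
        else "cloud-only",
        "compliance": COMPLIANCE.get(a.region, ["varies by country"]),
        "wiring": "greenfield PoE" if a.wiring == "poe"
        else "2-wire bridge or phased rip-and-replace",
        "transports": [TRANSPORT[t] for t in sorted(a.call_targets)],
    }
```

Keeping the mapping in data rather than in prose also makes it easy to see when two answers pull in different directions, for example a cloud-only preference in a building where access must survive an outage.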

Five pitfalls we see sink video intercom projects

1. SIP NAT traversal black holes. Symmetric NAT plus SIP ALG eats 30% of calls on residential ISPs. Run a TURN server, disable ALG at provisioning, and test from live routers before declaring victory.

2. Mobile push delivery delays. Android battery optimization (“Doze”) and iOS throttling can turn a 200 ms round-trip into 8 seconds. Use VoIP-class push (iOS PushKit + CallKit, Android high-priority Firebase Cloud Messaging), not regular notifications.

3. PoE budget exhaustion. 30 W PoE+ covers a basic station; add IR LEDs, a heated touchscreen, and a PTZ motor and you’re in 60–90 W territory. Spec PoE++ (802.3bt) at the switch or pick wall-powered variants.

4. Skipping the offline mode. Cloud goes down; the door must still work. Local SIP peer-to-peer between door station and in-unit monitor, plus queued events that sync on recovery, are non-negotiable in production.

5. Shipping face recognition without a DPIA. In the EU, GDPR Art. 22 means a denied-access decision cannot be solely automated; the AI Act adds documentation duties. Ship with a DPIA, opt-out flow, and human review or don’t ship.
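The PoE budget pitfall (item 3) is easy to check before hardware is ordered. The sketch below sums component loads and picks the smallest PoE class that covers them with headroom. The 30/60/90 W figures quoted above are power at the switch port; the table in the code uses the lower power guaranteed at the device after cable loss, per the IEEE 802.3 classes. The component wattages and the 20% headroom are hypothetical placeholders, so substitute datasheet values.

```python
from typing import Dict, Optional

# Power guaranteed at the powered device, in watts, by IEEE 802.3 class.
PD_BUDGET_W = {
    "802.3af": 12.95,
    "802.3at": 25.5,         # PoE+
    "802.3bt-type3": 51.0,   # PoE++
    "802.3bt-type4": 71.3,   # PoE++
}


def smallest_standard(load_w: Dict[str, float], headroom: float = 0.2) -> Optional[str]:
    """Return the smallest PoE class that covers the load plus headroom, or None."""
    need = sum(load_w.values()) * (1 + headroom)
    for name, budget in sorted(PD_BUDGET_W.items(), key=lambda kv: kv[1]):
        if need <= budget:
            return name
    return None  # exceeds PoE; use a wall-powered variant


# Hypothetical loads for illustration only.
basic = {"soc_and_camera": 8.0, "speaker_amp": 3.0, "screen": 5.0}
premium = dict(basic, ir_leds=6.0, screen_heater=15.0, ptz_motor=10.0)
```

With these placeholder numbers the basic station fits PoE+ while the premium one needs 802.3bt Type 4, which is the jump the pitfall describes.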

KPIs to track from day one

Quality KPIs. Call-answer latency P95 on mobile (target < 2 s), face-recognition true-positive rate (> 99.5%, which is the same target as a false-reject rate < 0.5%), package-detection precision (> 92%).

Business KPIs. Monthly answer-rate per resident (> 75%), delivery-related ticket volume trend (target month-over-month decrease), opt-in rate for face recognition (benchmark: 40–60%), time-to-provision a new unit (< 10 minutes self-serve).

Reliability KPIs. Door-station uptime (> 99.9%), push-delivery success rate (> 98%), offline-mode fallback success (> 99.5%), integration-sync freshness with PMS (< 5 min lag).
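Several of these KPIs are percentiles rather than averages, and the difference matters. The sketch below computes P95 with the nearest-rank method, one common definition among several; the sample latencies are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical call-answer latencies in seconds.
latencies = [0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.9, 3.2]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
meets_target = p95 < 2.0
```

In this made-up sample the median looks healthy while the P95 misses the 2-second target, which is exactly the tail that residents complain about.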

Security, compliance, and accessibility

GDPR + EU AI Act. Face data is “special category.” Require consent, keep a DPIA on file, never auto-deny entry solely on model output. AI Act enforcement on high-risk biometric systems starts August 2026 — plan your posture now, not in July.

US biometric privacy. Illinois BIPA, California CCPA, Texas CUBI, Washington HB 1493 all impose duties on biometric capture. Consent forms, disclosure, data-minimization, and audit logs are non-negotiable.

Accessibility (ADA + EN 301 549). Audio + visual feedback on door station, tactile buttons, hearing-loop support, color-contrast compliance in the app. In new commercial builds the building inspector may actively check.

Fire code + egress. Door hardware must fail-safe open on power loss where fire code requires. Your software cannot block egress — not even for unpaid rent.

When NOT to build your own video intercom product

You’re deploying to one building. Pick off-the-shelf — Aiphone, Akuvox, DoorBird, 2N. Integration to your PMS is a project; the hardware/firmware is not.

You need a product in three months and have never built real-time video. Partner or OEM. Building the SIP + WebRTC + mobile-push stack from scratch on that timeline is a false economy.

The market you’re targeting is tiny. Build-to-own only if you’re servicing hundreds-to-thousands of buildings; below that, margins don’t pay back custom firmware and compliance.

FAQ

Can I reuse my building’s 2-wire intercom wiring?

Not directly for IP video — the bandwidth isn’t there. Two options: a 2-wire-to-IP bridge (Comelit, Akuvox have retrofit kits) which works but adds latency and cost; or rip-and-replace with Cat6 + PoE during a phased upgrade. Bridge first if your budget or disruption tolerance is low, replace later.

How long does installation take?

Single-door commercial: 2–4 hours. 50-unit residential: 2–4 weeks including PMS integration and resident onboarding. 200+ door multi-building: 6–10 weeks with phased go-live. Most schedule risk comes from PMS / access-control integration, not the door hardware itself.

What happens when the internet goes down?

If you’ve designed for offline, local SIP between door and in-unit monitor still works, queued entry events fire when connectivity returns, and cached resident directories remain valid. Remote mobile answering stops. Cloud-only products go dark entirely — don’t ship one into a production building.
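The "queued entry events fire when connectivity returns" behavior is a store-and-forward queue on the door station. A minimal sketch, assuming a `send` callable that returns True only when the cloud has acknowledged the event; the class name is hypothetical, and a real device would persist the queue to flash so it survives a reboot.

```python
from collections import deque
from typing import Callable, Deque


class EventQueue:
    """Holds entry events locally and delivers them in order when the cloud is reachable."""

    def __init__(self, send: Callable[[dict], bool]) -> None:
        self._send = send
        self._pending: Deque[dict] = deque()

    def record(self, event: dict) -> None:
        """Queue the event, then try to deliver immediately."""
        self._pending.append(event)
        self.flush()

    def flush(self) -> int:
        """Deliver queued events oldest first; stop at the first failure. Returns count sent."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # still offline; keep the event and everything behind it
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self) -> int:
        return len(self._pending)
```

Events are only removed after a successful send, so an outage delays the audit trail but does not lose it; the door itself keeps unlocking from the cached directory in the meantime.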

What’s a realistic hardware budget per door?

$400–600 for an entry-level IP station (DoorBird S class). $1,500–2,500 for premium multi-tenant with touchscreen and face recognition (Akuvox, Aiphone IX). Cabling and labor add $50–200 per drop. Cloud / SaaS runs $8–25 per door per month at scale.

Will face recognition get me into regulatory trouble?

It can. In the EU keep a DPIA on file, offer an opt-out, and ensure a human reviews any denied-entry decision. In the US pay attention to Illinois BIPA and California CCPA; always get written consent before enrolling a face. Shipping face recognition without this legal scaffolding is the #1 regulatory risk in the category.
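The rule that a denied entry must not be decided by the model alone can be enforced in the decision function itself. The sketch below is one way to structure it: the model can only ever unlock or escalate, never deny. The names, the score scale, and the 0.8 threshold are hypothetical, and this is an engineering illustration rather than legal advice.

```python
from enum import Enum


class Decision(Enum):
    UNLOCK = "unlock"
    HUMAN_REVIEW = "human_review"        # ring the resident or concierge
    USE_ALTERNATIVE = "use_alternative"  # no consent on file: PIN, QR, or call


def decide(
    consent_on_file: bool,
    liveness_ok: bool,
    match_score: float,
    threshold: float = 0.8,
) -> Decision:
    """Face-entry decision with no automated-deny outcome."""
    if not consent_on_file:
        return Decision.USE_ALTERNATIVE   # opt-out path; no biometric matching
    if liveness_ok and match_score >= threshold:
        return Decision.UNLOCK
    return Decision.HUMAN_REVIEW          # failed liveness or low score goes to a person
```

Because `Decision` has no "deny" member, a refusal can only come from the human who picks up the escalated call, and that hand-off is the event worth logging for the DPIA.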

Does it integrate with Ring, Nest, Arlo, HomeKit, Alexa?

Selectively. HomeKit and Alexa announce / answer on smart speakers. Ring / Nest interop is usually one-way (video clips) rather than call-answer. Matter 1.5 covers some of this but intercom profiles are still emerging. Plan the integrations you name and nothing more.

Why are mobile notifications slow / sometimes missed?

Two reasons: OS background throttling and non-VoIP push categorization. On iOS use PushKit + CallKit for immediate wake-up; on Android use high-priority FCM messages. Regular notifications can be delayed 5–30 seconds by the OS, which kills the user experience.
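In practice the difference between a ringing push and a regular notification comes down to a few request fields. The sketch below only builds the requests and sends nothing: the APNs header names and the FCM HTTP v1 `android.priority` and `ttl` fields follow Apple's and Firebase's public documentation as far as we know them, while the bundle ID, the payload keys, and both function names are hypothetical.

```python
import json
from typing import Dict, Tuple


def apns_voip_request(bundle_id: str, call_id: str, caller: str) -> Tuple[Dict[str, str], bytes]:
    """Headers and body for a PushKit (VoIP) push to APNs."""
    headers = {
        "apns-push-type": "voip",
        "apns-topic": f"{bundle_id}.voip",  # VoIP pushes use the .voip topic suffix
        "apns-priority": "10",              # deliver immediately
        "apns-expiration": "0",             # do not store; a stale ring is useless
    }
    body = json.dumps({"call_id": call_id, "caller": caller}).encode()
    return headers, body


def fcm_ring_message(device_token: str, call_id: str, caller: str) -> dict:
    """Message body for the FCM HTTP v1 send endpoint: data-only, high priority."""
    return {
        "message": {
            "token": device_token,
            # Data values must be strings; the app builds the ringing UI itself.
            "data": {"type": "incoming_call", "call_id": call_id, "caller": caller},
            "android": {"priority": "HIGH", "ttl": "0s"},
        }
    }
```

On iOS the app is expected to report the call to CallKit as soon as the VoIP push arrives; on Android a zero TTL means a ring that cannot be delivered right away is dropped rather than shown late.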

How long does it take to build the SaaS platform itself?

MVP in 16–24 weeks if you’re leaning on mature door-station hardware and focusing your build on cloud backend + mobile + directory + integrations. Full production-grade platform with face recognition, PMS integrations, and compliance posture: 30–40 weeks. Our Agent Engineering practice typically shaves 20–30% off the scaffolding and adapter work.

Related reading

Surveillance: Integrating IoT with Video Surveillance. Reference architecture for IoT video, the neighboring domain to modern intercom.

Voice AI: OpenAI Realtime + WebRTC / SIP / WebSocket. The voice-AI layer that makes intercoms conversational in 2026.

Testing: How to test WebRTC stream quality. Measuring glass-to-glass latency, MOS, and freeze rate, the metrics that matter.

Latency: Minimizing latency to less than 1 second. Tuning the audio/video path so residents don’t notice the lag.

Services: Fora Soft video surveillance development. What we build and how: Netcam, VALT, and related deployments.

Ready to ship a video intercom residents actually use?

Three moves cover 80% of success. Pick hybrid cloud with local SIP fallback so the door still works when the internet doesn’t. Spend your latency budget where residents feel it — mobile push and WebRTC answer. Ship face recognition only with the full compliance scaffolding, because regulators now care and residents will ask.

If you’re building a product (not just deploying one), the architecture above is the starting blueprint. If you’re deploying one, use it as a shortlist of questions to hand to your vendor. Either way, real-time video products live or die by the details under the hood.

Talk to the team behind 50+ video and real-time products

30 minutes, your architecture on a whiteboard, and a walk-away plan for your video intercom product or integration, whether or not we end up building it together.
