Smart intercom system with video doorbell, mobile integration, and IoT connectivity

Key takeaways

IoT intercom software replaces the wallbox with the cloud. The doorstation becomes a SIP/ONVIF-speaking IoT endpoint; the logic, routing, video, and access control move to a cloud back-end reachable from any phone or desk.

The winning stack is SIP ingest, WebRTC delivery, and HTTPS for the control plane. SIP keeps you interoperable with 2N, Akuvox, Aiphone, Fanvil, Grandstream, and standard IP-PBXs; WebRTC lets the tenant app load the live video tile in under 500ms.

Security is the product, not a checklist. SRTP + SIPS + TLS 1.3 for media and signalling, IEEE 802.1X on the wired doorstation, hardware-backed passkeys on the mobile app, and a tamper-evident audit log server-side. Miss any one and procurement stalls.

AI is the 2026 differentiator. On-device face detection to pre-filter delivery vs visitor, server-side plate recognition, real-time call summaries, spatial audio for group concierge channels — all shipping this year in credible products.

Fora Soft builds this stack. We ship IP video, WebRTC, IoT control, and access-control software for residential, commercial, and healthcare deployments — our V.A.L.T. video platform is already deployed at 650+ organizations including police departments, medical facilities, and child advocacy centers.

Why Fora Soft wrote this playbook

For 19 years we have been building the plumbing behind connected video, audio, and access-control products — SIP stacks, WebRTC SFUs, ONVIF bridges, IoT firmware, cloud PBX integrations. Over 600 shipped projects, 100% Upwork success rate, 400+ client reviews. Modern IoT intercom software is a straight composition of those components.

We see clients stumble in the same three places: picking an intercom hardware vendor before the software architecture is locked; under-estimating what SIP interoperability actually costs; and treating security and audit as a Phase 2 item, which kills enterprise deals at procurement review. This playbook is the spec we wish every client had on the first scoping call.

The reference points below are taken from systems we have shipped: our video surveillance practice, our WebRTC expertise, and the IoT software we wrote for multi-site commercial deployments where thousands of doorstations sync to one cloud back-end.

Building an IoT intercom product — or fixing one that stalled at 200 units?

Tell us what you’re shipping and we’ll tell you what’s killing your rollout in 30 minutes.

Book a 30-min call → WhatsApp → Email us →

What actually changed: from wallbox to cloud platform

Twenty years ago an intercom was a two-wire analog loop from a doorstation to a handset in the kitchen. Today’s IoT intercom product is a doorstation endpoint, a cloud back-end, a tenant mobile app, and a staff dashboard — with a SIP trunk or a cloud PBX stitching them together. Every line of value you ship is in the software.

| Era | Where the logic lives | User device | Typical features |
|---|---|---|---|
| Analog (pre-2005) | Doorstation + handset | Wall handset | Audio, button-to-unlock |
| IP (2005–2020) | Local server / IP-PBX | Wall screen + desk phone | Video, directory, keypad access |
| Cloud/IoT (2020–now) | Cloud back-end | Smartphone + web dashboard | Remote unlock, virtual keys, analytics |
| AI-native (2026+) | Cloud + edge AI + LLM | Smartphone + voice + smart-home hub | Visitor classification, call summaries, auto-routing |

Who is buying IoT intercom software in 2026

Three buyer profiles dominate the category — and the product they need looks different.

1. MDU / residential operators. Apartment buildings replacing legacy buzzers with a tenant-app-first product. Value lift: virtual keys for deliveries, visitor history, Airbnb-style rentals without physical key handoff.

2. Commercial property / enterprise campus. Office towers, coworking, data centers. Value lift: SSO / Azure AD integration, visitor management with badge printing, audit trail for compliance (SOC 2, ISO 27001).

3. Healthcare / regulated verticals. Hospitals, aged-care, child advocacy centers, schools. Value lift: HIPAA-grade audit, nurse-call integration, clinician workflow routing. This is the profile that overlaps with our healthcare intercom work.

The protocol stack: SIP, ONVIF, WebRTC, MQTT

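Every credible IoT intercom product speaks four core protocols: SIP, ONVIF, WebRTC, and MQTT, with RTP carrying the media and FCM/APNs handling push. Which protocol runs where is the architectural call that determines cost and interoperability.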

| Protocol | Role | Where it runs | Encrypted variant |
|---|---|---|---|
| SIP | Call signalling / registration | Doorstation ⇆ cloud / IP-PBX | SIPS (TLS) |
| RTP / RTCP | Audio/video media | Doorstation ⇆ SFU ⇆ client | SRTP (AES-GCM) |
| ONVIF Profile S/T | Camera discovery & metadata | Doorstation → VMS | HTTPS + WS-Security |
| WebRTC | Sub-500ms live video on mobile | SFU → tenant app | DTLS-SRTP (mandatory) |
| MQTT / CoAP | IoT telemetry & control | Doorstation → cloud broker | MQTT over TLS 1.3 |
| FCM / APNs | Push wake-up for tenant app | Cloud → mobile OS | HTTPS (Google/Apple) |

Reach for SIP ingest + WebRTC delivery when: you need interoperability with off-the-shelf IP intercom hardware (Akuvox, Fanvil, Aiphone, 2N, Grandstream) AND the tenant experience is an iOS/Android app that must load a live video tile in under a second. That covers 90% of commercial deployments.

Reference architecture for a shippable product

Every IoT intercom platform we plan reduces to the seven components below. Each is a distinct deployable that we build and operate today.

| Layer | Component | Role | Typical tech |
|---|---|---|---|
| Edge device | Doorstation firmware | SIP client + ONVIF server + NFC/BLE reader | Linux embedded, Yocto |
| Signalling | SIP proxy / PBX | Call routing, registration, presence | Kamailio, Asterisk, FreeSWITCH, 3CX |
| Media fan-out | SFU | Distributes WebRTC to N viewers | LiveKit, mediasoup, Janus |
| Control plane | IoT broker + REST API | Device twins, access policies, telemetry | EMQX / HiveMQ / AWS IoT + Nest/Go/Node |
| AI services | Face/plate/visitor classification | Pre-filter events, summarise calls | YOLOv11, TFLite on edge, Vision-LLM |
| Tenant app | iOS + Android native | Live tile, unlock, history, virtual keys | SwiftUI, Jetpack Compose, libwebrtc |
| Admin dashboard | Web app | Property mgmt, audit, user admin, SSO | React/Next.js, OIDC |

The feature set that actually sells a 2026 IoT intercom product

Property managers, facilities teams, and hospital operations directors will compare your product against Butterfly MX, 2N, Akuvox, Aiphone, and Latch. Eleven capabilities are table stakes; three are differentiators:

1. Sub-second live video tile. WebRTC, not HLS. If the tenant taps “see visitor” and waits four seconds for video, they stop using the app.

2. Push-to-answer notification. FCM/APNs high-priority channel with a rich preview (snapshot of the visitor) and a two-tap answer flow.

3. Virtual key / QR link for deliveries. Time-boxed (e.g. 20 minutes), single-use, optional geofence. This is the feature that converts tenants from “ok app” to daily-active.

4. Call forwarding / escalation chain. Ring tenant 1, then tenant 2, then concierge, then voicemail. Configurable per-unit.

5. Access methods. App unlock, PIN, NFC card, BLE key, QR code, licence-plate recognition for parking gates. Support all six.

6. Visitor history & clip retention. 30–90 days of events with preview JPEG plus on-demand clip retrieval. Tenants use this to prove a delivery claim; property managers use it to resolve disputes.

7. SSO for enterprise deployments. SAML and OIDC. Azure AD / Okta / Google Workspace integration is non-negotiable above 500 units.

8. SIP trunk / PBX bridge. For buildings that still have a front desk with desk phones. The call should ring the desk when the app is idle.

9. Multi-tenant admin with fine-grained RBAC. Who can issue virtual keys, who can download clips, who can reboot devices.

10. Offline continuity. Doorstation unlocks via cached PIN/NFC even if the cloud is unreachable. This is a deal-breaker for hospital and government buyers.

11. Over-the-air firmware updates. Signed builds with rollback, staged rollout to a canary cohort first. Never brick a door.

Differentiator A — AI visitor triage. Classify visitor intent (delivery / resident’s guest / cold caller) from the camera frame before the tenant sees the notification.

Differentiator B — voice AI concierge. The doorstation answers when no tenant is reachable: “Hi, who are you here to see? I’ll let them know.” Transcribed to a tenant notification.

Differentiator C — spatial audio group calls. For concierge / facilities team channels, spatial audio makes multi-party coordination feel natural. Shipping in credible products this year.
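The virtual key in item 3 can be sketched in a few lines. The example below is a minimal illustration, not our production code: it signs a door ID, an expiry, and a random nonce with HMAC-SHA256, then rejects the token if it is expired, presented at the wrong door, or already used. The function names, the `SECRET` constant, and the in-memory `_used_nonces` set are assumptions made for the example; a real deployment would keep the key in a KMS or HSM and track redeemed nonces in shared storage. The optional geofence check is left out.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

SECRET = b"demo-signing-key"  # assumption: held in a KMS/HSM in production
_used_nonces = set()          # assumption: a shared store in production


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_virtual_key(door_id: str, ttl_seconds: int = 20 * 60,
                      now: Optional[float] = None) -> str:
    """Return a token 'door_id.expiry.nonce.signature' valid for ttl_seconds."""
    now = time.time() if now is None else now
    expiry = int(now) + ttl_seconds
    nonce = secrets.token_hex(8)
    payload = f"{door_id}.{expiry}.{nonce}"
    return f"{payload}.{_sign(payload)}"


def redeem_virtual_key(token: str, door_id: str,
                       now: Optional[float] = None) -> bool:
    """Accept the token once, at the right door, before it expires."""
    now = time.time() if now is None else now
    try:
        tok_door, expiry, nonce, sig = token.rsplit(".", 3)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{tok_door}.{expiry}.{nonce}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False  # forged or altered token
    if tok_door != door_id or now > int(expiry):
        return False  # wrong door or time box elapsed
    if nonce in _used_nonces:
        return False  # single-use: already redeemed
    _used_nonces.add(nonce)
    return True
```

The token string can be embedded in a QR code or a link sent to the courier. Verifying the signature before parsing the expiry means the doorstation or cloud never trusts fields that were not issued by the back-end.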

Need WebRTC-grade live video delivery for a smart intercom?

We’ve shipped WebRTC SFU deployments at the scale where sub-500ms tiles just work. Let’s talk.


Security, privacy, and compliance essentials

Your doorstation is a camera, microphone, and lock on the perimeter of a building. Every security detail matters. Nine non-negotiables we ship into every IoT intercom product:

1. TLS 1.3 everywhere. Control plane, MQTT, HTTP API — nothing in the clear, ever.

2. SRTP + DTLS for media. AES-GCM, key rotation per session. WebRTC enforces this; SIP intercom firmware often does not — verify.

3. IEEE 802.1X on wired doorstations. Prevents network-jack attacks in commercial properties.

4. Hardware-backed device identity. Per-device certificate provisioned at factory, stored in a secure element (ATECC608A or equivalent). No shared secrets.

5. Passkeys / WebAuthn on the tenant app. Retire SMS OTP; use device-bound passkeys.

6. Tamper-evident audit log. Every unlock, call, config change — server-side, append-only, exportable for SOC 2 / HIPAA auditors.

7. Signed firmware + rollback protection. No unsigned updates; monotonic version counter prevents downgrade to vulnerable firmware.

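8. PII minimisation. If identity matching is done server-side, store face embeddings rather than raw images, encrypted and scoped per site; a one-way cryptographic hash of an embedding cannot be used for similarity matching. Never store raw face images longer than needed.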

9. GDPR / HIPAA data-flow diagrams. Publish them. Enterprise procurement asks on day one.
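One common way to make the audit log in item 6 tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below illustrates that idea only; the function names and the event fields are invented for the example, and it is not a description of any specific product's storage format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_event(log: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "event": event,
        "prev": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    })


def verify_log(log: list) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only detects tampering if the latest hash is also anchored somewhere the writer cannot rewrite, for example a periodic export handed to the auditor. Truncation of the newest entries is otherwise invisible.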

The AI layer: where to put intelligence in an intercom product

AI in an intercom is a hybrid: some models run on the doorstation chipset, some in the cloud. Getting the split right keeps latency down, bandwidth cheap, and privacy defensible.

| Capability | Runs where | Typical model | Engagement lift |
|---|---|---|---|
| Person/vehicle detection | Doorstation (edge) | YOLOv11-nano / MobileNet-SSD | Cuts false alerts 90% |
| Visitor intent classification | Cloud GPU | Vision-LLM (small) | Smarter notification routing |
| Licence plate recognition | Edge + cloud | PaddleOCR / commercial ANPR | Keyless parking entry |
| Voice concierge / triage | Cloud LLM + TTS | OpenAI / Anthropic / self-hosted | Fewer missed visitors |
| Call summary & transcript | Cloud | Whisper + LLM summariser | Audit trail at zero effort |

See our AI-powered video surveillance deep dive for the training-data and MLOps side of this.

Mini case: a multi-site commercial intercom rollout

Situation. A facilities operator with 40 commercial properties and 1,200+ doors wanted to retire a patchwork of legacy Aiphone hardware and three different call-routing engines. Tenants complained about slow answer times; facilities had no audit trail for unlock events.

12-week plan. Weeks 1–3: dual-write bridge between legacy Aiphone SIP stack and our new cloud back-end; existing hardware keeps working. Weeks 4–8: tenant iOS/Android app with WebRTC live tile + virtual keys + unlock history. Weeks 9–12: admin dashboard with SAML SSO, audit export, device fleet health; controlled replacement of 40 doors with newer 2N firmware.

Outcome pattern. Tenant notification-to-answer time dropped from ~8s to ~1.4s; audit exports went from “we can’t” to a one-click CSV; facilities team recovered ~9 hours per week previously spent on manual call routing. The existing intercom hardware kept working throughout — no big-bang replacement required. Want a similar migration plan? Book 30 minutes.

What an IoT intercom build actually costs

These are realistic ranges for teams using our Agent Engineering workflow (Claude + senior reviewers), which compresses the delivery curve compared with traditional outsourcing. Treat them as a starting point.

| Scope | What’s in | Timeline | Ballpark |
|---|---|---|---|
| Cloud back-end + tenant iOS/Android MVP | SIP/WebRTC bridge, push, unlock, history | 12–16 weeks | From $60k |
| + admin dashboard + SSO + audit | React/Next.js, SAML/OIDC, RBAC | +6–8 weeks | +$25–35k |
| + AI layer | Edge detection, cloud triage, call summaries | +6–10 weeks | Sized per device count |
| Ongoing product care | Firmware releases, SIP compat, scaling | Monthly retainer | Team-as-a-service |

A decision framework — pick your stack in five questions

Q1. Are you shipping the hardware too, or software-only on top of 3rd-party intercoms? Hardware → add firmware + certification to the plan (12–18 extra weeks). Software → interoperability with 2N/Akuvox/Aiphone/Fanvil dominates the design.

Q2. Is there an existing PBX or SIP trunk customers must keep? Yes → bridge pattern: your cloud speaks SIP to the customer’s IP-PBX. No → run your own cloud PBX and save a subscription fee.

Q3. Regulated vertical (HIPAA, CJIS, GDPR, SOC 2)? Yes → single-tenant or on-prem option, encrypted audit, BAAs; budget +15–25% for compliance engineering. See our healthcare intercom playbook.

Q4. How many devices per site, and across how many sites? < 50 devices / 1 site → single-region cloud is fine. > 500 devices across regions → multi-region deploy, per-region SFU, geo-steered push.

Q5. Who opens the app at 2am when the doorbell rings? The tenant profile decides the feature priority: residential wants virtual keys + deliveries; commercial wants visitor management + SSO; healthcare wants audit + nurse-call integration.

Five pitfalls we see kill IoT intercom projects

1. Picking hardware before the software architecture is done. The SIP dialect of your chosen doorstation shapes your SIP proxy, your RTP timing, and your interop test matrix. Lock the architecture, then shortlist vendors.

2. Underestimating offline behaviour. The first time the cloud dies and every door in a building refuses to open is the last day the product has a customer. Ship offline-cached unlock paths from v1.

3. SMS OTP as the only MFA. It no longer passes enterprise security review. Ship passkeys / WebAuthn on day one.

4. Building “just” a resident app. Without a concierge / ops dashboard, facilities teams resist the rollout. Both audiences ship together, always.

5. Skipping OTA firmware infrastructure. Staged rollout with canary cohort + rollback is not optional; it is how you avoid a bad build bricking 1,000 doors in a weekend.
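The staged rollout in pitfall 5 usually comes down to two decisions: which devices belong to the current cohort, and whether the cohort is healthy enough to widen. The sketch below shows one possible approach, assuming devices are bucketed by a hash of their ID. The stage percentages (1, 10, 50, 100) and the 1% failure ceiling are illustrative assumptions, and all function names are hypothetical.

```python
import hashlib

STAGES = [1, 10, 50, 100]  # assumption: percent of the fleet per stage


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket 0..99 for a given release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_update(device_id: str, release: str, rollout_percent: int) -> bool:
    """True if the device falls inside the current rollout percentage."""
    return rollout_bucket(device_id, release) < rollout_percent


def next_stage(current_percent: int, failures: int, attempts: int,
               max_failure_rate: float = 0.01) -> int:
    """Return the next rollout percentage, or 0 to halt and roll back."""
    if attempts and failures / attempts > max_failure_rate:
        return 0
    for stage in STAGES:
        if stage > current_percent:
            return stage
    return 100
```

Because the bucket depends only on the release and the device ID, a door that was in the 10% cohort stays in the rollout when it widens to 50%, and a different release picks a different canary set.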

Rebuilding your intercom software stack?

We’ve shipped SIP + WebRTC + IoT combinations at multi-site scale. Bring the brief — we’ll scope the migration without forcing a hardware swap.


KPIs: what to measure before and after every release

Quality KPIs. Tenant notification-to-answer time < 2s; WebRTC tile time-to-first-frame < 800ms on LTE; unlock command success rate > 99.5%; false-detection rate on edge AI < 3%.

Business KPIs. Virtual-key issuance per tenant per month (adoption proxy); daily-active tenants > 40%; missed-call rate < 8%; concierge escalation rate (for commercial) declining release-over-release.

Reliability KPIs. Doorstation uptime > 99.95%; OTA update success rate > 99%; cloud API p95 < 250ms; crash-free sessions > 99.5% on tenant apps.
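A release gate over these KPIs can be as simple as comparing measured values to the thresholds above. The following is a minimal sketch using four of them; the metric keys and function names are invented for the example, and the percentile uses the nearest-rank method, which is one of several valid definitions.

```python
import math

# (kind, limit): "max" means the value must stay below the limit,
# "min" means it must stay above it. Limits are taken from the KPI lists.
THRESHOLDS = {
    "api_p95_ms": ("max", 250),
    "ttff_ms": ("max", 800),
    "unlock_success_pct": ("min", 99.5),
    "ota_success_pct": ("min", 99.0),
}


def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def failing_kpis(measured: dict) -> list:
    """Return the names of KPIs that miss their threshold."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = measured[name]
        ok = value < limit if kind == "max" else value > limit
        if not ok:
            failures.append(name)
    return failures
```

Running `failing_kpis` on the same measurement window before and after a release makes regressions visible as a diff in the returned list rather than a judgment call.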

Integrations that unlock enterprise sales

The intercom is one edge in a building’s technology graph. Six integrations turn a commodity product into a platform sale:

1. Access-control / badge systems. HID, Genetec, LenelS2, Kisi. Unlock events flow both ways.

2. VMS / NVR. Milestone XProtect, Avigilon, Genetec Security Center. ONVIF Profile S/T compliance is the ticket.

3. Property-management software. Yardi, AppFolio, Buildium. Tenant CRUD, lease life-cycle, virtual-key entitlement sync.

4. Smart-home ecosystems. Matter / Thread, Apple HomeKit, Google Home, Alexa. Unlock from a wall panel; announce visitor on speakers.

5. Calendar / visitor-management. Envoy, Proxyclick, iOffice. Pre-announced visitors skip the call.

6. IP-PBX / UC. Cisco Webex, 3CX, RingCentral. The front-desk phone still rings when the tenant app does not answer.
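The PBX fallback in item 6 is the same escalation chain described in the feature list: try each target in order and stop at the first answer. A minimal sketch follows, assuming a hypothetical `ring` callback that places the call leg, waits up to the target's timeout, and reports whether it was answered. The `Target` type and the names in the usage example are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Target:
    name: str          # e.g. a tenant app, a desk phone extension, voicemail
    ring_seconds: int  # how long to ring before moving on


def route_call(chain: List[Target],
               ring: Callable[[Target], bool]) -> Optional[str]:
    """Ring each target in order; return the first one that answers."""
    for target in chain:
        if ring(target):
            return target.name
    return None  # nobody answered; the caller decides what happens next
```

For instance, a per-unit chain of `Target("tenant-app", 20)`, `Target("front-desk-phone", 15)`, and `Target("voicemail", 0)` rings the desk phone only after the app leg times out. Keeping the chain as plain data is what makes it configurable per unit.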

A 90-day rollout plan that actually ships

Weeks 1–3. Architecture lock: vendor shortlist, protocol matrix, security questionnaire, cloud region selection. One tenant mobile prototype calling one doorstation end-to-end.

Weeks 4–8. Cloud back-end build-out: SIP proxy, SFU, IoT broker, push, tenant app MVP with live tile + unlock + history.

Weeks 9–11. Admin dashboard, SSO, RBAC, OTA firmware pipeline, audit export. Pilot deployment to 2–3 sites with internal staff as “tenants”.

Week 12. Telemetry review, soak-test, go/no-go gate for the first production site. Staged rollout to 10% of units first.

When not to build your own IoT intercom platform

Not every operator needs to own a software platform. Skip the custom build if: you manage fewer than ~30 doors total; you have no product team to operate the back-end long-term; and your tenants are happy with an off-the-shelf Akuvox or Butterfly MX consumer-grade SaaS.

Revisit the build option when your differentiation story depends on the software (virtual keys, AI triage, vertical-specific workflow) and you have a credible path to > 1,000 units of distribution.

Matter / Thread as the home integration fabric. The smart-home bridge is consolidating on Matter. Ship Matter-over-Wi-Fi/Thread support for residential products.

Conversational AI concierge. Real-time ASR + LLM for voice triage is commoditising — expect it to move from differentiator to table stakes by late 2027.

On-device face identity with privacy guarantees. Face embeddings generated on the doorstation, matched against a per-site encrypted index. Avoids centralised biometric databases.

Tighter convergence with video surveillance. The doorstation and the building’s VMS are increasingly the same event stream. See our custom video surveillance solutions for the VMS-side view.

FAQ

Do I need my own cloud PBX, or can I use a third-party SIP trunk?

Start with a hosted SIP trunk (Twilio, Telnyx, Voxbone) to get to market fast. Move to a self-operated Kamailio or FreeSWITCH only once unit economics or regulatory constraints justify the ops overhead. A common hybrid: Kamailio in-house for routing and registration, Twilio for PSTN egress.

Can I keep existing Aiphone / 2N hardware and just modernise the software?

In most cases yes. The hardware speaks SIP and ONVIF; you bridge into a new cloud back-end and layer the tenant app + admin dashboard on top. We deliver this as a dual-write migration so the legacy system keeps working during cutover.

WebRTC vs RTSP vs HLS — which should I use inside the tenant app?

WebRTC for live doorstation view — sub-500ms end-to-end. RTSP/SIP stays on the server side, where the SFU pulls from the doorstation. HLS shows up only for archived clip playback in the dashboard, not the real-time call.

How do I handle door unlocks when the cloud is unreachable?

Cache the unlock credentials on the doorstation. PIN, NFC, and BLE key verification happen locally against a signed, periodically-refreshed policy bundle. The device reports the event asynchronously when connectivity returns. This is a hard requirement for hospital and government deployments.
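As a rough illustration of that flow, the sketch below verifies a policy bundle, checks a PIN locally, and queues the event for later upload. It makes several simplifying assumptions: the bundle is authenticated with HMAC-SHA256 as a stand-in for the asymmetric signature a real device would verify, PINs are stored as salted SHA-256 digests (a production system would use a slow key-derivation function), and an expired bundle fails closed. All names and fields are hypothetical.

```python
import hashlib
import hmac
import json


def _bundle_mac(policy: dict, key: bytes) -> str:
    body = json.dumps(policy, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def sign_bundle(policy: dict, key: bytes) -> dict:
    """Cloud side: wrap the policy with an authentication tag."""
    return {"policy": policy, "sig": _bundle_mac(policy, key)}


def pin_digest(pin: str, salt: str) -> str:
    return hashlib.sha256((salt + pin).encode()).hexdigest()


def unlock_offline(bundle: dict, key: bytes, pin: str, now: int,
                   pending_events: list) -> bool:
    """Device side: decide locally, then queue the event for upload."""
    expected = _bundle_mac(bundle["policy"], key)
    if not hmac.compare_digest(bundle["sig"].encode(), expected.encode()):
        return False  # bundle altered or signed with a different key
    policy = bundle["policy"]
    if now > policy["valid_until"]:
        return False  # stale bundle: this sketch fails closed
    granted = pin_digest(pin, policy["salt"]) in policy["pin_digests"]
    # Reported asynchronously once connectivity returns.
    pending_events.append({"ts": now, "type": "pin_unlock", "granted": granted})
    return granted
```

The `valid_until` field is what "periodically refreshed" means in practice: the shorter the window, the faster a revoked PIN stops working offline, at the cost of more doors failing closed during a long outage.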

Is SMS OTP acceptable for tenant MFA in 2026?

No. Enterprise security reviews flag SMS OTP for SIM-swap vulnerability. Ship passkeys / WebAuthn as the primary factor, with TOTP as fallback. Both are native on modern iOS and Android.

How fast is the tenant notification latency we should target?

Doorbell press to phone ring under 1.5 seconds; phone tap to live video tile under 800ms on LTE. These are the thresholds at which the experience feels native rather than flaky. WebRTC + FCM/APNs high-priority channel + a warm signalling socket hit the numbers.

What is the minimum compliance posture for healthcare deployments?

BAAs with every subprocessor, TLS 1.3 everywhere, SRTP for media, tamper-evident audit log with 6+ years retention, role-based access control, and explicit PHI minimisation in call transcripts. See our healthcare intercom benefits deep dive.

How long does it take Fora Soft to ship an IoT intercom MVP?

A SIP-bridge + tenant iOS/Android app MVP with live WebRTC tile, unlock, and history ships in 12–16 weeks. Adding a full admin dashboard with SSO and audit takes 6–8 more weeks. The AI layer on top is another 6–10 weeks, sized to the camera count. Our Agent Engineering workflow compresses these vs traditional outsourcing.

Primer. IoT intercom systems: the 2026 buyer’s guide. The vendor landscape and decision criteria for picking hardware.

AI layer. How machine learning in intercoms enhances communication. The model selection, training, and deployment detail behind the AI layer.

Vertical. How intercom software benefits healthcare facilities. HIPAA-grade intercom: compliance, nurse-call integration, workflow routing.

AI surveillance. AI-powered video surveillance: 6 benefits for business security. The classifier and anomaly-detection stack that powers smart alerts.

Case study. VALT: video surveillance recognised by US police. The multi-site audit + retention + workflow pattern we reuse in intercom.

Ready to ship an IoT intercom product buyers actually choose?

The formula is consistent: SIP + ONVIF on the edge, WebRTC for the tenant experience, MQTT/TLS for the control plane, passkeys for auth, signed-and-staged OTA, tamper-evident audit, and an AI layer that actually saves the tenant a tap. Miss one and either engagement stalls or procurement stalls.

We ship this stack every quarter. Bring us your current architecture or your spec — we will map the 90-day path to a product buyers pick over the incumbents, and tell you what it costs.

Let’s scope your IoT intercom platform

30 minutes, realistic estimate, concrete next steps — no obligation.

