How real-time AI agents work end to end. The twelve components every production stack ships. The five-stage runtime that holds end-to-end latency to 1.2 to 3.5 seconds. The build-versus-buy thresholds that determine whether a custom agent or a SaaS platform is the right call. Drawn from the agents we have shipped in industrial surveillance, healthcare voice intake, real-time interpretation, and live event translation.
An AI agent for real-time voice and video is a software service that processes audio and video frames simultaneously through ASR, a vision-language model, and an LLM, then streams a response back via TTS or video synthesis with 1.2 to 3.5 seconds of end-to-end latency.
Every production stack ships the same twelve components. Media gateway. ASR. LLM gateway. Function-call layer. TTS. Video synthesis (when the agent has a face). Persistent memory. Observability. Billing and metering. Guardrails. Evaluation. Knowledge base. What changes per project is which vendor fills each slot. The slots themselves do not change.
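To make the slot framing concrete, here is a minimal sketch of the twelve slots as a typed config. The vendor defaults are illustrative stand-ins, not prescriptions; the point is that every slot exists in every build and each swaps out independently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentStack:
    """The twelve slots every production agent stack fills.
    Vendor defaults are illustrative, not prescriptive: any slot
    can be swapped without touching the others."""
    media_gateway: str = "livekit"
    asr: str = "deepgram"
    llm_gateway: str = "gpt-4o"            # behind a routing gateway in production
    function_call_layer: str = "custom"    # domain tools: CRM, scheduling, lookups
    tts: str = "elevenlabs"
    video_synthesis: Optional[str] = None  # filled only when the agent has a face
    persistent_memory: str = "pgvector"
    observability: str = "opentelemetry"
    billing_metering: str = "custom"
    guardrails: str = "custom"
    evaluation: str = "rubric-eval"
    knowledge_base: str = "pinecone"

# A voice-only receptionist fills eleven slots; video synthesis stays empty.
receptionist = AgentStack()
# An agent with a face fills all twelve.
avatar_agent = AgentStack(video_synthesis="custom-avatar")
```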
Build custom when you need real-time video, custom turn-taking, regulated-industry compliance (HIPAA, SOC 2, GDPR, FINRA), brand voice cloning, or the agent embedded inside your own product. Use SaaS (Vapi, Retell, OpenAI Realtime API) when the agent is a generic voice receptionist running under 50 hours a month. Hybrid is the most common 2026 pattern: SaaS for the runtime, custom for the domain layer. The decision threshold is usage volume plus compliance scope.
Four shapes of AI agent dominate the 2026 landscape. Each gets a different stack, a different latency budget, and a different failure mode.
Voice agents. Phone-based receptionists, intake, scheduling, qualification. Tuned for end-to-end latency under 1.2 seconds with explicit barge-in handling so the conversation does not feel robotic (a barge-in sketch follows this list).
Real-time video agents that watch a stream, react, and respond. Live moderation, incident detection, virtual receptionists with face recognition, video-interpreting agents that point at and explain what the camera sees.
Voice plus video plus screen-share plus live data. Multimodal agents process audio and video frames simultaneously, run them through ASR, a vision-language model, and an LLM in parallel, and stream the response back via TTS or video synthesis.
Healthcare intake (HIPAA-required), legal triage, e-commerce assist, financial qualification (FINRA-aware). Vertical agents need vertical-aware guardrails. That is the reason generic SaaS struggles with them.
Each one has a different latency budget, observability priority, and failure mode. The shape of the agent determines which of the twelve components matter most.
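Barge-in, flagged under voice agents above, is the mechanic that keeps a call feeling human: the agent stops talking the instant the caller starts. A minimal sketch, assuming an async TTS playback task and a VAD event queue from your media layer; both names are hypothetical stand-ins for whatever your stack provides.

```python
import asyncio

async def speak_with_barge_in(tts_task: asyncio.Task,
                              vad_events: asyncio.Queue) -> bool:
    """Play a TTS response, but cut it off the instant the caller speaks.

    `tts_task` streams synthesized audio to the caller; `vad_events`
    yields "speech_start" when voice activity appears on the inbound
    track. Returns True if the caller barged in."""
    vad_wait = asyncio.ensure_future(vad_events.get())
    done, _ = await asyncio.wait({tts_task, vad_wait},
                                 return_when=asyncio.FIRST_COMPLETED)
    if vad_wait in done and vad_wait.result() == "speech_start":
        tts_task.cancel()   # stop talking the moment the human does
        return True         # the turn loop flips back to listening
    vad_wait.cancel()       # TTS finished uninterrupted; stop watching VAD
    return False
```

The subtlety: cancellation has to reach the audio track itself, not just the LLM loop, or buffered speech keeps playing after the interrupt.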
Customer support, intake, scheduling, qualification. LiveKit + ASR + GPT-4o + ElevenLabs. End-to-end latency 800 ms to 1.4 s.
Multimodal agents that watch the stream, point at things, explain what they see. Telemedicine triage, equipment troubleshooting, in-car assistants.
Call routing, lead qualification, hand-off to a human at the right moment. Function-call density is high. The wrong moment to escalate is the failure mode that matters.
Live-stream moderation, harmful-content detection, profanity filtering, brand-safety enforcement. Silent agents that watch the room.
Three or more agents coordinating under a supervisor pattern. Triage agent, specialist agent, escalation agent (a minimal supervisor sketch follows this list). Hardest to ship. Biggest leverage when it does.
Healthcare intake (HIPAA), legal triage, e-commerce assist, financial qualification (FINRA-aware). Vertical-aware guardrails baked in from the first build.
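The supervisor pattern from the multi-agent item above reduces to a routing loop. A minimal sketch with hypothetical stand-in agents; in production each callable is an LLM-backed agent with its own prompt, tools, and guardrails.

```python
from typing import Callable

Agent = Callable[[str], str]  # one turn in, one turn out (stand-in signature)

def make_supervisor(triage: Agent, specialists: dict[str, Agent],
                    escalate: Agent) -> Agent:
    """Supervisor pattern: triage labels the turn, the supervisor routes it,
    anything unrecognized escalates to a human."""
    def supervisor(user_turn: str) -> str:
        label = triage(user_turn)        # e.g. "billing", "clinical", "unknown"
        specialist = specialists.get(label)
        if specialist is None:
            # Escalating at the wrong moment is the failure mode to watch.
            return escalate(user_turn)
        return specialist(user_turn)
    return supervisor

# Hypothetical usage: lambdas stand in for LLM-backed agents.
agent = make_supervisor(
    triage=lambda t: "billing" if "invoice" in t.lower() else "unknown",
    specialists={"billing": lambda t: "Pulling up that invoice now."},
    escalate=lambda t: "Connecting you to a human specialist.",
)
print(agent("I have a question about my invoice."))  # -> billing specialist
```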
Industrial multimodal AI, healthcare voice intake, real-time interpretation, live event translation. Four production builds currently running.
MindBox. ANPR identifying 500,000+ vehicles per day at roughly 95% accuracy. Automatic anomaly detection (intrusions, fires, crowd buildup) with instant operator alerts and event-triggered recording. Deployed across transport, pharma, and gated communities.
The agent verifies caller identity, looks up the patient record, presents available slots with provider names, answers provider-info questions, books the appointment, and schedules reminder notifications. Replaces a manual scheduling workflow that previously ran through receptionists.
TransLinguist. Marketplace of 30K+ certified interpreters. Won the NHS National Framework for Language Services tender covering NHS organizations, councils, schools, police forces, and fire services. Clients report 80% reduction in interpreting costs and 2x ROI inside two years.
VOLO.live. Attendees scan a QR code, pick a language, and receive AI-powered subtitles or voiceovers instantly. No app install. Deployed at Black Hat Briefings 2025 (22,000+ participants), HIMSS 2025/2026, and GDC 2026. Speakers switch language live mid-talk.
Three architectural paths for shipping an AI agent. None is universally correct. The right choice is a function of usage volume, compliance constraints, customization depth, and whether the agent has to live inside your own product.
Custom build. Wins when: real-time video is required, turn-taking needs domain logic, the industry is regulated (HIPAA, SOC 2, GDPR, FINRA), the brand needs voice cloning, the agent has to be embedded inside your own product, or you are orchestrating three or more agents.
Cost shape: $10K to $40K over 2 to 6 months. Plus ~$2K monthly operations (depends on the specific case). Usage above ~200 hours per month tends to make custom cheaper than SaaS at scale; the break-even arithmetic is sketched after these three paths.
SaaS platform. Wins when: the agent is a voice receptionist or a generic intake bot, usage runs under 50 hours per month, CRM integration is webhook-grade, and there are no compliance constraints (no BAA, no FINRA supervision) that the vendor does not already contract around.
Cost shape: Vapi or Retell run $30K to $150K per year at common volumes. OpenAI Realtime API direct is cheaper if you build the orchestration yourself. You inherit observability, evaluation, and guardrails as your problem.
Hybrid. Wins when: the runtime fits a SaaS platform but the domain layer (custom CRM logic, multi-vertical routing, compliance-specific workflows, voice cloning) needs your own code wrapped around it. The most common 2026 pattern.
Cost shape: SaaS subscription plus $8K to $30K for the custom layer plus smaller monthly maintenance. Designed to migrate to full custom if usage scales past ~200 hours per month.
Cost ranges are 2026-indicative. Implementation specifics dominate the spread within each tier: compliance scope, scaling architecture, multi-agent orchestration, observability depth.
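The ~200-hour threshold falls out of simple arithmetic. A sketch with assumed rates; the per-minute SaaS price, infra cost, and fixed ops figure are placeholders, so substitute the quotes in front of you. The sketch compares run rate only; amortizing the build cost shifts the crossover later.

```python
def saas_monthly_usd(hours: float, per_min: float = 0.20) -> float:
    """SaaS platforms bill per minute of agent talk time (rate assumed)."""
    return hours * 60 * per_min

def custom_monthly_usd(hours: float, fixed_ops: float = 2000.0,
                       infra_per_min: float = 0.03) -> float:
    """Custom stacks carry fixed ops plus a much smaller per-minute infra cost."""
    return fixed_ops + hours * 60 * infra_per_min

# Break-even: 2000 = hours * 60 * (0.20 - 0.03)  ->  ~196 hours/month,
# which is where the ~200-hour rule of thumb comes from.
for hours in (50, 200, 400):
    print(f"{hours:>3} h/mo  SaaS ${saas_monthly_usd(hours):>5.0f}"
          f"  custom ${custom_monthly_usd(hours):>5.0f}")
#  50 h/mo  SaaS $  600  custom $ 2090   -> SaaS wins
# 200 h/mo  SaaS $ 2400  custom $ 2360   -> roughly break-even
# 400 h/mo  SaaS $ 4800  custom $ 2720   -> custom wins
```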
A production-grade custom AI agent typically costs $10,000 to $40,000 to build over 2 to 6 months, depending on whether voice, video, or multimodal capabilities are required. Ongoing operations run $4,000 to $15,000 per month.
Three to six weeks to MVP for a single-agent voice or video build. Two to six months for a full production multi-agent system with observability, guardrails, billing, and custom integrations.
Vapi and Retell handle the agent runtime well for generic voice use cases. They struggle with real-time video, custom turn-taking, and regulated-industry workflows. Use SaaS when usage is under 50 hours per month. Switch to custom when usage clears 200 hours per month or when you need anything Vapi and Retell cannot serve.
AI voice agents process audio only: speech in, speech out. AI video agents add a vision-language model that watches the video stream alongside the audio loop. Multimodal agents handle both, plus screen share and live data, with end-to-end latency typically between 1.2 and 3.5 seconds.
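One way to see where those seconds go: a per-stage budget for a single voice turn. The five stage names below are one plausible split of the runtime; the millisecond figures are illustrative assumptions, not benchmarks.

```python
# One plausible five-stage budget for a single voice turn (milliseconds).
# Figures are illustrative assumptions, not measurements.
VOICE_TURN_BUDGET_MS = {
    "capture_and_transport": 100,  # mic -> media gateway over WebRTC
    "asr_final_transcript":  300,  # streaming ASR endpointing delay
    "llm_first_token":       400,  # time to first token; tool calls add more
    "tts_first_audio":       200,  # streaming TTS startup
    "playback_transport":    100,  # gateway -> caller's speaker
}
assert sum(VOICE_TURN_BUDGET_MS.values()) == 1100  # inside the sub-1.2 s voice target

# A video or multimodal turn adds a vision-language pass in parallel with ASR;
# the slower leg gates the LLM, pushing totals toward the 1.2-3.5 s band.
```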
Real-time video is our primary specialty. We have shipped video AI agents across industrial surveillance (MindBox: 99.5%+ facial recognition, ANPR at 500,000+ vehicles per day), real-time interpretation (TransLinguist: 16+ languages, NHS National Framework winner), and live event translation (VOLO.live: 22,000+ participants at Black Hat 2025).
We build on both LiveKit and Agora. We run a dedicated LiveKit AI agent development practice and ship on Agora when the client already runs on it. We also build on custom WebRTC SFUs (mediasoup, Janus) when the client wants infrastructure portability. The client owns the IP in every case.
A multimodal AI agent processes more than one input modality (voice, video, screen share, live data) in parallel. The architecture runs ASR, a vision-language model, and an LLM concurrently, with shared session state, and emits responses across the same modalities. End-to-end latency typically runs 1.2 to 3.5 seconds.
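A minimal sketch of that concurrent fan-out. The three stage functions are hypothetical stand-ins for real ASR, VLM, and LLM clients; the shape that matters is the parallel gather plus the shared session dict.

```python
import asyncio

# Hypothetical stand-ins for real ASR, VLM, and LLM clients.
async def transcribe(audio: bytes) -> str:
    return "can you check this insurance card"

async def describe_frame(frame: bytes) -> str:
    return "a person holding an insurance card up to the camera"

async def generate(prompt: str) -> str:
    return f"(LLM reply grounded in: {prompt})"

async def multimodal_turn(audio_chunk: bytes, video_frame: bytes,
                          session_state: dict) -> str:
    """One multimodal turn: the audio and vision legs run concurrently,
    then the LLM consumes both plus shared session state."""
    transcript, scene = await asyncio.gather(
        transcribe(audio_chunk),      # ASR leg
        describe_frame(video_frame),  # VLM leg, in parallel with ASR
    )
    session_state["last_scene"] = scene  # shared state outlives the turn
    return await generate(f"User said: {transcript}. Camera shows: {scene}.")

print(asyncio.run(multimodal_turn(b"", b"", {})))
```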
The OpenAI Realtime API is enough for a single-agent prototype with strong in-house engineering, not for a production system. It gives you a low-latency speech-to-speech loop. It does not give you an agent runtime, function-call orchestration, observability, billing, multi-agent coordination, or evaluation. Those are your problem.
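Function-call orchestration is a concrete example of the gap. Below is a sketch of the dispatch layer you end up writing yourself, with an audit record at every call boundary; the tool name and record fields are hypothetical.

```python
import json
import time
from typing import Any, Callable

class ToolRegistry:
    """The function-call layer a speech-to-speech loop does not provide:
    named dispatch plus an audit record at every call boundary."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.audit_log: list[dict] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def dispatch(self, name: str, arguments_json: str) -> Any:
        entry = {"tool": name, "args": arguments_json, "ts": time.time()}
        try:
            result = self._tools[name](**json.loads(arguments_json))
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = f"error: {exc}"
            raise
        finally:
            self.audit_log.append(entry)  # every call leaves an audit trail

# Hypothetical scheduling tool wired into the registry.
registry = ToolRegistry()
registry.register("book_slot",
                  lambda patient_id, slot: f"booked {slot} for {patient_id}")
print(registry.dispatch("book_slot",
                        '{"patient_id": "p-142", "slot": "tue-10am"}'))
```

The same audit boundary is what carries the logging requirement in the healthcare answer below.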
AI agent development for healthcare requires HIPAA-compliant infrastructure, audit logging, and BAA-able vendor agreements. Most SaaS AI-agent platforms do not offer BAAs. Our default healthcare stack uses BAA-able cloud providers, data-residency controls, encrypted persistence, and full audit logs at the function-call layer.
Media: LiveKit, custom WebRTC (mediasoup, Janus), Agora. ASR: Deepgram, AssemblyAI, OpenAI Whisper. LLM: GPT-4o, Claude, Gemini, routed via a gateway. TTS: ElevenLabs, Cartesia, OpenAI. Vector store: Pinecone, Weaviate, pgvector. Observability: OpenTelemetry plus a custom turn-replay layer. Deployment: Kubernetes on AWS, GCP, or Azure.
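"Routed via a gateway" means a thin layer that picks a model per request and falls back when a provider fails. A minimal sketch; the client callables are stand-ins for real SDK calls.

```python
from typing import Callable

LLMClient = Callable[[str], str]  # prompt in, completion out (stand-in signature)

class LLMGateway:
    """Try providers in preference order; fall back on any failure."""

    def __init__(self, providers: list[tuple[str, LLMClient]]) -> None:
        self.providers = providers

    def complete(self, prompt: str) -> str:
        errors = []
        for name, client in self.providers:
            try:
                return client(prompt)
            except Exception as exc:  # timeout, rate limit, provider outage
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))

def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")

# Preference order encodes routing policy; the fallback absorbs outages.
gateway = LLMGateway([
    ("gpt-4o", flaky_primary),
    ("claude", lambda p: f"fallback reply to: {p}"),
])
print(gateway.complete("Summarize the caller's last turn."))
```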
Takeover engagements are a regular part of our AI agent work. We start with a code and architecture audit (free as the first step), produce a written fault analysis, and propose either a fix-in-place plan or a controlled rebuild on a parallel track.
Per-turn rubric scoring against a labeled dataset. End-to-end latency at p50, p95, and p99. Function-call success rate, hallucination flag rate, intent recognition, response groundedness, cost per session. Regressions trigger alerts. Evaluation runs continuously, not just at release.
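A sketch of the latency and function-call metrics over a batch of session logs. The record fields are assumptions about what your tracing layer emits; the regression gate at the end is the part that runs continuously.

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 end-to-end turn latency from observed samples."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def function_call_success_rate(turns: list[dict]) -> float:
    """Share of tool-using turns whose calls all succeeded.
    Assumes each turn record carries a `tool_calls` list with a `status`."""
    with_calls = [t for t in turns if t.get("tool_calls")]
    if not with_calls:
        return 1.0
    ok = sum(all(c["status"] == "ok" for c in t["tool_calls"])
             for t in with_calls)
    return ok / len(with_calls)

# Continuous regression gate: compare against the last release's baseline.
baseline_p95_ms = 1400.0
current = latency_percentiles([980, 1120, 1260, 1430, 1580] * 40)
if current["p95"] > baseline_p95_ms * 1.10:
    print(f"ALERT: p95 regressed to {current['p95']:.0f} ms")
```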
Each piece below extends one slice of this pillar: the transport foundation, the LiveKit agent runtime, the multimodal cross-cluster, the translation architecture, or the commercial paths to commissioning a build.
If you are scoping a real-time voice or video agent and want a second opinion on the stack, the latency budget, the compliance approach, or the build-versus-buy threshold, write us. A senior engineer with shipped agents in production replies within 24 hours.