AI agent development · 2026 guide

A practical guide to building production AI agents for real-time voice and video.

How real-time AI agents work end to end. The twelve components every production stack ships. The five-stage runtime that holds end-to-end latency to 1.2–3.5 seconds. The build-versus-buy thresholds that determine whether a custom agent or a SaaS platform is the right call. Written from the agents we have shipped in industrial surveillance, healthcare voice intake, real-time interpretation, and live event translation.

20+ years in real-time tech (since 2005) · 625+ projects shipped · 400+ clients · 100% Upwork score
Industry recognition · 2019–2024
Top WebRTC Developer · 2022 · category leader
Top Telecommunications Software Developer · 2024 · industry recognition
EASA Best Software Development Partner · 2019 · category winner
DesignRush Top App Development Companies · 2020 · platform recognition
Clutch · 5.0 / 5.0 across 30 reviews
Quick answer

An AI agent for real-time voice and video is a software service that processes audio and video frames simultaneously through ASR, a vision-language model, and an LLM, then streams a response back via TTS or video synthesis with 1.2 to 3.5 seconds of end-to-end latency.

Every production stack ships the same twelve components. Media gateway. ASR. LLM gateway. Function-call layer. TTS. Video synthesis (when the agent has a face). Persistent memory. Observability. Billing and metering. Guardrails. Evaluation. Knowledge base. What changes per project is which vendor fills each slot. The slots themselves do not change.
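
A rough sketch of how those fixed slots look in code: a typed configuration where only the vendor bindings change per project. The class and the example binding below are illustrative, not a prescribed API; the vendor names are ones that commonly fill each slot.

from dataclasses import dataclass

# One field per component. The slot names are fixed; the vendor strings are
# the only thing that changes from project to project.
@dataclass(frozen=True)
class AgentStack:
    media_gateway: str
    asr: str
    llm_gateway: str
    function_call_layer: str
    tts: str
    video_synthesis: str | None   # only when the agent has a face
    persistent_memory: str
    observability: str
    billing_metering: str
    guardrails: str
    evaluation: str
    knowledge_base: str

# Hypothetical binding for a voice-only healthcare intake agent.
healthcare_voice = AgentStack(
    media_gateway="LiveKit",
    asr="Deepgram",
    llm_gateway="custom router over GPT-4o / Claude",
    function_call_layer="in-house",
    tts="ElevenLabs",
    video_synthesis=None,
    persistent_memory="pgvector",
    observability="OpenTelemetry + turn replay",
    billing_metering="in-house",
    guardrails="in-house",
    evaluation="rubric scoring pipeline",
    knowledge_base="Pinecone",
)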

Build custom when you need real-time video, custom turn-taking, regulated-industry compliance (HIPAA, SOC 2, GDPR, FINRA), brand voice cloning, or the agent embedded inside your own product. Use SaaS (Vapi, Retell, OpenAI Realtime API) when the agent is a generic voice receptionist running under 50 hours a month. Hybrid is the most common 2026 pattern: SaaS for the runtime, custom for the domain layer. The decision threshold is usage volume plus compliance scope.

Verticals and topics covered in this guide

Voice agents, video agents, multimodal agents. The verticals each one fits.

Healthcare (HIPAA), Legal triage, Financial (FINRA), EdTech, Industrial surveillance, Live event translation, Customer support intake, Multi-agent orchestration, Custom turn-taking

Four shapes of AI agent dominate the 2026 landscape. Each gets a different stack, a different latency budget, and a different failure mode.

How AI agents work in production

Five stages. Every production agent has them.

Real-time AI agents process audio and video frames simultaneously, run them through ASR + VLM + LLM in parallel, and stream the response back via TTS or video synthesis — typically with 1.2–3.5 seconds of end-to-end latency.

Stage 01 — Join

Real-time session join

A WebRTC media gateway opens the call: ICE candidates resolve, STUN/TURN servers negotiate NAT traversal, codec capabilities are exchanged, and the room is established. SIP fallback covers PSTN telephony. Tuned signaling, careful TURN-server placement, and codec selection keep the join fast and reliable.

Budget: Under 800 ms to join
Failure mode: Cold-start workers add 1.5–3 s. Fix: a warm pool of 2–3 idle workers per region.

End-to-end target: 1.2–3.5 s for production-grade multimodal agents. Stages 01–04 stay on the critical path; stage 05 is fully async.
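
A minimal sketch of how that budget can be watched per stage at runtime. Only the 800 ms join budget comes from the Stage 01 card above; the other splits are illustrative assumptions, and the helper is not part of any vendor SDK.

import time
from contextlib import contextmanager

# Critical-path budgets in milliseconds. Stage 05 is async and deliberately
# absent from this table.
STAGE_BUDGETS_MS = {
    "01_join": 800,       # from the Stage 01 card
    "02_capture": 300,    # illustrative split of the remaining budget
    "03_reason": 1500,    # illustrative
    "04_respond": 900,    # illustrative
}

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Time one critical-path stage and flag it when it blows its budget.
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        timings[name] = elapsed_ms
        if elapsed_ms > STAGE_BUDGETS_MS[name]:
            print(f"{name} over budget: {elapsed_ms:.0f} ms")

# Inside the agent loop (run_llm is a hypothetical call):
# with stage("03_reason"):
#     reply = run_llm(turn)
# sum(timings.values()) should land inside the 1,200-3,500 ms target.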

Types of AI agents in production

Six shapes that ship in real production.

Each one has a different latency budget, observability priority, and failure mode. The shape of the agent determines which of the twelve components matter most.

System architecture

The 12 components every production AI agent needs.

Skipping any of them is technical debt that surfaces inside the first month of production traffic. The architecture is independent of vendor — what changes per project is which media gateway, which LLM, which TTS. The 12 components do not change.

Critical path · Component 01

Media gateway

The transport layer for every real-time call. A WebRTC SFU (Selective Forwarding Unit) accepts client streams, handles ICE/STUN/TURN, manages bandwidth, and routes media to the agent's audio and video processors. SIP fallback wires telephony in.

Common vendors: LiveKit (Cloud or self-host), mediasoup, Janus, Pion, Agora, Daily
Connects to: Capture (02) on input · Response (TTS / video synthesis) on output
When mandatory: Always. Every real-time agent needs a media gateway.
Failure mode: Cold-start workers add 1.5–3 s to join. Fix: a warm pool of 2–3 idle workers per region.

Six components on the critical path · five run async · one is optional. The synchronous chain is the latency budget. The async chain is the observability and learning loop. Skip any of them and the agent ships fragile.
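
The cold-start failure mode named in the media gateway card has a standard mitigation: keep a few pre-warmed agent workers per region and hand one over at join time. A minimal asyncio sketch, assuming a hypothetical spawn_worker() coroutine that does the slow model loading and SFU connection.

import asyncio

POOL_TARGET = 3  # idle workers kept warm per region, per the fix above

class WarmPool:
    def __init__(self, spawn_worker):
        self._spawn = spawn_worker           # slow: loads models, dials the SFU
        self._idle: asyncio.Queue = asyncio.Queue()

    async def fill(self):
        # Top the pool back up to POOL_TARGET idle workers.
        while self._idle.qsize() < POOL_TARGET:
            await self._idle.put(await self._spawn())

    async def acquire(self):
        # Hand a pre-warmed worker to an incoming call, then refill in the background.
        worker = await self._idle.get()       # near-instant while the pool is warm
        asyncio.create_task(self.fill())
        return worker

# At deploy time:  pool = WarmPool(spawn_worker); await pool.fill()
# At call join:    worker = await pool.acquire()   # sidesteps the 1.5-3 s cold start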

Production examples

Four shipped systems and the architectural choices that made them work.

Industrial multimodal AI, healthcare voice intake, real-time interpretation, live event translation. Four production builds currently running.

Decision framework

Build custom, buy SaaS, or hybrid. When each one wins.

Three architectural paths for shipping an AI agent. None is universally correct. The right choice is a function of usage volume, compliance constraints, customization depth, and whether the agent has to live inside your own product.

Cost ranges are indicative for 2026. Implementation specifics dominate the spread within each tier: compliance scope, scaling architecture, multi-agent orchestration, observability depth.

FAQ

Twelve questions every architecture review covers.

How much does it cost to build a custom AI agent?

A production-grade custom AI agent typically costs $10,000 to $40,000 to build over 2 to 6 months, depending on whether voice, video, or multimodal capabilities are required. Ongoing operations run $4,000 to $15,000 per month.

How long does it take to build a production-grade AI agent?

Three to six weeks to MVP for a single-agent voice or video build. Two to six months for a full production multi-agent system with observability, guardrails, billing, and custom integrations.

Should I build my own AI agent or use Vapi/Retell?

Vapi and Retell handle the agent runtime well for generic voice use cases. They struggle with real-time video, custom turn-taking, and regulated-industry workflows. Use SaaS when usage is under 50 hours per month. Switch to custom when usage clears 200 hours per month or when you need anything Vapi and Retell cannot serve.

What's the difference between AI voice agents and AI video agents?

AI voice agents process audio only: speech in, speech out. AI video agents add a vision-language model that watches the video stream alongside the audio loop. Multimodal agents handle both, plus screen share and live data, with end-to-end latency typically between 1.2 and 3.5 seconds.

Can you build AI agents that handle real-time video?

Yes. Real-time video is our primary specialty. We have shipped video AI agents across industrial surveillance (MindBox: 99.5%+ facial recognition, ANPR at 500,000+ vehicles per day), real-time interpretation (TransLinguist: 16+ languages, NHS National Framework winner), and live event translation (VOLO.live: 22,000+ participants at Black Hat 2025).

Do you work with LiveKit, Agora, or both?

Both. We run a dedicated LiveKit AI agent development practice and we ship on Agora when the client already runs on it. We also build on custom WebRTC SFUs (mediasoup, Janus) when the client wants infrastructure portability. The client owns the IP in every case.

What's a multimodal AI agent?

A multimodal AI agent processes more than one input modality (voice, video, screen share, live data) in parallel. The architecture runs ASR, a vision-language model, and an LLM concurrently, with shared session state, and emits responses across the same modalities. End-to-end latency typically 1.8 to 4.5 seconds.
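
A minimal sketch of that concurrency pattern. The three model calls are stand-ins (each would wrap a streaming vendor SDK in a real build); only the shape of the fan-out is the point.

import asyncio
from dataclasses import dataclass, field

@dataclass
class SessionState:
    # Shared per-call state that every model path reads and writes.
    transcript: list[str] = field(default_factory=list)
    scene_notes: list[str] = field(default_factory=list)

async def run_asr(audio_chunk: bytes, state: SessionState) -> None:
    # Stand-in: stream the chunk to the ASR vendor, append partial transcripts.
    state.transcript.append("<partial transcript>")

async def run_vlm(frame: bytes, state: SessionState) -> None:
    # Stand-in: describe the current video frame.
    state.scene_notes.append("<scene description>")

async def plan_reply(state: SessionState) -> str:
    # Stand-in: LLM call over transcript + scene notes + retrieved knowledge.
    return f"reply grounded in {len(state.transcript)} transcript segments"

async def handle_turn(audio_chunk: bytes, frame: bytes, state: SessionState) -> str:
    # ASR and VLM are independent, so they run concurrently; the LLM waits on
    # both because it reasons over their combined output.
    await asyncio.gather(run_asr(audio_chunk, state), run_vlm(frame, state))
    return await plan_reply(state)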

Is OpenAI Realtime API enough on its own?

For a single-agent prototype with strong in-house engineering, yes. For a production system, no. The Realtime API gives you a low-latency speech-to-speech loop. It does not give you an agent runtime, function-call orchestration, observability, billing, multi-agent coordination, or evaluation. Those are your problem.
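
To make that gap concrete, here is a sketch of the thinnest runtime layer that ends up being written around any speech-to-speech loop: a tool registry, dispatch, and per-turn metrics. Everything below is a hypothetical illustration, not OpenAI's API; the speech loop itself is assumed to call on_function_call when the model requests a tool.

import time
from typing import Any, Awaitable, Callable

# What the speech-to-speech loop does not provide: a tool registry,
# dispatch, and somewhere to record how each call went.
TOOLS: dict[str, Callable[..., Awaitable[Any]]] = {}

def tool(name: str):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("book_appointment")
async def book_appointment(patient_id: str, slot: str) -> str:
    return f"booked {slot} for {patient_id}"   # stand-in for a real backend call

async def on_function_call(name: str, args: dict, metrics: list[dict]) -> Any:
    # Dispatch a model-requested tool call and record the outcome and latency.
    start = time.monotonic()
    try:
        result = await TOOLS[name](**args)
        ok = True
    except Exception:
        result, ok = None, False
    metrics.append({"tool": name, "ok": ok,
                    "latency_ms": (time.monotonic() - start) * 1000})
    return result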

How do you handle HIPAA / SOC 2 / GDPR compliance for AI agents?

AI agent development for healthcare requires HIPAA-compliant infrastructure, audit logging, and vendors willing to sign a BAA (business associate agreement). Most SaaS AI-agent platforms do not offer BAAs. Our default healthcare stack uses cloud providers that sign BAAs, data-residency controls, encrypted persistence, and full audit logs at the function-call layer.
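
A minimal sketch of what audit logging at the function-call layer means in practice: one append-only record per tool call, with PHI-bearing fields hashed before they hit the log. Field names and the log path are illustrative assumptions.

import hashlib
import json
import time

PHI_FIELDS = {"patient_name", "dob", "ssn"}   # illustrative field list

def redact(args: dict) -> dict:
    # Hash PHI-bearing fields so the trail stays useful without storing PHI.
    return {
        k: hashlib.sha256(str(v).encode()).hexdigest()[:12] if k in PHI_FIELDS else v
        for k, v in args.items()
    }

def audit(call_id: str, tool: str, args: dict, outcome: str,
          log_path: str = "audit.log") -> None:
    # Append one immutable record per function call.
    record = {"ts": time.time(), "call_id": call_id, "tool": tool,
              "args": redact(args), "outcome": outcome}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# audit("call-123", "schedule_followup",
#       {"patient_name": "Jane Doe", "slot": "2026-03-01T09:00"}, "ok")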

What stack do you use to build AI agents?

Media: LiveKit, custom WebRTC (mediasoup, Janus), Agora. ASR: Deepgram, AssemblyAI, OpenAI Whisper. LLM: GPT-4o, Claude, Gemini, routed via a gateway. TTS: ElevenLabs, Cartesia, OpenAI. Vector store: Pinecone, Weaviate, pgvector. Observability: OpenTelemetry plus a custom turn-replay layer. Deployment: Kubernetes on AWS, GCP, or Azure.
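
For the "routed via a gateway" part, a minimal sketch of the routing idea. The backend callables are stand-ins for async wrappers around the vendor SDKs, not real SDK signatures.

from typing import Awaitable, Callable

LLMCall = Callable[[str], Awaitable[str]]

class LLMGateway:
    # Route each request down a preference order and fall through on failure.
    def __init__(self, backends: dict[str, LLMCall], order: list[str]):
        self.backends = backends
        self.order = order            # e.g. driven by cost, latency, or compliance

    async def complete(self, prompt: str) -> str:
        last_error: Exception | None = None
        for name in self.order:
            try:
                return await self.backends[name](prompt)
            except Exception as err:          # timeout, rate limit, outage
                last_error = err
        raise RuntimeError("all LLM backends failed") from last_error

# gateway = LLMGateway(
#     backends={"gpt-4o": call_openai, "claude": call_anthropic, "gemini": call_gemini},
#     order=["gpt-4o", "claude", "gemini"],
# )   # call_* are hypothetical wrappers, not vendor SDK calls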

Can you take over an AI agent project that's stalled?

Yes. Takeover engagements are a regular part of our AI agent work. We start with a code and architecture audit (free as the first step), produce a written fault analysis, and propose either a fix-in-place plan or a controlled rebuild on a parallel track.

How do you measure AI agent quality and accuracy?

Per-turn rubric scoring against a labeled dataset. End-to-end latency at p50, p95, and p99. Function-call success rate, hallucination flag rate, intent recognition, response groundedness, cost per session. Regressions trigger alerts. Evaluation runs continuously, not just at release.
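
A sketch of the latency and function-call side of that measurement, assuming per-turn latencies and call outcomes have already been collected from the observability layer.

import statistics

def latency_report(latencies_ms: list[float]) -> dict[str, float]:
    # p50 / p95 / p99 over per-turn end-to-end latencies.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def call_success_rate(turns: list[dict]) -> float:
    # Share of turns whose function calls all succeeded.
    with_calls = [t for t in turns if t.get("function_calls")]
    if not with_calls:
        return 1.0
    ok = sum(all(c["ok"] for c in t["function_calls"]) for t in with_calls)
    return ok / len(with_calls)

# Alert when latency_report(samples)["p95"] drifts past the 3,500 ms budget
# or call_success_rate(turns) drops between releases.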

Where this guide goes deeper

Connected guides and references.

Each piece below extends one slice of this pillar: the transport foundation, the LiveKit agent runtime, the multimodal cross-cluster, the translation architecture, or the commercial paths to commissioning a build.

Have a specific architectural question?

Engineer-to-engineer review on the first call.

If you are scoping a real-time voice or video agent and want a second opinion on the stack, the latency budget, the compliance approach, or the build-versus-buy threshold, write us. A senior engineer with shipped agents in production replies within 24 hours.

+1 (914) 775-5855
New York · USA
Specialist software house for video, real-time and AI products. Founded 2005.
50 in-house engineers.
Describe your project and we will get in touch