How LiveKit AI agents actually work end-to-end at production scale. The five-stage runtime that every voice and video agent runs. The AgentSession primitive that unified the v0.x abstractions. The plugin ecosystem (Deepgram, AssemblyAI, Whisper · GPT-4o, Claude, Gemini · ElevenLabs, Cartesia, Azure) and the latency contributions of each layer. The Cloud-vs-self-host economics that cross over at ~10,000 minutes per month. Written from the agents we have shipped — Translinguist (16+ language pairs), VOLO.live (22K participants at Black Hat 2025), MindBox (industrial vision + voice).
LiveKit is two layers in one open-source platform. livekit-server is a WebRTC SFU (Apache 2.0, Go) — same category as mediasoup, Janus, Pion. livekit-agents is the agent framework that runs on top: an agent worker joins a LiveKit room as a participant, runs streaming STT, calls an LLM, and publishes streaming TTS back into the room.
LiveKit Agents went 1.0 in April 2025; as of April 2026 the Python framework sits at 1.5.x with adaptive interruption handling and native Model Context Protocol (MCP) tool support. LiveKit Cloud is the managed hosting option for both layers; self-hosted on Kubernetes is the alternative. A production LiveKit AI agent has five stages: session join, media capture and processing, AI reasoning loop (STT → LLM → tool-calls → TTS), response generation and output, and continuous context.
End-to-end latency in 2026: sub-500 ms is achievable with disciplined streaming at every stage, but the industry-published median sits at 1.4–1.7 seconds with P99 at 3–5 seconds. The biggest single lever is LLM time-to-first-token. The biggest architectural decision is Cascaded (STT → LLM → TTS) vs Speech-to-Speech (OpenAI Realtime, Gemini Live) — most production agents in 2026 default to Cascaded for tool-calling reliability and observability.
A custom LiveKit AI agent costs $5K–$80K to build over 1–4 months. LiveKit Cloud at production scale runs $0.05–$0.18 blended per minute (~40× cheaper than human BPO). Above 10,000 minutes per month the framework path undercuts managed Vapi / Retell by 60–80%; below 500 minutes per month managed wins. Above 100,000 minutes per month, self-hosted LiveKit on Kubernetes typically wins on TCO within six months.
Four shapes of LiveKit agent dominate the 2026 landscape. Each gets a different runtime configuration, a different plugin stack, and a different deployment shape.
Cascaded STT → LLM → TTS pipeline. Streaming Deepgram + GPT-4o-mini + Cartesia is the 2026 default. End-to-end latency 450–950 ms achievable; production median 1.4–1.7 s. Common use cases: tier-1 support, appointment booking, outbound qualification.
LiveKit’s agent runtime treats video as a first-class media stream. Frames sampled at 1–4 fps into GPT-4o vision, Claude 3.5 Sonnet, or Gemini 2.5. Production examples: skin / wound assessment, industrial site monitoring, instructional review.
The full pattern: voice in, vision-language reasoning, screen-share understanding, tool-calls into business systems, structured response back as voice. End-to-end latency 2.0–4.5 s at production quality.
LiveKit SIP and Phone Numbers shipped GA in 2025. A LiveKit agent can now answer a call from any phone with ~4 lines of configuration. PSTN inbound, outbound INVITE, DTMF, and SIP REFER are all first-class.
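To make the cascaded STT → LLM → TTS shape above concrete, here is a minimal sketch of the control flow using only the Python standard library. Every name in it (fake_stt, fake_llm, fake_tts, microphone, run_cascade) is a hypothetical stand-in, not the livekit-agents API; a real agent would plug streaming provider plugins into the same three slots.

```python
import asyncio
from typing import AsyncIterator, List


async def microphone(utterances: List[str]) -> AsyncIterator[bytes]:
    """Hypothetical audio source: each utterance arrives as one 'audio' chunk."""
    for text in utterances:
        yield text.encode("utf-8")


async def fake_stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in for streaming STT: emits one final transcript per chunk."""
    async for chunk in audio:
        yield chunk.decode("utf-8")


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields the reply one token at a time."""
    for token in f"You said: {prompt}".split():
        await asyncio.sleep(0)  # hand control back, as a network stream would
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: converts each token as soon as it arrives."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_cascade(utterances: List[str]) -> List[bytes]:
    """Chain STT -> LLM -> TTS and collect the frames that would be published."""
    published: List[bytes] = []
    async for transcript in fake_stt(microphone(utterances)):
        # TTS consumes LLM tokens as they stream, without waiting for the full reply.
        async for frame in fake_tts(fake_llm(transcript)):
            published.append(frame)
    return published


def run_cascade_sync(utterances: List[str]) -> List[bytes]:
    return asyncio.run(run_cascade(utterances))
```

The point of the sketch is the nesting: each stage is an async stream feeding the next, which is what lets the first audio frame leave before the LLM has finished generating.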
An AI sales-coach analyzing customer behavior in real time. An AI stylist running on-device on iPhone. Real-time speech-to-speech translation across 75+ languages. AI-powered video management at 99.5%+ facial-recognition accuracy. Four production builds running today — each a different shape of AI agent on top of real-time media.
Swedish AI sales video platform that analyzes customer behavior during live presentations, identifies speech strengths and weaknesses, and generates post-meeting reports. Integrates with Google Meet, MS Teams, Zoom. Captures conversations across video, phone, email, chat. Outcomes: up to 25% lift in close rates, 30× coaching efficiency, 80–100% CRM data-entry automation. SEK 21M (~$2.25M) seed in 2025.
Personalized AI stylist on iOS in the $4.3B AI-wardrobe market (projected $13.5B by 2033). Custom-trained YOLOv8m for garment object detection plus CLIP for semantic understanding, all running on-device via TensorFlow Lite. Recognizes garment type, season, color, fabric, sleeve length, neckline. Generates outfit suggestions based on wardrobe + weather + occasion. No round-trip to the cloud.
Video conferencing platform for professional interpreters — marketplace of 30,000+ certified interpreters, estimated $4.2M annual revenue. AI speech-to-speech translation in 16+ languages with closed captioning in 22 languages. Won the NHS (NOE CPC) national framework across NHS, councils, schools, police, fire/rescue. Clients report 50% cost savings, 80% reduction in interpreting costs, 2× ROI in two years.
AI-powered intelligent video management system — 50+ deployments across transport, pharma, and gated communities since 2020. Facial recognition at 99.5%+ accuracy beats Google + Facebook benchmarks, with anti-spoofing resistant to photo/video attacks. ANPR module captures license plates of 500,000+ vehicles daily across India at ~95% accuracy. Real-time anomaly alerts trigger automatic recording on intrusion, fire, or crowd buildup.
Three architectural paths for shipping a real-time AI voice or video agent. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether the platform is the product or supports the product.
Wins when: real-time video on the agent, LLM provider flexibility, self-host option for compliance, usage above 10K minutes/month, custom turn-taking / voice cloning / RAG, brand-embedded experience, multi-tenant SaaS plays.
Cost shape: $5K–$80K build over 1–4 months. $1K–$5K monthly operations. LiveKit Cloud $0.05–$0.18 per minute (or self-host above 100K min/mo for 30–50% savings). Archetypes: Mindwibe, Translinguist, VOLO.live, MindBox.
Wins when: under 10K minutes/month, standard call patterns, no compliance pressure, no in-house engineering capacity. Vapi for developer-first teams (~2 hours to first call). Retell for visual-builder non-technical teams. Bland for outbound-heavy TCPA workflows.
Cost shape: $0.05–$0.13 per minute (Vapi list) or $0.23–$0.50 BYOK. No upfront build cost. Vendor lock-in.
Wins when: validating voice UX before committing to a framework, running staging on Vapi / Retell while building production on LiveKit, or half the workload fits managed (the voicemail leg) while the other half needs LiveKit (the live agent leg).
Pattern: start on Vapi / Retell for 4–8 weeks to validate; build the LiveKit production version once usage clears 10K minutes/month; cut over with a portability layer so prompt and tools transfer.
Cost ranges are 2026-indicative. Implementation specifics — concurrency target, compliance scope, multimodal vs voice-only, tool-call surface, multi-region cascade — dominate the spread within each tier.
A custom LiveKit AI agent costs $5K–$80K to build over 1–4 months. The break-even point vs managed Vapi / Retell sits around 10,000 minutes per month — below, managed ships faster; above, LiveKit undercuts on per-minute economics by 60–80%.
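The break-even reasoning can be checked with simple arithmetic. The sketch below is illustrative only: the managed rate ($0.30/min), framework rate ($0.10/min), operations cost ($2,000/month), and build cost ($30,000) are assumed inputs chosen to sit inside the ranges quoted in this section, not measured figures.

```python
from typing import Optional


def monthly_cost_managed(minutes: int, rate_per_min: float) -> float:
    """Managed platform: pay per minute, no fixed monthly operations cost."""
    return minutes * rate_per_min


def monthly_cost_framework(minutes: int, rate_per_min: float, ops_per_month: float) -> float:
    """Framework path: lower per-minute rate plus a fixed operations cost."""
    return minutes * rate_per_min + ops_per_month


def payback_months(
    build_cost: float,
    minutes: int,
    managed_rate: float,
    framework_rate: float,
    ops_per_month: float,
) -> Optional[float]:
    """Months needed for monthly savings to repay the build cost.

    Returns None when the framework path is not cheaper at this volume.
    """
    saving = monthly_cost_managed(minutes, managed_rate) - monthly_cost_framework(
        minutes, framework_rate, ops_per_month
    )
    if saving <= 0:
        return None
    return build_cost / saving


# Assumed inputs (hypothetical, for illustration only).
example = payback_months(
    build_cost=30_000,
    minutes=50_000,
    managed_rate=0.30,
    framework_rate=0.10,
    ops_per_month=2_000,
)
```

With these assumed inputs the monthly costs are equal at 10,000 minutes (2,000 / (0.30 - 0.10)), and at 50,000 minutes per month the build cost is repaid in a little under four months. Different rates move the crossover, which is why the spread within each tier matters.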
LiveKit is an open-source WebRTC SFU paired with an agent framework (livekit-agents) for building real-time voice and video AI agents. The SFU handles media transport. The agent framework runs as a participant in the room, processes audio and video through STT, calls an LLM, and publishes responses via TTS. LiveKit Cloud is the managed hosting; self-host on Kubernetes is the alternative.
Five stages: (1) the agent worker joins the LiveKit room, (2) it captures audio (chunked at 20-40 ms) and video (1-4 fps) from participants, (3) it runs streaming STT, calls the LLM, and orchestrates function-calls, (4) it generates a response via streaming TTS and publishes back into the room, (5) it writes session state to memory. End-to-end latency is typically 1.2-2.5 seconds for voice-only, 1.8-4.5 seconds for multimodal.
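The 20-40 ms audio chunking in stage 2 translates into concrete buffer sizes. A small worked example follows, assuming 16-bit mono PCM; the 48 kHz and 16 kHz sample rates are assumptions for illustration, since the text does not specify one.

```python
def samples_per_chunk(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of audio samples in one chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000


def bytes_per_chunk(
    sample_rate_hz: int,
    chunk_ms: int,
    channels: int = 1,
    bytes_per_sample: int = 2,  # 16-bit PCM (assumption)
) -> int:
    """Size in bytes of one uncompressed PCM chunk."""
    return samples_per_chunk(sample_rate_hz, chunk_ms) * channels * bytes_per_sample


def chunks_per_second(chunk_ms: int) -> int:
    """How many chunks the agent handles per second of audio."""
    return 1000 // chunk_ms
```

At an assumed 48 kHz, a 20 ms chunk is 960 samples (1,920 bytes) and the agent handles 50 chunks per second; at 40 ms the chunk doubles and the rate halves to 25 per second.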
OpenAI Realtime API is voice-only, GPT-4o-locked, and simpler to set up (2-4 weeks). LiveKit is multimodal-capable, LLM-provider-flexible, and can self-host (4-8 weeks setup). Use OpenAI direct for a single voice agent on a strong in-house team. Use LiveKit when you need real-time video, custom turn-taking, multiple LLM providers, or self-host for compliance.
Use LiveKit Cloud below 5,000 concurrent users when there are no compliance constraints. Self-host above 5,000 concurrent users, when HIPAA / SOC2 / GDPR / FedRAMP applies, or when you need custom server-side routing. Self-host also wins on cost above ~$5K/month of Cloud spend. Design the integration layer to be portable so the agent code does not change between Cloud and self-host.
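The thresholds above can be written down as a small decision helper. This is a sketch of the rule of thumb, not a sizing tool: the Deployment class and choose_hosting function are hypothetical names, and treating exactly 5,000 users as self-host is an assumption, since the text leaves that boundary open.

```python
from dataclasses import dataclass
from typing import List, Tuple

COMPLIANCE_TRIGGERS = frozenset({"HIPAA", "SOC2", "GDPR", "FedRAMP"})
MAX_CLOUD_CONCURRENT_USERS = 5_000   # from the text
MAX_CLOUD_SPEND_USD = 5_000          # ~$5K/month of Cloud spend, from the text


@dataclass(frozen=True)
class Deployment:
    concurrent_users: int
    compliance: frozenset = frozenset()
    cloud_spend_usd_per_month: float = 0.0
    needs_custom_routing: bool = False


def choose_hosting(d: Deployment) -> Tuple[str, List[str]]:
    """Return ('cloud' | 'self-host', reasons that pushed toward self-host)."""
    reasons: List[str] = []
    if d.concurrent_users >= MAX_CLOUD_CONCURRENT_USERS:  # boundary is an assumption
        reasons.append("concurrency")
    if d.compliance & COMPLIANCE_TRIGGERS:
        reasons.append("compliance")
    if d.needs_custom_routing:
        reasons.append("custom routing")
    if d.cloud_spend_usd_per_month > MAX_CLOUD_SPEND_USD:
        reasons.append("cloud spend")
    return ("self-host" if reasons else "cloud", reasons)
```

Returning the reasons alongside the verdict keeps the decision reviewable when one of the inputs changes later.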
A custom LiveKit AI agent costs $30K-$200K to build over 8-20 weeks. A 2-week Architecture Sprint at $15K-$25K produces a fixed-bid quote for the build. LiveKit Cloud usage at production scale runs $0.40-$2.00 per agent-hour plus per-participant-minute. Ongoing operations run $8K-$25K per month.
1.2-2.5 seconds median (P50) for a voice-only agent with streaming STT (Deepgram), GPT-4o, and ElevenLabs Flash for TTS. 1.8-4.5 seconds for a multimodal agent that adds vision-language model processing. Above 3.5 seconds, user engagement materially degrades. The largest single latency component is usually the LLM first-token latency (400-1,000 ms).
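A per-stage latency budget makes these numbers easier to reason about. In the sketch below only the LLM first-token value (inside the 400-1,000 ms range) and the 3.5-second engagement threshold come from the text; the other stage values and all names are assumptions for illustration.

```python
from typing import Dict

ENGAGEMENT_LIMIT_MS = 3_500  # above this, engagement degrades (from the text)


def summarize_budget(budget_ms: Dict[str, int]) -> Dict[str, object]:
    """Total a per-stage latency budget and report the dominant stage."""
    total = sum(budget_ms.values())
    dominant = max(budget_ms, key=lambda stage: budget_ms[stage])
    return {
        "total_ms": total,
        "dominant_stage": dominant,
        "over_limit": total > ENGAGEMENT_LIMIT_MS,
    }


# Hypothetical voice-only budget; only llm_first_token is grounded in the text.
voice_budget = {
    "network_and_capture": 100,
    "stt_final_transcript": 300,
    "llm_first_token": 700,
    "tts_first_audio": 250,
}
summary = summarize_budget(voice_budget)
```

Under these assumed values the total is 1,350 ms and the LLM first token is the dominant stage, which matches the observation that it is usually the largest single component.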
Yes, natively. Unlike OpenAI Realtime API (voice-only) or Vapi/Retell (limited video), LiveKit’s agent runtime treats video as a first-class media stream. Video frames sample at 1-4 fps into a vision-language model (GPT-4o vision, Claude 3.5 Sonnet, Gemini). Used in production for surveillance, healthcare, education, and live events.
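Sampling a camera stream down to 1-4 fps before it reaches the vision-language model can be sketched as a timestamp filter. The function name and the 25 fps source rate are assumptions for illustration; this is not how the LiveKit runtime necessarily implements it.

```python
from typing import Iterable, List


def sample_frames(timestamps_ms: Iterable[int], target_fps: float) -> List[int]:
    """Return the indices of frames to forward at roughly target_fps.

    A frame is kept when its timestamp reaches the next due time. Due times
    advance on a fixed grid so the output rate does not drift.
    """
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    interval_ms = 1000.0 / target_fps
    kept: List[int] = []
    next_due = None
    for index, ts in enumerate(timestamps_ms):
        if next_due is None:
            next_due = float(ts)  # the first frame is always forwarded
        if ts >= next_due:
            kept.append(index)
            # Skip past any due times missed during a stall in the stream.
            while next_due <= ts:
                next_due += interval_ms
    return kept


# Two seconds of an assumed 25 fps stream: one frame every 40 ms.
two_seconds = [i * 40 for i in range(50)]
```

Sampling the two-second example at 2 fps forwards 4 of the 50 frames, and at 4 fps it forwards 8, so the vision model sees a small fraction of the raw stream.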
You usually do not. Vapi and Retell are managed agent platforms with their own runtime; LiveKit is a framework for building your own agent. The pattern that does work: use LiveKit as the media transport layer and call Vapi or Retell as a function-call destination from inside a LiveKit agent, when you want to delegate one specific use case to a vendor.
Yes. LiveKit runs in production at hundreds of customers, including agent platforms, video-call apps, and developer tools. Fora Soft has shipped 30+ AI agents on LiveKit. The framework gets monthly releases; production deployments pin to specific versions and upgrade quarterly.
Three tactics. (1) Self-host LiveKit on Kubernetes with autoscaling on the SFU and agent-worker pools. (2) Maintain a warm pool of 2-3 idle agent workers per region to absorb cold-start latency. (3) Cascade SFUs across regions for users in different geographies. On the media side, capacity scales linearly with bandwidth at each SFU node; on the AI side, it scales with the number of agent workers.
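The warm-pool tactic reduces to a sizing rule an autoscaler can evaluate per region. A minimal sketch follows; the sessions_per_worker capacity is a hypothetical parameter that depends on the plugin stack and hardware, and the function name is illustrative.

```python
import math


def desired_workers(active_sessions: int, sessions_per_worker: int, warm_idle: int = 2) -> int:
    """Workers needed for current load plus a warm pool of idle workers.

    warm_idle defaults to 2, the low end of the 2-3 idle workers per region
    suggested in the text. sessions_per_worker is an assumed capacity figure.
    """
    if sessions_per_worker <= 0:
        raise ValueError("sessions_per_worker must be positive")
    busy = math.ceil(active_sessions / sessions_per_worker)
    return busy + warm_idle
```

For example, 45 active sessions at an assumed 10 sessions per worker needs 5 busy workers plus 2 warm ones, so 7 in total; an idle region still keeps its warm pool.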
Default 2026: LiveKit (Cloud or self-host) for media, Deepgram for streaming STT, GPT-4o for LLM, ElevenLabs Flash for TTS, Pinecone for vector memory, Postgres for structured memory, OpenTelemetry for observability, Kubernetes for deployment. Replace Deepgram with Whisper when compliance constraints apply. Replace ElevenLabs with Cartesia when first-token TTS latency below 200 ms is a hard requirement.
Self-host LiveKit on infrastructure covered by a BAA (AWS, GCP, and Azure all offer BAAs). Use an LLM provider that will sign a BAA (OpenAI Enterprise, Azure OpenAI, AWS Bedrock). Use an STT provider with a HIPAA tier (Deepgram and AssemblyAI both offer one). Use a TTS provider with HIPAA coverage (ElevenLabs offers it on enterprise plans). Add full audit logging at the function-call layer, encrypted memory persistence, RBAC, and automatic session termination.
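Audit logging at the function-call layer can be sketched as a decorator around each tool the agent may call. Everything here is illustrative: the names audited, AUDIT_LOG, and lookup_appointment are hypothetical, the in-memory list stands in for an append-only encrypted store, and recording argument names but not values is an assumed design choice to keep protected health information out of the log.

```python
import functools
import json
import time
from typing import Any, Callable, List

AUDIT_LOG: List[str] = []  # stand-in for an append-only, encrypted audit store


def audited(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Record every tool call with its session, tool name, and outcome."""

    @functools.wraps(tool)
    def wrapper(*args: Any, session_id: str, **kwargs: Any) -> Any:
        entry = {
            "ts": time.time(),
            "session_id": session_id,
            "tool": tool.__name__,
            "arg_names": sorted(kwargs),  # names only, never values
        }
        try:
            result = tool(*args, **kwargs)
            entry["outcome"] = "ok"
            return result
        except Exception as exc:
            entry["outcome"] = f"error:{type(exc).__name__}"
            raise
        finally:
            # Written on success and on failure, so no call goes unrecorded.
            AUDIT_LOG.append(json.dumps(entry))

    return wrapper


@audited
def lookup_appointment(patient_id: str) -> str:
    """Hypothetical tool the agent can call."""
    return f"appointment for {patient_id}"
```

Calling `lookup_appointment(patient_id="p-1", session_id="s-1")` returns the tool result and appends one JSON entry to the log, whether the tool succeeds or raises.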
Each piece below extends one slice of this pillar — the WebRTC transport layer, the multimodal cross-cluster, the speech-translation specialization, the commercial path to commissioning a build, or the Fora Soft blog deep-dives.
If you are scoping a LiveKit AI agent and want a second opinion on cascaded-vs-S2S, the plugin matrix, the Cloud-vs-self-host threshold, or the EU AI Act compliance approach, write to us. A senior engineer with shipped LiveKit agents in production replies within 24 hours.