LiveKit for AI agents · 2026 guide

A practical guide to LiveKit for AI agents — production architecture for voice and video AI.

How LiveKit AI agents actually work end-to-end at production scale. The five-stage runtime that every voice and video agent runs. The AgentSession primitive that unified the v0.x abstractions. The plugin ecosystem (Deepgram, AssemblyAI, Whisper · GPT-4o, Claude, Gemini · ElevenLabs, Cartesia, Azure) and the latency contributions of each layer. The Cloud-vs-self-host economics that cross over at ~10,000 minutes per month. Written from the agents we have shipped — Translinguist (16+ language pairs), VOLO.live (22K participants at Black Hat 2025), MindBox (industrial vision + voice).

20+ years in real-time voice & audio (since 2005) | 625+ projects total | Sub-500 ms latency target · 1.4–1.7 s production median
Industry recognition · 2019–2025
  • Top WebRTC Developer · 2022 · category leader
  • Best Custom A/V Dev. · 2025 · category winner
  • Clutch Global · Spring 2024 · top performer
  • APAC Insider 2024 · Innovation & Excellence
  • Clutch 5.0 / 30 reviews · Spring 2024 · top performer
Quick answer

LiveKit is two layers in one open-source platform. livekit-server is a WebRTC SFU (Apache 2.0, Go) — same category as mediasoup, Janus, Pion. livekit-agents is the agent framework that runs on top: an agent worker joins a LiveKit room as a participant, runs streaming STT, calls an LLM, and publishes streaming TTS back into the room.

LiveKit Agents went 1.0 in April 2025; as of April 2026 the Python framework sits at 1.5.x with adaptive interruption handling and native Model Context Protocol (MCP) tool support. LiveKit Cloud is the managed hosting option for both layers; self-hosted on Kubernetes is the alternative. A production LiveKit AI agent has five stages: session join, media capture and processing, AI reasoning loop (STT → LLM → tool-calls → TTS), response generation and output, and continuous context.

End-to-end latency in 2026: sub-500 ms is achievable with disciplined streaming at every stage, but the industry-published median sits at 1.4–1.7 seconds with P99 at 3–5 seconds. The biggest single lever is LLM time-to-first-token. The biggest architectural decision is Cascaded (STT → LLM → TTS) vs Speech-to-Speech (OpenAI Realtime, Gemini Live) — most production agents in 2026 default to Cascaded for tool-calling reliability and observability.

A custom LiveKit AI agent costs $5K–$80K to build over 1–4 months. LiveKit Cloud at production scale runs $0.05–$0.18 blended per minute (~40× cheaper than human BPO). Above 10,000 minutes per month the framework path undercuts managed Vapi / Retell by 60–80%; below 500 minutes per month managed wins. Above 100,000 minutes per month, self-hosted LiveKit on Kubernetes typically wins on TCO within six months.

Topics & use cases covered in this guide

Voice agents, video agents, multimodal — and the LiveKit primitives that hold each one up.

Four shapes of LiveKit agent dominate the 2026 landscape. Each gets a different runtime configuration, a different plugin stack, and a different deployment shape.

The 5-stage runtime · latency cascade

Where every millisecond of a LiveKit agent turn actually goes.

Stage 1 fires once per session. Stages 2–5 run per turn. The latency figures in this section assume a balanced production stack (Deepgram + GPT-4o-mini + Cartesia). Each stage has its own vendor stack, tuning knobs, and a failure mode that bites you in week one.

Sub-500 ms achievable: disciplined streaming + co-located regions
1.4–1.7 s production median: industry-published, millions of real calls
3–5 s P99: poorly-instrumented fleets · engagement degrades
[Chart: per-stage latency bars on a 0–2000 ms axis]
Per-turn critical path = stages 2 + 3 + 4 ≈ 1.4 s typical. Stage 5 runs in parallel and never blocks the next turn.
Stage 03 · AI reasoning loop (600–1400 ms)

STT → LLM → tool calls → (streaming response). The biggest single lever.

Streaming STT finalizes the user’s transcript and emits commit. The LLM streams its response token-by-token; if the LLM emits a tool call, the runtime executes it (with 2-second timeout and structured fallback), feeds the result back into the LLM context, and continues. The caller hears a brief filler so the pause feels intentional.

Vendor stack: STT (Deepgram Nova-3, AssemblyAI, Whisper) · LLM (GPT-4o-mini, Claude Haiku, Gemini Flash) · MCP toolsets
Key knobs: Streaming first-token under 300 ms is more valuable than stronger reasoning at 1 s · conversation summarization every 5 turns to avoid context blowout
Failure mode: Tool calls timing out silently. Fix: bounded retry + structured fallback response the LLM verbalizes naturally
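The bounded-retry-plus-fallback idea for stage 3 can be sketched in plain asyncio. This is a minimal illustration, not LiveKit's own implementation: `call_tool_bounded`, the `max_attempts` default, and the shape of the fallback dict are all hypothetical names chosen for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def call_tool_bounded(
    tool: Callable[[], Awaitable[Any]],
    timeout_s: float = 2.0,  # per-attempt budget (the 2-second timeout from the text)
    max_attempts: int = 2,   # hypothetical retry bound
) -> dict:
    """Run a tool with a per-attempt timeout and a bounded number of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await asyncio.wait_for(tool(), timeout=timeout_s)
            return {"status": "ok", "result": result, "attempts": attempt}
        except (asyncio.TimeoutError, ConnectionError):
            continue  # try again until the bound is reached
    # Structured fallback: something the LLM can verbalize instead of silence.
    return {
        "status": "unavailable",
        "attempts": max_attempts,
        "fallback_message": "I'm having trouble reaching that system right now.",
    }
```

Note that the worst-case wait is `timeout_s × max_attempts`, so a retry bound above 1 or 2 quickly eats the whole turn budget.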

The 5-stage runtime is what the SDK orchestrates for you. Your code is the prompt, the tools, and the agent class — not the plumbing. Tune the knobs per vertical: IVR replacement wants short endpointing; sales wants longer; healthcare wants longer still.

Reference architecture · the LiveKit platform stack

Four layers, one AgentSession.

LiveKit isn’t a generic 12-slot grid. It’s an opinionated platform with four named layers, each owning a clear responsibility. The AgentSession primitive ties them together. Click any layer to see the modules inside.

Agents layer · the runtime that ties everything together
3.1

AgentSession

The unified primitive. Wires STT, LLM, TTS, VAD, turn detection, tool calling, telemetry. Lifecycle hooks at every transition.

3.2

Worker + Job dispatch

Long-lived worker process registers with LiveKit server, accepts job dispatches, spawns one AgentSession per matching room.

3.3

Turn detection

Silero acoustic VAD + LiveKit’s ~135M SmolLM-v2 semantic detector + adaptive interruption ML (1.5+).

3.4

Tool calling

@function_tool decorators for local tools, MCPToolset for MCP-server tools. Same tool list, mixed sources.

3.5

Multi-agent handoff

session.update_agent() swaps system prompt + tool set mid-conversation. Session state survives.

3.6

userdata pattern

Typed dataclass per session. Caller phone, account ID, RAG context, conversation summary. Read by tools, injected into LLM prompt via Jinja2.
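As a rough sketch of the userdata pattern: a typed dataclass holds per-session state and a small render function injects it into the system prompt. The field names and `render_prompt` are illustrative assumptions, and `str.format` stands in for Jinja2 here only to keep the example standard-library-only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SessionData:
    """Per-session state read by tools and injected into the prompt (illustrative fields)."""
    caller_phone: str
    account_id: Optional[str] = None
    rag_context: List[str] = field(default_factory=list)
    summary: str = ""


PROMPT_TEMPLATE = (
    "You are a support agent.\n"
    "Caller: {caller_phone} (account: {account_id})\n"
    "Conversation so far: {summary}\n"
    "Reference notes:\n{notes}"
)


def render_prompt(data: SessionData) -> str:
    """Fill the system prompt from session state, with safe defaults for gaps."""
    notes = "\n".join(f"- {n}" for n in data.rag_context) or "- none"
    return PROMPT_TEMPLATE.format(
        caller_phone=data.caller_phone,
        account_id=data.account_id or "unknown",
        summary=data.summary or "none yet",
        notes=notes,
    )
```

Keeping the state typed means a tool that forgets to set `account_id` fails visibly in the rendered prompt ("unknown") rather than silently.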

The same agent code runs on LiveKit Cloud and self-hosted Kubernetes — the four layers above are identical in both. The fork is operational, not architectural. The next embed walks through the AgentSession primitive in detail.

AgentSession deep-dive · the unified primitive

One class, every voice agent.

In 0.x you picked between VoicePipelineAgent and MultimodalAgent. In 1.0+ both collapse into AgentSession. Pick a member to see its signature, when it fires, and the production pattern around it.

class AgentSession

AgentSession(stt, llm, tts, vad, turn_detection)

The unified primitive. Declare STT, LLM, TTS, VAD, and a turn detector as pluggable components. The SDK wires up streaming, end-of-utterance detection, barge-in cancellation, tool-call orchestration, and OpenTelemetry hooks.

from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, cartesia, silero

session = AgentSession(
    stt=deepgram.STT(model="nova-3"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=cartesia.TTS(model="sonic-3"),  # sonic-3 is a model name; pass voice=<voice ID> to pick a voice
    vad=silero.VAD.load(),
    turn_detection="multilingual",
)
Same AgentSession runs with llm=openai.realtime.RealtimeModel() to swap to speech-to-speech — no other code changes.

Workers register, jobs dispatch, AgentSession spawns — one per room. A Python worker holds 5–20 concurrent sessions per CPU core depending on plugin choice. Scale horizontally with more worker pods.
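The 5–20 sessions-per-core range translates into pod counts with simple arithmetic. The sketch below is a back-of-envelope estimate; `cores_per_pod` and the 80% headroom figure are assumptions for illustration, not LiveKit defaults.

```python
import math


def cores_needed(peak_sessions: int, sessions_per_core: int) -> int:
    """CPU cores required to hold the peak concurrent session count."""
    if sessions_per_core <= 0:
        raise ValueError("sessions_per_core must be positive")
    return math.ceil(peak_sessions / sessions_per_core)


def pods_needed(
    peak_sessions: int,
    sessions_per_core: int,
    cores_per_pod: int = 4,   # assumed pod size
    headroom_pct: int = 80,   # assumed: plan to run pods at 80% of capacity
) -> int:
    """Worker pods required, leaving headroom so autoscaling has time to react."""
    usable_per_pod = sessions_per_core * cores_per_pod * headroom_pct / 100
    return math.ceil(peak_sessions / usable_per_pod)
```

For 500 peak sessions, the pessimistic end of the range (5 per core) needs 100 cores and the optimistic end (20 per core) needs 25, which is why measuring your own plugin stack matters before sizing the fleet.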

Turn detection · what holds up under real callers

Three signals stacked — the only way barge-in stops feeling robotic.

LiveKit Agents 1.5 fuses an acoustic VAD, a local semantic turn detector, and an audio-based interruption ML model. Below is a 10-second conversation with each signal’s catches highlighted. Click a signal to see what it owns.

[Figure: user audio timeline · 10 s sample, 0–10 s axis]
Signal 03 · Adaptive interruption handling (1.5+)

The flagship 1.5 feature — rejects 51% of false barge-ins.

Audio-based ML model trained specifically to distinguish genuine user interruptions from incidental sounds: backchannels (“mm-hmm,” “uh-huh”), coughs, sighs, throat-clearing, background noise (TV, dog, kids, traffic). Enabled by default, no configuration required.

Reject rate: 51% of VAD-based barge-ins rejected (false positives avoided)
Accept latency: Faster than VAD in 64% of true barge-in cases
Production data: Interruption-misfire complaint rate dropped ~60% in Fora Soft fleet after 1.5 upgrade
P95 response latency: Dropped 150–300 ms via dynamic endpointing; pauses adapt to caller rhythm
The blue interruption at 8.5s is “wait actually” — the model yields immediately. The red mark at 7.2s is “mm-hmm” — the model keeps the agent talking. Silero alone would have stopped the agent on the backchannel.
Production tuning: three knobs
min_endpointing_delay: 200–600 ms · minimum silence before agent treats turn as complete
min_interruption_duration (interrupt_speech_duration in 0.x): 200–500 ms · how long user must speak post-barge-in before agent yields
turn_detection: "vad" · "stt" · "multilingual" / "english_v2" (recommended)

Acoustic VAD answers “is sound there.” Semantic detection answers “is the thought finished.” Adaptive interruption answers “is this actually a real interruption.” Three signals, three questions, one fused decision.
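To make the three questions concrete, here is a toy model of how the signals could combine. It is an illustration of the idea only: the thresholds, field names, and decision rules are assumptions, and LiveKit's actual fusion logic is not shown in this guide.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    vad_speech: bool              # acoustic VAD: is sound there?
    silence_ms: int               # silence since the user last spoke
    thought_complete_prob: float  # semantic detector: is the thought finished?
    interruption_prob: float      # interruption model: is this a real interruption?


def should_end_user_turn(
    s: TurnSignals,
    min_endpointing_delay_ms: int = 400,  # hypothetical value inside the 200–600 ms range
    semantic_threshold: float = 0.5,      # hypothetical threshold
) -> bool:
    """The user's turn ends only after enough silence AND a finished thought."""
    if s.vad_speech:
        return False
    return (
        s.silence_ms >= min_endpointing_delay_ms
        and s.thought_complete_prob >= semantic_threshold
    )


def should_yield_to_user(s: TurnSignals, interruption_threshold: float = 0.5) -> bool:
    """While the agent speaks, yield only to sound that looks like a real interruption."""
    return s.vad_speech and s.interruption_prob >= interruption_threshold
```

In this model a backchannel such as "mm-hmm" trips the VAD but scores low on the interruption model, so the agent keeps talking; "wait actually" scores high and the agent yields.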

Plugin matrix · pick a stack that fits

Filter the LiveKit plugin ecosystem by what your build actually needs.

There are 8+ STT, 12+ LLM, 10+ TTS providers shipping LiveKit plugins in 2026. Pick your latency budget, cost ceiling, and compliance requirement — the matrix dims everything that doesn’t fit and surfaces the recommended stack at the bottom.

Filters: Latency target · Cost ceiling · Compliance
STT · Streaming speech-to-text · 7 plugins
  • Deepgram Nova-3: first token 100–200 ms · $0.0043/min · tags: default, HIPAA
  • AssemblyAI Universal: 150–300 ms · 99 languages · tags: HIPAA, multilingual
  • OpenAI Whisper Realtime: 200–400 ms · same-vendor with GPT · tags: HIPAA Ent
  • Azure Speech: 200–400 ms · 100+ langs · tags: HIPAA, FedRAMP
  • Google Cloud STT: 150–350 ms · 125 langs · tags: HIPAA
  • Whisper self-host: 400–1000 ms · air-gapped · tags: self-managed
  • Speechmatics: 200–400 ms · UK accent strength · tags: HIPAA
LLM · Streaming reasoning + tool calling · 8 plugins
  • GPT-4o-mini: first token 150–250 ms · 80–120 tps · tags: default, tool-call 95%
  • GPT-4o: 200–400 ms · strongest reasoning · tags: HIPAA Ent, tool-call 98%
  • Claude Haiku 3.5: 150–300 ms · vision · tags: HIPAA
  • Claude Sonnet 3.5: 300–500 ms · strongest reasoning + vision · tags: HIPAA, tool-call 98%
  • Gemini 2.5 Flash: 200–400 ms · cost-tuned scale · tags: HIPAA
  • Gemini 2.5 Pro: 300–600 ms · 2M context · tags: HIPAA, long-ctx
  • Llama / Mistral self-host: 100–800 ms · air-gapped, regulated · tags: self-managed
  • OpenAI Realtime (S2S): n/a (audio in/out) · lower tool reliability · tags: HIPAA Ent, S2S
TTS · Streaming text-to-speech · 7 plugins
  • Cartesia Sonic 3: TTFB 70–150 ms · fastest streaming · tags: default, HIPAA
  • ElevenLabs Flash v3: 100–200 ms · most expressive · tags: HIPAA Ent, premium voice
  • OpenAI TTS: 200–400 ms · same-vendor stack · tags: HIPAA Ent
  • Azure Neural: 150–300 ms · 400+ voices · cheapest at scale · tags: HIPAA, FedRAMP
  • Deepgram Aura: 100–200 ms · single-vendor with Deepgram STT · tags: HIPAA
  • Google Cloud TTS: 200–350 ms · quality stable across langs · tags: HIPAA
  • PlayHT: 200–400 ms · voice cloning · 800+ voices · tags: HIPAA Ent, cloning
Recommended stack for current filters
  • STT: Deepgram Nova-3
  • LLM: GPT-4o-mini
  • TTS: Cartesia Sonic 3

Default 2026 production stack — matches every filter combination. ~$0.083 per minute.
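The filter-then-shortlist behaviour of the matrix is easy to reproduce offline. The sketch below encodes four of the STT rows from the matrix; the `Plugin` type and `shortlist` function are hypothetical helpers, and the rule "compare against the worst-case end of the latency range" is an assumption about how conservative you want to be.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Plugin:
    name: str
    latency_ms: Tuple[int, int]  # (best, worst) first-token range from the matrix
    tags: FrozenSet[str]


STT = [
    Plugin("Deepgram Nova-3", (100, 200), frozenset({"default", "HIPAA"})),
    Plugin("AssemblyAI Universal", (150, 300), frozenset({"HIPAA", "multilingual"})),
    Plugin("Azure Speech", (200, 400), frozenset({"HIPAA", "FedRAMP"})),
    Plugin("Whisper self-host", (400, 1000), frozenset({"self-managed"})),
]


def shortlist(
    plugins: List[Plugin],
    max_latency_ms: int,
    required_tag: Optional[str] = None,
) -> List[str]:
    """Keep plugins whose worst-case latency fits the budget and that carry the tag."""
    hits = [
        p for p in plugins
        if p.latency_ms[1] <= max_latency_ms
        and (required_tag is None or required_tag in p.tags)
    ]
    return [p.name for p in sorted(hits, key=lambda p: p.latency_ms[1])]
```

Running the same filter over the LLM and TTS rows and taking the first hit from each list gives a candidate stack to benchmark, not a final answer.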

LiveKit Cloud also ships LiveKit Inference — managed STT/LLM/TTS co-located with the SFU. Single billing surface, lowest cross-region latency. For ~80% of greenfield builds in 2026, Inference is the right default before swapping individual plugins for BYOK.

Cascaded vs S2S · the architectural fork

Weight your priorities — the architecture follows.

Four priority sliders below. Move each one to reflect how much your build cares about that dimension. The three architecture options (Cascaded, Speech-to-Speech, Hybrid) score themselves against your weighting in real time. The winner glows.

Cascaded · STT → LLM → TTS · score 70

Three discrete stages, each instrumentable, each swappable. Best for tool-heavy workflows + regulated PII paths. Sub-1.5 s achievable on streaming providers.

Tags: tool-call ~98% · stage-boundary redaction · ~1.4–1.7 s median

S2S · OpenAI Realtime / Gemini Live · score 62

Single multimodal model takes audio in, emits audio out. Sub-300 ms achievable. Tool calling 75–88% on multi-arg tools; the gap is closing through 2026.

Tags: sub-300 ms · natural prosody · no stage redaction

Hybrid · Cascaded hot-path + S2S rapport · score 70

2026’s winning pattern. Cascaded for tool-heavy turns (auth, lookup, write). S2S for the human-feel sections (small talk, escalation). Same AgentSession; swap on the fly.

Tags: balanced · 2026 default for >10K min/mo · most engineering

Verdict for the weights shown: Cascaded and Hybrid tie at 70; stronger weights break the tie.

The same AgentSession code runs cascaded or speech-to-speech — llm=openai.LLM() vs llm=openai.realtime.RealtimeModel() is a one-line swap. The decision is operational + product, not architectural.

MCP & tool calling · letting agents actually do work

Three patterns for tool calling in production.

A voice chatbot becomes a voice agent the moment it can take action mid-conversation. LiveKit Agents 1.5 ships local function tools + first-class MCP support. Below: the three production patterns and the tool-call reliability gradient by LLM.

Pattern 03 · hybrid (the 2026 default)

Local for hot-path, MCP for everything else.

Real production agents end up here. Latency-critical tools (lookup the caller’s account, fetch session state) stay local. Cross-team integrations (charge a card via Stripe, file a Zendesk ticket, query Salesforce) go via MCP. Same tool list, mixed sources.

agent = Agent(
    instructions="You are a customer support agent.",
    tools=[
        get_order_status,        # local function tool
        book_callback,           # local function tool
        crm_mcp_toolset,         # MCP server (CRM team owns)
        billing_mcp_toolset,     # MCP server (Finance team owns)
    ],
)
Read cheap, write expensive. Let the agent look up anything; put writes behind “yes, go ahead.”
2-second timeout + structured fallback. Every tool wrapper. “System unavailable” beats silence.
Log every tool call. EU AI Act (Article 50, August 2026) makes this non-optional.
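The first and third rules (confirm before writes, log every call) can share one wrapper. This is a minimal sketch under stated assumptions: `run_tool`, the `WRITE_TOOLS` set, and the log record shape are hypothetical, and a real agent would route the confirmation prompt back through the LLM.

```python
import json
import logging
import time
from typing import Any, Callable

log = logging.getLogger("tool_calls")

# Hypothetical names of tools that change state and therefore need a spoken "yes".
WRITE_TOOLS = {"charge_card", "file_ticket"}


def run_tool(name: str, fn: Callable[..., Any], args: dict, confirmed: bool = False) -> dict:
    """Run reads freely; hold writes until the caller has confirmed. Log either way."""
    if name in WRITE_TOOLS and not confirmed:
        outcome = {
            "status": "needs_confirmation",
            "prompt": f"Should I go ahead with {name}?",
        }
    else:
        outcome = {"status": "ok", "result": fn(**args)}
    # One structured log line per tool call, including calls that were held back.
    log.info(json.dumps(
        {"ts": time.time(), "tool": name, "args": args, "status": outcome["status"]},
        default=str,
    ))
    return outcome
```

Tool arguments can themselves contain PII, so in regulated deployments the logged `args` would pass through the same redaction layer as transcripts.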
Tool-call reliability by LLM · Fora Soft fleet, 30+ shipped agents
Schema correctness on first attempt · multi-arg tools
  • GPT-4o (text): 98%
  • Claude 3.5 Sonnet: 98%
  • GPT-4o-mini: 95%
  • Gemini 2.5 Flash: 93%
  • OpenAI Realtime S2S: 88%

Takeaway: cascaded GPT-4o or Claude Sonnet for any agent where tool-call reliability is critical — regulated workflows, write actions, multi-tool orchestration. S2S is closing the gap fast but still trails on multi-arg tools through early 2026.

2026 MCP catalog: Stripe, Salesforce, HubSpot, Zendesk, Intercom, Notion, Linear, GitHub, Google Calendar, Microsoft Graph, Slack, AWS, Postgres, Snowflake, Twilio — plus the internal MCP servers your own teams ship.

SIP & telephony · putting the agent on a phone number

A phone call is just another LiveKit participant.

LiveKit SIP and LiveKit Phone Numbers went GA in 2025. A voice agent now accepts a call from any phone with ~4 lines of config. Inbound, outbound, DTMF, SIP REFER, attended transfer — all first-class.

Inbound · step 03

LiveKit SIP matches a dispatch rule.

Dispatch rules map inbound calls to LiveKit rooms. The SIP service authenticates the trunk, matches the rule, and creates a SIP participant in a new (or existing) room. Workers watching that room get the dispatch.

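from livekit.protocol.sip import (
    CreateSIPDispatchRuleRequest,
    SIPDispatchRule,
    SIPDispatchRuleIndividual,
)

# sip_service is the SIP client of a LiveKitAPI instance (its calls are
# coroutines); trunk is an inbound trunk created earlier.
dispatch_rule = await sip_service.create_sip_dispatch_rule(
    CreateSIPDispatchRuleRequest(
        # Individual rule: each caller lands in a fresh room prefixed "call-".
        rule=SIPDispatchRule(
            dispatch_rule_individual=SIPDispatchRuleIndividual(
                room_prefix="call-",
            ),
        ),
        name="inbound-support",
        trunk_ids=[trunk.sip_trunk_id],
    )
)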
Outbound · step 02

LiveKit SIP creates an outbound SIP participant.

One API call. Your server picks a SIP trunk, the destination phone number, and the LiveKit room. LiveKit returns an INVITE handle you can poll for state.

from livekit.protocol.sip import (
    CreateSIPParticipantRequest,
)

# The SIP client methods are coroutines, so the call is awaited.
participant = await sip_service.create_sip_participant(
    CreateSIPParticipantRequest(
        sip_trunk_id=trunk.sip_trunk_id,
        sip_call_to="+14155551234",
        room_name="outbound-followup",
        participant_identity="customer-1234",
    )
)
Telephony primitives shipped with LiveKit SIP
  • DTMF: Touch-tones in (ID verify, IVR navigation) and out (transferring through downstream IVRs).
  • SIP REFER: Cold transfer to another phone number. LiveKit drops out; the human takes over.
  • Attended transfer: Warm transfer. The agent dials the human, briefs them, then bridges the customer.
  • Hold & resume: Mute agent audio while it queries a tool. Music-on-hold or silence. Resume cleanly.

Pricing: LiveKit Phone Numbers is competitive per-minute with Twilio but loses on per-number monthly fees under ~500 minutes/month/number. Above that, the native option simplifies the operational surface.

Cloud vs self-host · the deployment fork

Same code. Different infrastructure. Plan for the switch.

Every LiveKit project hits this fork — usually around minute 60 of the architecture call. The right answer changes by 6–12 months as the product scales. Design for reversibility from day one. The same AgentSession code runs identically on both.

Start here · What does your deployment need?
LiveKit Cloud

The managed option — default for the first 12–18 months of most products.

LiveKit Cloud is the managed SFU + agent dispatch + Inference platform. Free tier through Build / Ship ($50/mo) / Scale ($500/mo) / Enterprise. Autoscaling worker pools, multi-region, observability built in.

Wins when: Sessions < 5K concurrent · light compliance · no K8s capacity · ship in 4–8 weeks
Pricing model: $0.01/agent-active-minute + SFU participant minutes + Inference credits + LiveKit Phone Numbers
50-agent ops example: 50 agents × 8 hr/day ≈ $5K–$15K/month all-in (Cloud + Deepgram + GPT-4o-mini + Cartesia)
Scaling: Automatic. Worker pool grows on dispatch queue depth.
Loses when
  • Sessions clear 5K concurrent
  • HIPAA + BAA chain required across every plugin
  • GDPR EU-only data residency
  • FedRAMP / sovereign-cloud requirements
  • Custom server-side gating LiveKit Cloud doesn’t expose
TCO crossover · cost as a function of monthly minutes
  • Below 10K min/mo: Cloud wins on TCO once engineering time is counted.
  • 10K–100K min/mo: Roughly comparable. Pick by compliance + engineering capacity, not by cost.
  • Above 100K min/mo: Self-host typically wins within 6–9 months. Disclosed savings: 30–50% per call.
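The crossover bands and the "loses when" conditions collapse into a small decision function. This is a simplification for illustration: `recommend_deployment` and its parameters are hypothetical, and `hard_compliance` lumps together the data-residency, FedRAMP, and full-BAA-chain cases listed above.

```python
def recommend_deployment(
    monthly_minutes: int,
    peak_concurrent: int = 0,
    hard_compliance: bool = False,  # EU-only residency, FedRAMP, full BAA chain, etc.
) -> str:
    """Map volume and hard constraints to 'cloud', 'either', or 'self-host'."""
    # Hard stops override the cost bands entirely.
    if hard_compliance or peak_concurrent > 5_000:
        return "self-host"
    if monthly_minutes < 10_000:
        return "cloud"
    if monthly_minutes <= 100_000:
        return "either"  # decide on engineering capacity, not cost
    return "self-host"
```

Because the agent code is identical on both sides, re-running this check every quarter as volume grows costs nothing architecturally.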

The agent code does not change between Cloud and self-host. The infrastructure layer does. Keep the integration layer portable from day one, and the decision becomes operational, not architectural.

Observability & SRE · what to measure, what to page on

Seven metrics. Four alerts. Everything else is dashboards.

A voice agent failure is subtle: a 3-second LLM TTFT, a tool that returned the wrong row, a TTS that mispronounced the caller’s name. None throw exceptions. Per-turn telemetry + nightly evals are the only path to operational confidence.

Metric 01 · LLM TTFT (time-to-first-token)

The latency budget’s biggest single line item.

Time from end-of-user-utterance to first LLM token. Anything above 1 s breaks the conversational feel. Usually the dominant component of P95 end-to-end latency.

alert · LLM_TTFT_p95_high
  if P95(llm_ttft_ms) > 1200 for 5m
  group_by: model, region
  page: senior on-call
  remediation: check provider status, rotate to fallback model

Common causes: model-provider degradation, context bloat (forgot to summarize long sessions), cross-region routing.
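The alert rule above reduces to "P95 over threshold in each of the last five one-minute windows". A minimal sketch of that evaluation follows; in practice Prometheus or your APM computes this, and the nearest-rank percentile and the per-minute window layout are assumptions of the example.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (integer math avoids float rounding on the rank)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def ttft_alert(
    windows: Sequence[Sequence[float]],  # one list of TTFT samples (ms) per minute, oldest first
    threshold_ms: float = 1200.0,
    sustained_windows: int = 5,
) -> bool:
    """Fire only if every one of the most recent windows breaches the threshold."""
    recent = windows[-sustained_windows:]
    if len(recent) < sustained_windows:
        return False
    return all(len(w) > 0 and p95(w) > threshold_ms for w in recent)
```

Requiring every window to breach is what keeps a single slow minute from paging anyone; grouping by model and region would mean running this per label set.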

The four alerts that actually pay rent
  • 01 · P95 LLM TTFT > 1.2 s sustained 5 minutes: model provider issue or context bloat
  • 02 · Tool-call success < 90% over 15 minutes: function-call regression or downstream outage
  • 03 · Interruption rate > 25% over 1 hour: adaptive interruption miscalibrated for new caller mix
  • 04 · Worker capacity > 80% on any region 5+ minutes: capacity exhaustion, autoscale lagging

Most LiveKit fleets drown in noise. These four pay rent — everything else is a dashboard you check at standup, not a page.

LiveKit Agents ships OpenTelemetry traces + Prometheus metrics on port 8081 + lifecycle hooks at every transition. Wire them to Grafana / Datadog / Honeycomb, alert on percentiles, never on means. Build the nightly eval harness in week 2 — not month 6.

Cost calculator · what production actually costs

Per-minute math, payback in days, vs human BPO.

Pick the stack tier, drag the volume slider, toggle telephony, choose the deployment mode. Monthly cost, per-call cost, and payback period update live. Numbers are 2026-indicative; real costs come in 20–30% higher with retries and ops overhead.

Stack tier

Balanced: Deepgram Nova-3 + GPT-4o-mini + Cartesia. Default 2026 production stack.

Monthly minutes: 12,000 (slider range 1K–500K)
Telephony (PSTN)

Adds LiveKit Phone Numbers or Twilio trunk on top of agent costs.

Deployment

LiveKit Cloud: managed. Below 5K concurrent, no compliance hard-stops, no Kubernetes capacity.

Monthly cost: $996 ($0.083/min × 12,000 min)
Per-call cost: $0.25 (assuming 3-min average call)
Build cost (MVP): $30K (1–2 months · production-ready MVP)
Saved vs $9/call BPO: $35,004 / mo (4,000 calls/mo × $9 − $996, i.e. every call replaced; at 40% containment the same math gives about $13,404 / mo)
Build payback: 26 days (build cost ÷ monthly savings, at the $35,004 figure)

Reality check: first production deployment typically comes in 20–30% higher than the table above. Real calls include retries, unused tokens, ops overhead, and the occasional vendor-pricing surprise. Containment rates of 40–70% on tier-1 support cap the realized savings; even at 40% containment, payback is usually under 10 weeks.
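The calculator's arithmetic is simple enough to reproduce and re-run with your own numbers. The function below is a sketch of that math using the figures quoted in this section; `monthly_economics` and its `containment` parameter are names introduced for the example, and a 30-day month is assumed for the payback conversion.

```python
def monthly_economics(
    minutes: int = 12_000,
    cost_per_min: float = 0.083,
    avg_call_min: float = 3.0,
    bpo_cost_per_call: float = 9.0,
    build_cost: float = 30_000.0,
    containment: float = 1.0,  # share of calls the agent fully handles
) -> dict:
    """Monthly agent cost, per-call cost, BPO savings, and build payback in days."""
    agent_cost = minutes * cost_per_min
    calls = minutes / avg_call_min
    per_call = agent_cost / calls
    savings = calls * containment * bpo_cost_per_call - agent_cost
    payback_days = build_cost / savings * 30  # assumes a 30-day month
    return {
        "agent_cost": round(agent_cost, 2),
        "calls": round(calls),
        "per_call": round(per_call, 2),
        "savings": round(savings),
        "payback_days": round(payback_days),
    }
```

With the defaults this reproduces the $996, $0.25, $35,004, and 26-day figures. Setting `containment=0.4` drops savings to about $13,404 per month and stretches payback to roughly 67 days, which is consistent with the "under 10 weeks" claim above.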

Framework decision tool · 4 questions, 6 frameworks

Answer four questions — the framework ranking updates live.

Pick the option that matches your build for each of the four dimensions below. The six frameworks (LiveKit, OpenAI Realtime API, Vapi, Retell, Pipecat, Bland) score themselves against your answers in real time. The winner glows.

  • 01 · Monthly call volume
  • 02 · Customization depth
  • 03 · Time to first call
  • 04 · What matters most


We’ll tell you on the call if your build doesn’t need us. We’ve walked away from projects that did better on Vapi or stayed on direct OpenAI Realtime — both legit answers for the right shape. The tool is a starting point; the architecture call decides.

Production patterns · what holds up under real traffic

Six patterns every shipped LiveKit agent needs.

Across 30+ shipped agents, these six patterns recur in every production build. Skip any of them and the agent regresses inside a month. Click any pattern to see the implementation + the failure mode it prevents.

01

Warm worker pools

Cold-start agent workers add 1.5–3 seconds before the first turn. Keep 2–3 pre-warmed workers per region; autoscale on dispatch queue depth, not after the queue saturates.

Implementation
  • LiveKit Cloud: handled automatically — scale up before queue saturation
  • Self-host on K8s: HPA on active-session metric, minReplicas set to peak/3
  • prewarm function in worker code: load VAD model + vector index at process startup
Without it

P99 join latency spikes 2–4 s on traffic ramps. Customers hear silence after dialing. Marketing pushes look like outages.

02

Tool timeouts and graceful fallbacks

Every tool wraps with a 2-second timeout and a structured fallback response the LLM can verbalize naturally. “Trouble reaching that system…” beats silence or hallucinated tool output.

Implementation
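import asyncio

from livekit.agents import RunContext, function_tool

# `crm` and `ServiceUnavailable` come from your own CRM client library.
@function_tool
async def get_account(ctx: RunContext, account_id: str) -> dict:
    """Fetch an account record, or a speakable fallback if the CRM is slow."""
    try:
        # Hard 2-second budget so a hung CRM never becomes dead air.
        return await asyncio.wait_for(
            crm.fetch(account_id), timeout=2.0
        )
    except (asyncio.TimeoutError, ServiceUnavailable):
        return {"status": "unavailable",
                "fallback_message":
                "I'm having trouble reaching the account "
                "system. Would you prefer I take a message?"}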
Without it

Tool hangs → LLM waits → user hears dead air → user hangs up. Or worse: LLM hallucinates a plausible-sounding response based on tool history.

03

PII redaction at stage boundaries

In regulated workflows, redact PII between stages. Drop credit-card digits from the STT transcript before the LLM sees them. Drop diagnosis details from the LLM response before TTS reads them.

Implementation
  • Regex-mask card numbers in the STT->LLM hook (on_user_turn_completed in 1.x; user_speech_committed in 0.x)
  • PCI: route card-entry digits via DTMF straight to the payment processor (LLM never sees them)
  • HIPAA: configurable redaction layer at the function-call boundary, persisted in audit log redacted
  • Both directions: anything LLM speaks back becomes recorded audio
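The regex-masking bullet can be sketched as follows. This is a minimal example, not a complete PCI control: `mask_cards` and the placeholder text are hypothetical, and the Luhn check is an added safeguard (not mentioned in the list above) to avoid masking long digit runs that are not card numbers.

```python
import re

# 13–19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_cards(text: str) -> str:
    """Replace card-like digit runs that pass the Luhn check with a placeholder."""
    def _mask(m: "re.Match[str]") -> str:
        digits = re.sub(r"\D", "", m.group())
        return "[CARD REDACTED]" if luhn_ok(digits) else m.group()
    return CARD_RE.sub(_mask, text)
```

STT output sometimes spells digits as words ("four one one one"), which this pattern will not catch; that is one reason the DTMF route is the safer path for card entry.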
Without it

PCI DSS violation on the first recorded card number. HIPAA breach on the first transcript leak. The audit finds it before you do.

04

Conversation summarization for long sessions

Calls longer than ~10 minutes blow LLM context limits. Every 5 turns, summarize prior conversation into a compact passage; replace older turns with the summary in the prompt. Keeps context bounded; preserves continuity.

Implementation
  • Trigger on turn count (every 5 turns) or token count (every 6K tokens)
  • Summarization model: same LLM, lightweight prompt, < 200 tokens output
  • Persist summary in session.userdata for replay / audit
  • Inject summary in system prompt via Jinja2 template
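A rolling-summary trigger along these lines might look like the sketch below. The names are hypothetical, the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and `summarize` stands in for the lightweight LLM call described in the list.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def compact_history(
    turns: List[str],
    summarize: Callable[[List[str]], str],  # stand-in for the summarization LLM call
    every_turns: int = 5,
    max_tokens: int = 6_000,
    keep_recent: int = 2,  # assumed: keep the latest turns verbatim for continuity
) -> List[str]:
    """Once history reaches the turn or token limit, fold older turns into a summary."""
    tokens = sum(estimate_tokens(t) for t in turns)
    if len(turns) < every_turns and tokens < max_tokens:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    if not older:
        return turns
    return [f"[summary] {summarize(older)}"] + recent
```

Calling this after every turn keeps the prompt at a handful of entries regardless of call length; the returned summary string is what would be persisted in the session's userdata for replay and audit.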
Without it

P95 LLM TTFT walks up linearly with call duration. By minute 12 the agent is at 3+ second response times. By minute 20 it crashes the context window outright.

05

Persona consistency via prompt versioning

LLM responses drift across model versions and prompt edits. Version your system prompt. Tag every turn with the prompt version. Run the eval harness on every prompt change before shipping.

Implementation
  • System prompts in version control, semver tagged
  • Prompt version emitted as an OTel span attribute on every turn
  • Eval harness blocks prompt-version bumps that drop persona score below threshold
  • Multi-agent stacks: separate prompt versions per Agent class
Without it

You changed the prompt 3 weeks ago. Customer complains today. You have no way to know which version generated the bad response. Or which version to roll back to.

06

Eval-driven model upgrades

When OpenAI ships GPT-5o or Anthropic ships Claude 4 Opus, do not just swap in production. Run the eval harness in staging; compare pass rate, latency, cost; ship only if the new model wins on all three.

Implementation
  • Eval harness: 100–500 canned scenarios with rubric grading
  • Rubric covers: tool-call correctness, factual accuracy, persona, refusal behavior, PII redaction, latency budget
  • Blue-green deploy with explicit rollback path
  • Block model bumps below threshold — even if the vendor changelog says “25% faster”
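The "wins on all three" rule is worth encoding so nobody argues with it at 2 a.m. A minimal sketch, with a hypothetical `EvalResult` record and an assumed absolute pass-rate floor of 90%:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    pass_rate: float       # share of canned scenarios passing the rubric
    p95_latency_ms: float
    cost_per_min: float


def should_ship(current: EvalResult, candidate: EvalResult, min_pass_rate: float = 0.90) -> bool:
    """Ship only if the candidate is at least as good on pass rate, latency, and cost."""
    return (
        candidate.pass_rate >= min_pass_rate
        and candidate.pass_rate >= current.pass_rate
        and candidate.p95_latency_ms <= current.p95_latency_ms
        and candidate.cost_per_min <= current.cost_per_min
    )
```

A candidate that is 25% faster but scores six points lower on the rubric fails this gate, which is exactly the regression the pattern is meant to block.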
Without it

The new model is 25% faster but 6% worse at tool calling. Production-incident MTTR is 4–72 hours depending on how long the regression takes to surface in complaints. Rollback isn’t free either — user-visible.

Plus a 7th meta-pattern: instrument before launch. Per-turn telemetry + nightly eval are the only path to operational confidence on a system this complex.

Compliance · five regulatory frames every voice agent plans around

Voice agents touch more regulation than text agents.

Live audio is typically classified as biometric data. Conversations get recorded. Tool calls take actions on user systems. Five regulatory frames every production LiveKit deployment plans around. Click any zone for scope + the Fora Soft production pattern.

EU AI Act · 2 August 2026

Transparency obligations (Article 50) apply from 2 August 2026.

If the agent serves EU users, three things become non-optional. Engineering effort to comply: 4–8 engineer-weeks for a standard-risk use case. Plan for it before the deadline.

Article 50 disclosure: “You are speaking with an AI assistant.” Logged and consent-captured. Spoken in the first 5 seconds.
Logged interaction record: Structured logs with timestamps, model versions, tool calls, agent decisions. Retained per Article 50.
Foundation-model compliance: Documentation that the underlying model is compliant (vendor model card + EU AI Act conformance documentation)
Engineering cost: 4–8 weeks for standard-risk · longer for high-risk classifications (healthcare triage, employment, credit)
What breaks if missing: Inability to legally serve EU users post-August 2026. Fines reach up to 7% of global annual turnover for prohibited practices and up to 3% for transparency breaches.
Two-party recording-consent states (US)
California, Florida, Illinois, Maryland, Massachusetts, Montana, Nevada, New Hampshire, Pennsylvania, Washington

Agent opening: “This call may be recorded for quality and AI training purposes. Press 1 or say ‘yes’ to continue, or stay on the line to opt out.”

Documentation as the audit primitive
  • AI disclosure script (verbatim, per language)
  • Recording retention policy (per jurisdiction)
  • Tool catalog the agent is allowed to call
  • Escalation rule for unhappy callers, suicidality, medical emergencies
  • Data residency map (which user audio lives where)

Document-then-enforce, not the other way around. That document is the regulator-ready audit trail.

Compliance is engineering, not a checklist. Build the BAA chain, the audit log, the consent capture, the disclosure script before the first customer call — then operate within them.

Production examples

Four shipped AI agents across four very different shapes.

An AI sales-coach analyzing customer behavior in real time. An AI stylist running on-device on iPhone. Real-time speech-to-speech translation across 75+ languages. AI-powered video management at 99.5%+ facial-recognition accuracy. Four production builds running today — each a different shape of AI agent on top of real-time media.

Meetric · AI sales conversation intelligence

Real-time customer behavior + AI speech coaching

Swedish AI sales video platform that analyzes customer behavior during live presentations, identifies speech strengths and weaknesses, and generates post-meeting reports. Integrates with Google Meet, MS Teams, Zoom. Captures conversations across video, phone, email, chat. Outcomes: up to 25% lift in close rates, 30× coaching efficiency, 80–100% CRM data-entry automation. SEK 21M (~$2.25M) seed in 2025.

AI Wardrobe App · on-device AI stylist

YOLOv8m + CLIP recognition, fully on-device

Personalized AI stylist on iOS in the $4.3B AI-wardrobe market (projected $13.5B by 2033). Custom-trained YOLOv8m for garment object detection plus CLIP for semantic understanding, all running on-device via TensorFlow Lite. Recognizes garment type, season, color, fabric, sleeve length, neckline. Generates outfit suggestions based on wardrobe + weather + occasion. No round-trip to the cloud.

TransLinguist · AI speech-to-speech translation

75+ languages · NHS UK national framework

Video conferencing platform for professional interpreters — marketplace of 30,000+ certified interpreters, estimated $4.2M annual revenue. AI speech-to-speech translation in 16+ languages with closed captioning in 22 languages. Won the NHS (NOE CPC) national framework across NHS, councils, schools, police, fire/rescue. Clients report 50% cost savings, 80% reduction in interpreting costs, 2× ROI in two years.

MindBox · AI video management system

99.5%+ facial recognition · 500K+ vehicles/day ANPR

AI-powered intelligent video management system — 50+ deployments across transport, pharma, and gated communities since 2020. Facial recognition at 99.5%+ accuracy beats Google + Facebook benchmarks, with anti-spoofing resistant to photo/video attacks. ANPR module captures license plates of 500,000+ vehicles daily across India at ~95% accuracy. Real-time anomaly alerts trigger automatic recording on intrusion, fire, or crowd buildup.

Decision framework

Build vs Buy vs Hybrid — when each one wins.

Three architectural paths for shipping a real-time AI voice or video agent. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether the platform is the product or supports the product.

Build on LiveKit

When you need control a managed SDK won’t give you

Wins when: real-time video on the agent, LLM provider flexibility, self-host option for compliance, usage above 10K minutes/month, custom turn-taking / voice cloning / RAG, brand-embedded experience, multi-tenant SaaS plays.

Cost shape: $5K–$80K build over 1–4 months. $1K–$5K monthly operations. LiveKit Cloud $0.05–$0.18 per minute (or self-host above 100K min/mo for 30–50% savings). Archetypes: Mindwibe, Translinguist, VOLO.live, MindBox.

Buy managed

When speed matters more than customization

Wins when: under 10K minutes/month, standard call patterns, no compliance pressure, no in-house engineering capacity. Vapi for developer-first teams (~2 hours to first call). Retell for visual-builder non-technical teams. Bland for outbound-heavy TCPA workflows.

Cost shape: $0.05–$0.13 per minute (Vapi list) or $0.23–$0.50 BYOK. No upfront build cost. Vendor lock-in.

Hybrid

Managed pilot, framework scale-up

Wins when: validating voice UX before committing to a framework, running staging on Vapi / Retell while building production on LiveKit, or half the workload fits managed (the voicemail leg) while the other half needs LiveKit (the live agent leg).

Pattern: start on Vapi / Retell for 4–8 weeks to validate; build the LiveKit production version once usage clears 10K minutes/month; cut over with a portability layer so prompt and tools transfer.

Cost ranges are 2026-indicative. Implementation specifics — concurrency target, compliance scope, multimodal vs voice-only, tool-call surface, multi-region cascade — dominate the spread within each tier.

A custom LiveKit AI agent costs $5K–$80K to build over 1–4 months. The break-even point vs managed Vapi / Retell sits around 10,000 minutes per month — below, managed ships faster; above, LiveKit undercuts on per-minute economics by 60–80%.
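The break-even claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: the three input values are assumptions picked from inside the ranges quoted in this section (a managed rate of $0.23 per minute, a LiveKit rate of $0.08 per minute, and $1,500 of fixed monthly operations), and the one-off build cost is deliberately left out.

```python
def monthly_cost_managed(minutes: float, rate_per_min: float) -> float:
    """Managed platform: pure per-minute pricing, no fixed operations cost."""
    return minutes * rate_per_min


def monthly_cost_built(minutes: float, rate_per_min: float, fixed_ops: float) -> float:
    """Custom LiveKit build: fixed monthly operations plus per-minute usage."""
    return fixed_ops + minutes * rate_per_min


def break_even_minutes(managed_rate: float, built_rate: float, fixed_ops: float) -> float:
    """Monthly minutes at which both options cost the same."""
    if managed_rate <= built_rate:
        raise ValueError("no break-even: managed rate must exceed the built rate")
    return fixed_ops / (managed_rate - built_rate)


# Assumed inputs, chosen from within the ranges quoted above (not measured data).
MANAGED_RATE = 0.23   # USD per minute
BUILT_RATE = 0.08     # USD per minute
FIXED_OPS = 1_500.0   # USD per month

if __name__ == "__main__":
    minutes = break_even_minutes(MANAGED_RATE, BUILT_RATE, FIXED_OPS)
    print(f"break-even at about {minutes:,.0f} minutes per month")
```

With these particular inputs the crossover lands at roughly 10,000 minutes per month. Different rates or a higher operations cost move it substantially, so rerun the calculation with your own quotes.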

FAQ

Twelve questions every LiveKit architecture review covers.

What is LiveKit and what are LiveKit agents?

LiveKit is an open-source WebRTC SFU paired with an agent framework (livekit-agents) for building real-time voice and video AI agents. The SFU handles media transport. The agent framework runs as a participant in the room, processes audio and video through STT, calls an LLM, and publishes responses via TTS. LiveKit Cloud is the managed hosting; self-host on Kubernetes is the alternative.

How do LiveKit AI agents work?

Five stages: (1) the agent worker joins the LiveKit room, (2) it captures audio (chunked at 20-40 ms) and video (1-4 fps) from participants, (3) it runs streaming STT, calls the LLM, and orchestrates function-calls, (4) it generates a response via streaming TTS and publishes back into the room, (5) it writes session state to memory. End-to-end latency is typically 1.2-2.5 seconds for voice-only, 1.8-4.5 seconds for multimodal.
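To make the 20-40 ms chunking in stage 2 concrete, here is a small sketch of the buffer sizes involved. The 16 kHz, mono, 16-bit PCM format is an assumption for illustration; the actual format depends on the codec and the STT plugin in use.

```python
def samples_per_chunk(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of audio samples per channel in one chunk."""
    return sample_rate_hz * chunk_ms // 1000


def bytes_per_chunk(
    sample_rate_hz: int,
    chunk_ms: int,
    channels: int = 1,          # assumed mono
    bytes_per_sample: int = 2,  # assumed 16-bit PCM
) -> int:
    """Size in bytes of one uncompressed PCM chunk."""
    return samples_per_chunk(sample_rate_hz, chunk_ms) * channels * bytes_per_sample


if __name__ == "__main__":
    for ms in (20, 40):
        n = samples_per_chunk(16_000, ms)
        size = bytes_per_chunk(16_000, ms)
        print(f"{ms} ms at 16 kHz: {n} samples, {size} bytes")
```

At 16 kHz a 20 ms chunk is 320 samples (640 bytes) and a 40 ms chunk is 640 samples (1,280 bytes). Smaller chunks reach the STT sooner at the cost of more per-chunk overhead.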

LiveKit vs OpenAI Realtime API — which should I use?

OpenAI Realtime API is voice-only, locked to GPT-4o, and simpler to set up (2-4 weeks). LiveKit is multimodal-capable, LLM-provider-flexible, and can be self-hosted (4-8 weeks setup). Use the OpenAI Realtime API directly for a single voice agent built by a strong in-house team. Use LiveKit when you need real-time video, custom turn-taking, multiple LLM providers, or self-hosting for compliance.

Should I use LiveKit Cloud or self-host?

Use LiveKit Cloud below 5,000 concurrent users when there are no compliance constraints. Self-host above 5,000 users, when HIPAA / SOC2 / GDPR / FedRAMP applies, or when you need custom server-side routing. Self-host wins on cost above ~$5K/month of Cloud spend. Design the integration layer to be portable so the agent code does not change between Cloud and self-host.
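The rule of thumb in this answer can be written down as a small decision function. This is a sketch, not a LiveKit API: the `Deployment` type and `recommend_hosting` function are hypothetical names, and treating exactly 5,000 users or exactly $5K of spend as "stay on Cloud" is an assumption, since the text only covers below and above.

```python
from dataclasses import dataclass

USER_THRESHOLD = 5_000         # concurrent users, from the answer above
SPEND_THRESHOLD_USD = 5_000.0  # monthly Cloud spend, from the answer above


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of a planned deployment."""
    concurrent_users: int
    compliance_regimes: frozenset = frozenset()  # e.g. {"HIPAA", "GDPR"}
    monthly_cloud_spend_usd: float = 0.0
    needs_custom_routing: bool = False


def recommend_hosting(d: Deployment) -> tuple[str, list[str]]:
    """Return ("cloud" or "self-host", reasons pushing toward self-host)."""
    reasons = []
    if d.compliance_regimes:
        reasons.append("compliance: " + ", ".join(sorted(d.compliance_regimes)))
    if d.needs_custom_routing:
        reasons.append("custom server-side routing")
    if d.concurrent_users > USER_THRESHOLD:
        reasons.append("more than 5,000 concurrent users")
    if d.monthly_cloud_spend_usd > SPEND_THRESHOLD_USD:
        reasons.append("Cloud spend above $5K/month")
    return ("self-host" if reasons else "cloud", reasons)


if __name__ == "__main__":
    print(recommend_hosting(Deployment(concurrent_users=800)))
    print(recommend_hosting(Deployment(800, frozenset({"HIPAA"}))))
```

Returning the reasons alongside the verdict keeps the decision reviewable when the thresholds are revisited.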

How much does it cost to build a LiveKit AI agent?

A custom LiveKit AI agent costs $30K-$200K to build over 8-20 weeks. A 2-week Architecture Sprint at $15K-$25K produces a fixed-bid quote for the build. LiveKit Cloud usage at production scale runs $0.40-$2.00 per agent-hour plus per-participant-minute. Ongoing operations run $8K-$25K per month.

What is the typical end-to-end latency for a LiveKit voice agent?

1.2-2.5 seconds median (P50) for a voice-only agent with streaming STT (Deepgram), GPT-4o, and ElevenLabs Flash for TTS. 1.8-4.5 seconds for a multimodal agent that adds vision-language model processing. Above 3.5 seconds, user engagement materially degrades. The largest single latency component is usually the LLM first-token latency (400-1,000 ms).
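A latency budget makes these numbers easier to reason about. In the sketch below only the LLM first-token figure falls inside a range stated above (400-1,000 ms) and the 3.5 second limit comes from the text; every other per-stage value is a hypothetical placeholder to be replaced with your own measurements.

```python
ENGAGEMENT_LIMIT_MS = 3_500  # above this, engagement degrades (per the text above)


def total_latency_ms(stages: dict[str, int]) -> int:
    """End-to-end latency as the sum of sequential stage latencies."""
    return sum(stages.values())


def largest_stage(stages: dict[str, int]) -> str:
    """Name of the stage contributing the most latency."""
    return max(stages, key=lambda name: stages[name])


def within_budget(stages: dict[str, int], limit_ms: int = ENGAGEMENT_LIMIT_MS) -> bool:
    return total_latency_ms(stages) <= limit_ms


# Hypothetical per-stage figures in milliseconds (assumptions, not benchmarks).
EXAMPLE_BUDGET = {
    "network_and_turn_detection": 300,
    "stt_final_transcript": 250,
    "llm_first_token": 700,
    "tts_first_audio": 250,
}

if __name__ == "__main__":
    print("total ms:", total_latency_ms(EXAMPLE_BUDGET))
    print("largest stage:", largest_stage(EXAMPLE_BUDGET))
    print("within limit:", within_budget(EXAMPLE_BUDGET))
```

Summing stages assumes they run strictly one after another. Streaming overlaps them in practice, so treat the total as an upper bound for a given set of measurements.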

Can LiveKit handle real-time video AI agents?

Yes, natively. Unlike OpenAI Realtime API (voice-only) or Vapi/Retell (limited video), LiveKit’s agent runtime treats video as a first-class media stream. Video frames are sampled at 1-4 fps and passed to a vision-language model (GPT-4o vision, Claude 3.5 Sonnet, Gemini). Used in production for surveillance, healthcare, education, and live events.
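Sampling a camera track down to 1-4 fps amounts to picking a subset of incoming frames. The function below is a minimal sketch of one way to do that by frame index; `sample_frame_indices` is a hypothetical helper, not part of the LiveKit SDK, and a real agent would more likely throttle on frame timestamps.

```python
def sample_frame_indices(source_fps: float, target_fps: float, n_frames: int) -> list[int]:
    """Indices of source frames to forward to the vision-language model."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        return list(range(n_frames))  # nothing to drop

    step = source_fps / target_fps    # source frames per forwarded frame
    picked = []
    next_pick = 0.0
    for i in range(n_frames):
        if i >= next_pick:            # first frame at or after the next slot
            picked.append(i)
            next_pick += step
    return picked


if __name__ == "__main__":
    # Two seconds of a 30 fps track, reduced to 2 fps and to 4 fps.
    print(sample_frame_indices(30, 2, 60))
    print(sample_frame_indices(30, 4, 60))
```

For two seconds of a 30 fps track this forwards 4 frames at a 2 fps target and 8 frames at a 4 fps target, which is what keeps vision-model cost and latency bounded.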

How do I integrate Vapi or Retell with LiveKit?

You usually do not. Vapi and Retell are managed agent platforms with their own runtime; LiveKit is a framework for building your own agent. The pattern that does work: use LiveKit as the media transport layer and call Vapi or Retell as a function-call destination from inside a LiveKit agent, when you want to delegate one specific use case to a vendor.

Is LiveKit production-ready in 2026?

Yes. LiveKit runs in production at hundreds of customers, including agent platforms, video-call apps, and developer tools. Fora Soft has shipped 30+ AI agents on LiveKit. The framework gets monthly releases; production deployments pin to specific versions and upgrade quarterly.

How do I scale LiveKit agents to 10,000+ concurrent sessions?

Three tactics. (1) Self-host LiveKit on Kubernetes with autoscaling on the SFU and agent-worker pools. (2) Maintain a warm pool of 2-3 idle agent workers per region to absorb cold-start latency. (3) Cascade SFUs across regions for users in different geographies. LiveKit’s open-source SFU scales linearly with bandwidth at the SFU node and with agent worker count for the AI side.
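A rough sizing of the agent-worker pool follows from tactic (2). The sketch below is an assumption-heavy illustration: the sessions-per-worker capacity and the regional split are hypothetical inputs, and only the warm pool of 2-3 idle workers per region comes from the text.

```python
import math


def workers_needed(concurrent_sessions: int, sessions_per_worker: int, warm_idle: int = 2) -> int:
    """Workers for one region: enough for the load plus an idle warm pool."""
    if sessions_per_worker <= 0:
        raise ValueError("sessions_per_worker must be positive")
    busy = math.ceil(concurrent_sessions / sessions_per_worker)
    return busy + warm_idle


def plan_regions(load_by_region: dict[str, int], sessions_per_worker: int, warm_idle: int = 2) -> dict[str, int]:
    """Worker count per region for a given load split."""
    return {
        region: workers_needed(load, sessions_per_worker, warm_idle)
        for region, load in load_by_region.items()
    }


# Hypothetical inputs: 10,000 sessions across three regions, 25 sessions per worker.
EXAMPLE_LOAD = {"us-east": 5_000, "eu-west": 3_000, "ap-south": 2_000}

if __name__ == "__main__":
    plan = plan_regions(EXAMPLE_LOAD, sessions_per_worker=25)
    print(plan, "total:", sum(plan.values()))
```

The per-worker capacity is the number to measure first, because it depends on the plugins, the model, and whether the agent processes video.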

What stack do you recommend for a LiveKit voice agent?

Default 2026: LiveKit (Cloud or self-host) for media, Deepgram for streaming STT, GPT-4o for LLM, ElevenLabs Flash for TTS, Pinecone for vector memory, Postgres for structured memory, OpenTelemetry for observability, Kubernetes for deployment. Replace Deepgram with Whisper under compliance constraints. Replace ElevenLabs with Cartesia when first-token TTS latency below 200 ms is a hard requirement.

How do you handle HIPAA compliance with LiveKit?

Self-host LiveKit on BAA-able infrastructure (AWS, GCP, Azure all offer BAAs). Use a BAA-compliant LLM (OpenAI Enterprise, Azure OpenAI, AWS Bedrock). Use a BAA-compliant STT (Deepgram and AssemblyAI both offer HIPAA tiers). Use BAA-compliant TTS (ElevenLabs offers HIPAA on enterprise). Add full audit logging at the function-call layer, encrypted memory persistence, RBAC, and automatic session termination.

Have a specific LiveKit architecture question?

Engineer-to-engineer review on the first call.

If you are scoping a LiveKit AI agent and want a second opinion on cascaded-vs-S2S, the plugin matrix, the Cloud-vs-self-host threshold, or the EU AI Act compliance approach — write us. A senior engineer with shipped LiveKit agents in production replies within 24 hours.

+1 (914) 775-5855
New York · USA
Specialist software house for video, real-time and AI products. Founded 2005.
50 in-house engineers.