LiveKit for AI agents · 2026 guide

A practical guide to LiveKit for AI agents — production architecture for voice and video AI.

How LiveKit AI agents actually work end-to-end at production scale. The five-stage runtime that every voice and video agent runs. The AgentSession primitive that unified the v0.x abstractions. The plugin ecosystem (Deepgram, AssemblyAI, Whisper · GPT-4o, Claude, Gemini · ElevenLabs, Cartesia, Azure) and the latency contributions of each layer. The Cloud-vs-self-host economics that cross over at ~10,000 minutes per month. Written from the agents we have shipped — Translinguist (16+ language pairs), VOLO.live (22K participants at Black Hat 2025), MindBox (industrial vision + voice).

Check the plugin picker Learn how AI agents work with SIP

20+ years in real-time voice & audio · since 2005

|

625+projects total

|

Sub-500 ms latency target · 1.4–1.7 s production median

Industry recognition · 2019–2025

Top WebRTC Developer

2022
category leader

Best Custom A/V Dev.

2025
category winner

Clutch Global

Spring 2024
top performer

APAC Insider 2024

Innovation
& Excellence

Clutch 5.0 / 30 reviews

Spring 2024
top performer

Quick answer

LiveKit is two layers in one open-source platform. livekit-server is a WebRTC SFU (Apache 2.0, Go) — same category as mediasoup, Janus, Pion. livekit-agents is the agent framework that runs on top: an agent worker joins a LiveKit room as a participant, runs streaming STT, calls an LLM, and publishes streaming TTS back into the room.

LiveKit Agents went 1.0 in April 2025; as of April 2026 the Python framework sits at 1.5.x with adaptive interruption handling and native Model Context Protocol (MCP) tool support. LiveKit Cloud is the managed hosting option for both layers; self-hosted on Kubernetes is the alternative. A production LiveKit AI agent has five stages: session join, media capture and processing, AI reasoning loop (STT → LLM → tool-calls → TTS), response generation and output, and continuous context.

End-to-end latency in 2026: sub-500 ms is achievable with disciplined streaming at every stage, but the industry-published median sits at 1.4–1.7 seconds with P99 at 3–5 seconds. The biggest single lever is LLM time-to-first-token. The biggest architectural decision is Cascaded (STT → LLM → TTS) vs Speech-to-Speech (OpenAI Realtime, Gemini Live) — most production agents in 2026 default to Cascaded for tool-calling reliability and observability.

A custom LiveKit AI agent costs $5K–$80K to build over 1–4 months. LiveKit Cloud at production scale runs $0.05–$0.18 blended per minute (~40× cheaper than human BPO). Above 10,000 minutes per month the framework path undercuts managed Vapi / Retell by 60–80%; below 500 minutes per month managed wins. Above 100,000 minutes per month, self-hosted LiveKit on Kubernetes typically wins on TCO within six months.

Topics & USE CASES covered in this guide

Voice agents, video agents, multimodal — and the LiveKit primitives that hold each one up.

Four shapes of LiveKit agent dominate the 2026 landscape. Each gets a different runtime configuration, a different plugin stack, and a different deployment shape.

01 / Voice

Voice agents — sub-500 ms target

Cascaded STT → LLM → TTS pipeline. Streaming Deepgram + GPT-4o-mini + Cartesia is the 2026 default. End-to-end latency 450–950 ms achievable; production median 1.4–1.7 s. Common use cases: tier-1 support, appointment booking, outbound qualification.

DeepgramGPT-4o-miniCartesia

02 / Video

Video agents — vision-language on the loop

LiveKit’s agent runtime treats video as a first-class media stream. Frames sampled at 1–4 fps into GPT-4o vision, Claude 3.5 Sonnet, or Gemini 2.5. Production examples: skin / wound assessment, industrial site monitoring, instructional review.

GPT-4o visionClaude 3.5Gemini 2.5

03 / Multimodal

Multimodal agents — voice + video + screen + data

The full pattern: voice in, vision-language reasoning, screen-share understanding, tool-calls into business systems, structured response back as voice. End-to-end latency 2.0–4.5 s at production quality.

LiveKitVLMRAGMCP

04 / Phone

Phone agents — SIP-native since 2025

LiveKit SIP and Phone Numbers shipped GA in 2025. A LiveKit agent now answers any phone on Earth with ~4 lines of configuration. PSTN inbound, outbound INVITE, DTMF, SIP REFER all first-class.

LiveKit SIPPSTNDTMFSIP REFER

The 5-stage runtime · latency cascade

Where every millisecond of a LiveKit agent turn actually goes.

Stage 1 fires once per session. Stages 2–5 run per turn. The bar widths below are proportional to each stage’s latency contribution at a balanced production stack (Deepgram + GPT-4o-mini + Cartesia). Click any stage row to see vendor stack, tuning knobs, and the failure mode that bites you in week one.

Sub-500 ms achievabledisciplined streaming + co-located regions

1.4–1.7 s production medianindustry-published, millions of real calls

3–5 s P99poorly-instrumented fleets · engagement degrades

0 ms 500 ms 1000 ms 1500 ms 2000 ms

Per-turn critical path = stages 2 + 3 + 4 ≈ ~1.4 s typical Stage 5 runs in parallel — never blocks the next turn

Stage 03 · AI reasoning loop (600–1400 ms)

STT → LLM → tool calls → (streaming response). The biggest single lever.

Streaming STT finalizes the user’s transcript and emits commit. The LLM streams its response token-by-token; if the LLM emits a tool call, the runtime executes it (with 2-second timeout and structured fallback), feeds the result back into the LLM context, and continues. The caller hears a brief filler so the pause feels intentional.

Vendor stackSTT (Deepgram Nova-3, AssemblyAI, Whisper) · LLM (GPT-4o-mini, Claude Haiku, Gemini Flash) · MCP toolsets

Key knobsStreaming first-token under 300 ms is more valuable than stronger reasoning at 1 s · conversation summarization every 5 turns to avoid context blowout

Failure modeTool calls timing out silently — bounded retry + structured fallback response the LLM verbalizes naturally

The 5-stage runtime is what the SDK orchestrates for you. Your code is the prompt, the tools, and the agent class — not the plumbing. Tune the knobs per vertical: IVR replacement wants short endpointing; sales wants longer; healthcare wants longer still.

Reference architecture · the LiveKit platform stack

Four layers, one AgentSession.

LiveKit isn’t a generic 12-slot grid. It’s an opinionated platform with four named layers, each owning a clear responsibility. The AgentSession primitive ties them together. Click any layer to see the modules inside.

Agents layer · the runtime that ties everything together

3.1

AgentSession

The unified primitive. Wires STT, LLM, TTS, VAD, turn detection, tool calling, telemetry. Lifecycle hooks at every transition.

3.2

Worker + Job dispatch

Long-lived worker process registers with LiveKit server, accepts job dispatches, spawns one AgentSession per matching room.

3.3

Turn detection

Silero acoustic VAD + LiveKit’s ~135M SmolLM-v2 semantic detector + adaptive interruption ML (1.5+).

3.4

Tool calling

@function_tool decorators for local tools, MCPToolset for MCP-server tools. Same tool list, mixed sources.

3.5

Multi-agent handoff

session.update_agent() swaps system prompt + tool set mid-conversation. Session state survives.

3.6

userdata pattern

Typed dataclass per session. Caller phone, account ID, RAG context, conversation summary. Read by tools, injected into LLM prompt via Jinja2.

The same agent code runs on LiveKit Cloud and self-hosted Kubernetes — the four layers above are identical in both. The fork is operational, not architectural. The next embed walks through the AgentSession primitive in detail.

AgentSession deep-dive · the unified primitive

One class, every voice agent.

In 0.x you picked between VoicePipelineAgent and MultimodalAgent. In 1.0+ both collapse into AgentSession. Pick a member to see its signature, when it fires, and the production pattern around it.

class AgentSession

Constructor

Methods

Lifecycle events

State pattern

constructor

AgentSession(stt, llm, tts, vad, turn_detection)

The unified primitive. Declare STT, LLM, TTS, VAD, and a turn detector as pluggable components. The SDK wires up streaming, end-of-utterance detection, barge-in cancellation, tool-call orchestration, and OpenTelemetry hooks.

from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, cartesia, silero

session = AgentSession(
    stt=deepgram.STT(model="nova-3"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=cartesia.TTS(voice="sonic-3"),
    vad=silero.VAD.load(),
    turn_detection="multilingual",
)

Same AgentSession runs with llm=openai.realtime.RealtimeModel() to swap to speech-to-speech — no other code changes.

async method

await session.start(agent, room)

Binds the session to a specific Agent subclass and a LiveKit room. The room’s media + data channel are attached; the agent starts listening as soon as a participant publishes audio.

class SupportAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful support agent.",
        )

await session.start(agent=SupportAgent(), room=ctx.room)

async method

await session.say(text)

Direct TTS bypass. Skips the LLM and speaks a literal string. Useful for the AI-disclosure script at session start, the consent capture line, or a filler word during a slow tool call.

await session.say(
    "Hi, this is an AI assistant. "
    "This call may be recorded for quality."
)

method

session.update_agent(new_agent)

Multi-agent handoff. Swap the Agent class mid-conversation. The session state, conversation history, and observability context survive — only the system prompt and the tool set change. The 2026 pattern for triage-then-domain workflows.

# Triage agent classifies intent
# then hands off
session.update_agent(BillingAgent())

attribute

session.userdata: T

Per-session state container. Typed dataclass you define — caller phone, account ID, conversation summary, retrieved RAG context. Tools read and write it directly; the LLM reads it via Jinja2-style template injection into the system prompt.

@dataclass
class SessionState:
    caller_phone: str
    account_id: str | None = None
    summary: str = ""

session = AgentSession[SessionState](
    userdata=SessionState(caller_phone=phone),
    …
)

worker hook

def prewarm(proc)

Worker-level cold-start fix. Loads expensive resources (model weights, vector index, embeddings) once at worker startup so the per-session join is fast. Without prewarm, the first session on a worker takes 1.5–3 s extra.

def prewarm(proc):
    proc.userdata["vad"] = silero.VAD.load()
    proc.userdata["vector_index"] = load_pinecone_index()

# proc.userdata is shared across all sessions on that worker

Workers register, jobs dispatch, AgentSession spawns — one per room. A Python worker holds 5–20 concurrent sessions per CPU core depending on plugin choice. Scale horizontally with more worker pods.

Turn detection · what holds up under real callers

Three signals stacked — the only way barge-in stops feeling robotic.

LiveKit Agents 1.5 fuses an acoustic VAD, a local semantic turn detector, and an audio-based interruption ML model. Below is a 10-second conversation with each signal’s catches highlighted. Click a signal to see what it owns.

User audio · 10 s sample

0s2s4s6s8s10s

Signal 03 · Adaptive interruption handling (1.5+)

The flagship 1.5 feature — rejects 51% of false barge-ins.

Audio-based ML model trained specifically to distinguish genuine user interruptions from incidental sounds: backchannels (“mm-hmm,” “uh-huh”), coughs, sighs, throat-clearing, background noise (TV, dog, kids, traffic). Enabled by default, no configuration required.

Reject rate51% of VAD-based barge-ins rejected (false positives avoided)

Accept latencyFaster than VAD in 64% of true barge-in cases

Production dataInterruption-misfire complaint rate dropped ~60% in Fora Soft fleet after 1.5 upgrade

P95 response latencyDropped 150–300 ms via dynamic endpointing — pauses adapt to caller rhythm

The blue interruption at 8.5s is “wait actually” — the model yields immediately. The red mark at 7.2s is “mm-hmm” — the model keeps the agent talking. Silero alone would have stopped the agent on the backchannel.

Production tuning — three knobs

min_endpointing_delay200–600 ms · minimum silence before agent treats turn as complete

interrupt_speech_duration200–500 ms · how long user must speak post-barge-in before agent yields

turn_detection"vad" · "stt" · "multilingual" / "english_v2" (recommended)

Acoustic VAD answers “is sound there.” Semantic detection answers “is the thought finished.” Adaptive interruption answers “is this actually a real interruption.” Three signals, three questions, one fused decision.

Plugin matrix · pick a stack that fits

Filter the LiveKit plugin ecosystem by what your build actually needs.

There are 8+ STT, 12+ LLM, 10+ TTS providers shipping LiveKit plugins in 2026. Pick your latency budget, cost ceiling, and compliance requirement — the matrix dims everything that doesn’t fit and surfaces the recommended stack at the bottom.

STT

Streaming speech-to-text

7 plugins

Deepgram Nova-3

first token 100–200 ms · $0.0043/min

defaultHIPAA

AssemblyAI Universal

150–300 ms · 99 languages

HIPAAmultilingual

OpenAI Whisper Realtime

200–400 ms · same-vendor with GPT

HIPAA Ent

Azure Speech

200–400 ms · 100+ langs

HIPAAFedRAMP

Google Cloud STT

150–350 ms · 125 langs

HIPAA

Whisper self-host

400–1000 ms · air-gapped

self-managed

Speechmatics

200–400 ms · UK accent strength

HIPAA

LLM

Streaming reasoning + tool calling

8 plugins

GPT-4o-mini

first token 150–250 ms · 80–120 tps

defaulttool-call 95%

GPT-4o

200–400 ms · strongest reasoning

HIPAA Enttool-call 98%

Claude Haiku 3.5

150–300 ms · vision

HIPAA

Claude Sonnet 3.5

300–500 ms · strongest reasoning + vision

HIPAAtool-call 98%

Gemini 2.5 Flash

200–400 ms · cost-tuned scale

HIPAA

Gemini 2.5 Pro

300–600 ms · 2M context

HIPAAlong-ctx

Llama / Mistral self-host

100–800 ms · air-gapped, regulated

self-managed

OpenAI Realtime (S2S)

n/a (audio in/out) · lower tool reliability

HIPAA EntS2S

TTS

Streaming text-to-speech

7 plugins

Cartesia Sonic 3

TTFB 70–150 ms · fastest streaming

defaultHIPAA

ElevenLabs Flash v3

100–200 ms · most expressive

HIPAA Entpremium voice

OpenAI TTS

200–400 ms · same-vendor stack

HIPAA Ent

Azure Neural

150–300 ms · 400+ voices · cheapest at scale

HIPAAFedRAMP

Deepgram Aura

100–200 ms · single-vendor with Deepgram STT

HIPAA

Google Cloud TTS

200–350 ms · quality stable across langs

HIPAA

PlayHT

200–400 ms · voice cloning · 800+ voices

HIPAA Entcloning

Recommended stack for current filters

STTDeepgram Nova-3

LLMGPT-4o-mini

TTSCartesia Sonic 3

Default 2026 production stack — matches every filter combination. ~$0.083 per minute.

LiveKit Cloud also ships LiveKit Inference — managed STT/LLM/TTS co-located with the SFU. Single billing surface, lowest cross-region latency. For ~80% of greenfield builds in 2026, Inference is the right default before swapping individual plugins for BYOK.

Cascaded vs S2S · the architectural fork

Weight your priorities — the architecture follows.

Four priority sliders below. Move each one to reflect how much your build cares about that dimension. The three architecture options (Cascaded, Speech-to-Speech, Hybrid) score themselves against your weighting in real time. The winner glows.

Tool-call reliability 5

doesn’t mattercritical

PII redaction at stage boundaries 5

not regulatedHIPAA / PCI mandatory

Latency target 5

2–3 s oksub-300 ms

Naturalness / prosody 5

robotic okindistinguishable

Cascaded

STT → LLM → TTS

70

Three discrete stages, each instrumentable, each swappable. Best for tool-heavy workflows + regulated PII paths. Sub-1.5 s achievable on streaming providers.

tool-call ~98% stage-boundary redaction ~1.4–1.7 s median

S2S

OpenAI Realtime / Gemini Live

62

Single multimodal model swallows audio in, emits audio out. Sub-300 ms achievable. Tool calling 75–88% on multi-arg tools — gap closing through 2026.

sub-300 ms natural prosody no stage redaction

Hybrid

Cascaded hot-path + S2S rapport

70

2026’s winning pattern. Cascaded for tool-heavy turns (auth, lookup, write). S2S for the human-feel sections (small talk, escalation). Same AgentSession; swap on the fly.

balanced 2026 default for >10K min/mo most engineering

Verdict for current weights Cascaded and Hybrid tied — rerun with stronger weights to break the tie.

The same AgentSession code runs cascaded or speech-to-speech — llm=openai.LLM() vs llm=openai.realtime.RealtimeModel() is a one-line swap. The decision is operational + product, not architectural.

MCP & tool calling · letting agents actually do work

Three patterns for tool calling in production.

A voice chatbot becomes a voice agent the moment it can take action mid-conversation. LiveKit Agents 1.5 ships local function tools + first-class MCP support. Below: the three production patterns and the tool-call reliability gradient by LLM.

Pattern 01 · local function tools

Decorate, return, done.

The simplest pattern. The function runs in your agent process — no network hop, no schema translation overhead. Best for read-only lookups your agent owns end to end.

from livekit.agents import function_tool, RunContext

@function_tool
async def get_order_status(
    ctx: RunContext,
    order_number: str,
) -> dict:
    """Look up order status by order number."""
    return await crm.orders.fetch(order_number)

Pros: Lowest latency · type safety · debugger steps right into it
Cons: Tightly coupled to the agent codebase · every tool change requires redeploy

Pattern 02 · MCP toolsets

Point at a server, get a tool catalog.

Anthropic’s Model Context Protocol won the agent-to-tool standard race. By 2026, every major SaaS (Stripe, Salesforce, HubSpot, Zendesk, GitHub, Notion, Linear, AWS, Microsoft Graph, Google Workspace) ships an MCP server.

from livekit.agents.mcp import MCPToolset

mcp_toolset = MCPToolset(
    url="https://your-mcp-server.example.com/mcp",
    headers={"Authorization": f"Bearer {api_key}"},
)

agent = Agent(
    instructions="You are a customer support agent.",
    tools=[mcp_toolset],
)

Pros: CRM team owns the CRM MCP server · tool catalog changes without redeploying the agent · pre-built servers for major SaaS
Cons: Network hop on every tool call · auth + observability cross a process boundary

Pattern 03 · hybrid (the 2026 default)

Local for hot-path, MCP for everything else.

Real production agents end up here. Latency-critical tools (lookup the caller’s account, fetch session state) stay local. Cross-team integrations (charge a card via Stripe, file a Zendesk ticket, query Salesforce) go via MCP. Same tool list, mixed sources.

agent = Agent(
    instructions="You are a customer support agent.",
    tools=[
        get_order_status,        # local function tool
        book_callback,           # local function tool
        crm_mcp_toolset,         # MCP server (CRM team owns)
        billing_mcp_toolset,     # MCP server (Finance team owns)
    ],
)

Read cheap, write expensive. Let the agent look up anything; put writes behind “yes, go ahead.”

2-second timeout + structured fallback. Every tool wrapper. “System unavailable” beats silence.

Log every tool call. EU AI Act (Article 50, August 2026) makes this non-optional.

Tool-call reliability by LLM · Fora Soft fleet, 30+ shipped agents

Schema correctness on first attempt · multi-arg tools

GPT-4o (text)

98%

Claude 3.5 Sonnet

98%

GPT-4o-mini

95%

Gemini 2.5 Flash

93%

OpenAI Realtime S2S

88%

Takeaway: cascaded GPT-4o or Claude Sonnet for any agent where tool-call reliability is critical — regulated workflows, write actions, multi-tool orchestration. S2S is closing the gap fast but still trails on multi-arg tools through early 2026.

2026 MCP catalog: Stripe, Salesforce, HubSpot, Zendesk, Intercom, Notion, Linear, GitHub, Google Calendar, Microsoft Graph, Slack, AWS, Postgres, Snowflake, Twilio — plus the internal MCP servers your own teams ship.

SIP & telephony · putting the agent on a phone number

A phone call is just another LiveKit participant.

LiveKit SIP and LiveKit Phone Numbers went GA in 2025. A voice agent now accepts a call from any phone with ~4 lines of config. Inbound, outbound, DTMF, SIP REFER, attended transfer — all first-class.

→ → → →

Inbound · step 03

LiveKit SIP matches a dispatch rule.

Dispatch rules map inbound calls to LiveKit rooms. The SIP service authenticates the trunk, matches the rule, and creates a SIP participant in a new (or existing) room. Workers watching that room get the dispatch.

from livekit.protocol.sip import (
    CreateSIPDispatchRuleRequest,
    SIPDispatchRuleIndividual,
)

dispatch_rule = sip_service.create_sip_dispatch_rule(
    CreateSIPDispatchRuleRequest(
        rule=SIPDispatchRuleIndividual(room_prefix="call-"),
        name="inbound-support",
        trunk_ids=[trunk.sip_trunk_id],
    )
)

Outbound · step 02

LiveKit SIP creates an outbound SIP participant.

One API call. Your server picks a SIP trunk, the destination phone number, and the LiveKit room. LiveKit returns an INVITE handle you can poll for state.

from livekit.protocol.sip import (
    CreateSIPParticipantRequest,
)

participant = sip_service.create_sip_participant(
    CreateSIPParticipantRequest(
        sip_trunk_id=trunk.sip_trunk_id,
        sip_call_to="+14155551234",
        room_name="outbound-followup",
        participant_identity="customer-1234",
    )
)

Telephony primitives shipped with LiveKit SIP

DTMFTouch-tones in (ID verify, IVR navigation) and out (transferring through downstream IVRs).

SIP REFERCold transfer to another phone number. LiveKit drops out; the human takes over.

Attended transferWarm transfer — agent dials the human, briefs them, then bridges the customer.

Hold & resumeMute agent audio while it queries a tool. Music-on-hold or silence. Resume cleanly.

Pricing: LiveKit Phone Numbers is competitive per-minute with Twilio but loses on per-number monthly fees under ~500 minutes/month/number. Above that, the native option simplifies the operational surface.

Cloud vs self-host · the deployment fork

Same code. Different infrastructure. Plan for the switch.

Every LiveKit project hits this fork — usually around minute 60 of the architecture call. The right answer changes by 6–12 months as the product scales. Design for reversibility from day one. The same AgentSession code runs identically on both.

Start here · What does your deployment need?

The managed option — default for the first 12–18 months of most products.

LiveKit Cloud is the managed SFU + agent dispatch + Inference platform. Free tier through Build / Ship ($50/mo) / Scale ($500/mo) / Enterprise. Autoscaling worker pools, multi-region, observability built in.

Wins whenSessions < 5K concurrent · light compliance · no K8s capacity · ship in 4–8 weeks

Pricing model$0.01/agent-active-minute + SFU participant minutes + Inference credits + LiveKit Phone Numbers

50-agent ops example50 agents × 8 hr/day ≈ $5K–$15K/month all-in (Cloud + Deepgram + GPT-4o-mini + Cartesia)

ScalingAutomatic. Worker pool grows on dispatch queue depth.

Loses when

Sessions clear 5K concurrent
HIPAA + BAA chain required across every plugin
GDPR EU-only data residency
FedRAMP / sovereign-cloud requirements
Custom server-side gating LiveKit Cloud doesn’t expose

TCO crossover · cost as a function of monthly minutes

Below 10K min/moCloud wins on TCO once engineering time is counted.

10K–100K min/moRoughly comparable. Pick by compliance + engineering capacity, not by cost.

Above 100K min/moSelf-host typically wins within 6–9 months. Disclosed savings: 30–50% per call.

The agent code does not change between Cloud and self-host. The infrastructure layer does. Keep the integration layer portable from day one, and the decision becomes operational, not architectural.

Observability & SRE · what to measure, what to page on

Seven metrics. Four alerts. Everything else is dashboards.

A voice agent failure is subtle: a 3-second LLM TTFT, a tool that returned the wrong row, a TTS that mispronounced the caller’s name. None throw exceptions. Per-turn telemetry + nightly evals are the only path to operational confidence.

LiveKit fleet · live

Prometheus :8081 · OTLP · refresh 30 s

Metric 01 · LLM TTFT (time-to-first-token)

The latency budget’s biggest single line item.

Time from end-of-user-utterance to first LLM token. Anything above 1 s breaks the conversational feel. Usually the dominant component of P95 end-to-end latency.

alert · LLM_TTFT_p95_high
  if P95(llm_ttft_ms) > 1200 for 5m
  group_by: model, region
  page: senior on-call
  remediation: check provider status, rotate to fallback model

Common causes: model-provider degradation, context bloat (forgot to summarize long sessions), cross-region routing.

Metric 02 · STT word error rate (sampled)

The signal that catches accent-mismatch + provider regressions.

Sampled WER measured against a labeled subset of production audio. Above 5% means transcription quality is degrading the LLM’s reasoning. Usually a model regression or accent mismatch.

alert · STT_WER_high
  if sampled_wer > 0.08 for 30m
  group_by: stt_provider, language
  page: ML on-call
  remediation: rotate provider or switch model variant

Metric 03 · TTS time-to-first-audio-byte

What users feel as “agent is responsive.”

Time from first LLM token to first TTS audio byte played back. Below 300 ms is the bar — above 500 ms callers start feeling the lag.

alert · TTS_TTFB_p95_high
  if P95(tts_ttfb_ms) > 500 for 5m
  group_by: tts_provider, voice
  page: on-call
  remediation: rotate TTS provider, check audio buffer size

Metric 04 · EOU (end-of-utterance) delay

Turn detection’s direct latency tax.

Time from the user actually finishing their thought to the agent recognizing the turn is complete. Above 700 ms median means turn detection is misconfigured for the caller distribution.

alert · EOU_delay_high
  if median(eou_delay_ms) > 700 for 30m
  group_by: turn_detector
  warn: on-call
  remediation: re-tune endpointing per vertical

Currently at 520 ms median — trending up. Likely a vertical-mix shift in caller distribution. Re-tune min_endpointing_delay.

Metric 05 · Interruption rate

The user-experience canary.

Percentage of agent turns interrupted by the user. Above 25% means the adaptive interruption model is miscalibrated for a new caller distribution — or the agent is being too verbose.

alert · interrupt_rate_high
  if interrupt_rate > 0.25 over 1h
  group_by: agent_class
  page: on-call
  remediation: review prompt verbosity + re-run 1.5+ ML tuning

Metric 06 · Tool-call success rate

The agent’s ability to actually take action.

Tool calls that resolved successfully (not timed out, not malformed, not fallen back). Below 90% means a downstream service is misbehaving or the LLM is generating bad tool schemas.

alert · tool_call_success_low
  if tool_call_success < 0.90 over 15m
  group_by: tool_name, mcp_server
  page: on-call
  remediation: check downstream MCP / API health

Metric 07 · P95 end-to-end latency

The integrative metric — what callers feel.

End of user utterance to first audio byte of agent reply. Combines STT, LLM TTFT, tool calls, TTS TTFB. Above 5 s P99 means conversation pacing is broken.

alert · e2e_latency_high
  if P99(e2e_latency_ms) > 5000 for 5m
  group_by: region
  page: senior on-call
  remediation: trace bisect · STT / LLM / tool / TTS

The four alerts that actually pay rent

01P95 LLM TTFT > 1.2 s sustained 5 minutesmodel provider issue or context bloat

02Tool-call success < 90% over 15 minutesfunction-call regression or downstream outage

03Interruption rate > 25% over 1 houradaptive interruption miscalibrated for new caller mix

04Worker capacity > 80% on any region 5+ minutescapacity exhaustion — autoscale lagging

Most LiveKit fleets drown in noise. These four pay rent — everything else is a dashboard you check at standup, not a page.

LiveKit Agents ships OpenTelemetry traces + Prometheus metrics on port 8081 + lifecycle hooks at every transition. Wire them to Grafana / Datadog / Honeycomb, alert on percentiles, never on means. Build the nightly eval harness in week 2 — not month 6.

Cost calculator · what production actually costs

Per-minute math, payback in days, vs human BPO.

Pick the stack tier, drag the volume slider, toggle telephony, choose the deployment mode. Monthly cost, per-call cost, and payback period update live. Numbers are 2026-indicative; real costs come in 20–30% higher with retries and ops overhead.

Stack tier

Balanced: Deepgram Nova-3 + GPT-4o-mini + Cartesia. Default 2026 production stack.

Monthly minutes 12,000

1K10K100K500K

Telephony (PSTN)

Adds LiveKit Phone Numbers or Twilio trunk on top of agent costs.

Deployment

LiveKit Cloud: managed. Below 5K concurrent, no compliance hard-stops, no Kubernetes capacity.

Monthly cost $996 at $0.083/min × 12,000 min

Per-call cost $0.25 assuming 3-min average call

Build cost (MVP) $30K 1–2 months · production-ready MVP

Saved vs $9/call BPO $35,004 / mo 4,000 calls/mo replaced · 40% containment assumption

Build payback 26 days monthly savings rate × build cost

Reality check: first production deployment typically comes in 20–30% higher than the table above. Real calls include retries, unused tokens, ops overhead, and the occasional vendor-pricing surprise. Containment rates of 40–70% on tier-1 support cap the realized savings; even at 40% containment, payback is usually under 10 weeks.

Framework decision tool · 4 questions, 6 frameworks

Answer four questions — the framework ranking updates live.

Pick the option that matches your build for each of the four dimensions below. The six frameworks (LiveKit, OpenAI Realtime API, Vapi, Retell, Pipecat, Bland) score themselves against your answers in real time. The winner glows.

01

Monthly call volume

02

Customization depth

03

Time to first call

04

What matters most

Framework fit · ranked by current answers

1 LiveKit Agents

12

2 OpenAI Realtime API

11

3 Vapi

14

4 Retell AI

10

5 Pipecat

12

6 Bland AI

13

Verdict Adjust your answers to see how the ranking shifts.

We’ll tell you on the call if your build doesn’t need us. We’ve walked away from projects that did better on Vapi or stayed on direct OpenAI Realtime — both legit answers for the right shape. The tool is a starting point; the architecture call decides.

Production patterns · what holds up under real traffic

Six patterns every shipped LiveKit agent needs.

Across 30+ shipped agents, these six patterns recur in every production build. Skip any of them and the agent regresses inside a month. Click any pattern to see the implementation + the failure mode it prevents.

01

Warm worker pools

Cold-start agent workers add 1.5–3 seconds before the first turn. Keep 2–3 pre-warmed workers per region; autoscale on dispatch queue depth, not after the queue saturates.

Implementation

LiveKit Cloud: handled automatically — scale up before queue saturation
Self-host on K8s: HPA on active-session metric, minReplicas set to peak/3
prewarm function in worker code: load VAD model + vector index at process startup

Without it

P99 join latency spikes 2–4 s on traffic ramps. Customers hear silence after dialing. Marketing pushes look like outages.

02

Tool timeouts and graceful fallbacks

Every tool wraps with a 2-second timeout and a structured fallback response the LLM can verbalize naturally. “Trouble reaching that system…” beats silence or hallucinated tool output.

Implementation

@function_tool
async def get_account(ctx, account_id: str) -> dict:
    try:
        return await asyncio.wait_for(
            crm.fetch(account_id), timeout=2.0
        )
    except (TimeoutError, ServiceUnavailable):
        return {"status": "unavailable",
                "fallback_message":
                "I'm having trouble reaching the account "
                "system. Would you prefer I take a message?"}

Without it

Tool hangs → LLM waits → user hears dead air → user hangs up. Or worse: LLM hallucinates a plausible-sounding response based on tool history.

03

PII redaction at stage boundaries

In regulated workflows, redact PII between stages. Drop credit-card digits from the STT transcript before the LLM sees them. Drop diagnosis details from the LLM response before TTS reads them.

Implementation

Regex-mask card numbers in the STT->LLM hook (on_user_speech_committed)
PCI: route card-entry digits via DTMF straight to the payment processor (LLM never sees them)
HIPAA: configurable redaction layer at the function-call boundary, persisted in audit log redacted
Both directions: anything LLM speaks back becomes recorded audio

Without it

PCI DSS violation on the first recorded card number. HIPAA breach on the first transcript leak. The audit finds it before you do.

04

Conversation summarization for long sessions

Calls longer than ~10 minutes blow LLM context limits. Every 5 turns, summarize prior conversation into a compact passage; replace older turns with the summary in the prompt. Keeps context bounded; preserves continuity.

Implementation

Trigger on turn count (every 5 turns) or token count (every 6K tokens)
Summarization model: same LLM, lightweight prompt, < 200 tokens output
Persist summary in session.userdata for replay / audit
Inject summary in system prompt via Jinja2 template

Without it

P95 LLM TTFT walks up linearly with call duration. By minute 12 the agent is at 3+ second response times. By minute 20 it crashes the context window outright.

05

Persona consistency via prompt versioning

LLM responses drift across model versions and prompt edits. Version your system prompt. Tag every turn with the prompt version. Run the eval harness on every prompt change before shipping.

Implementation

System prompts in version control, semver tagged
Prompt version emitted as an OTel span attribute on every turn
Eval harness blocks prompt-version bumps that drop persona score below threshold
Multi-agent stacks: separate prompt versions per Agent class

Without it

You changed the prompt 3 weeks ago. Customer complains today. You have no way to know which version generated the bad response. Or which version to roll back to.

06

Eval-driven model upgrades

When OpenAI ships GPT-5o or Anthropic ships Claude 4 Opus, do not just swap in production. Run the eval harness in staging; compare pass rate, latency, cost; ship only if the new model wins on all three.

Implementation

Eval harness: 100–500 canned scenarios with rubric grading
Rubric covers: tool-call correctness, factual accuracy, persona, refusal behavior, PII redaction, latency budget
Blue-green deploy with explicit rollback path
Block model bumps below threshold — even if the vendor changelog says “25% faster”

Without it

The new model is 25% faster but 6% worse at tool calling. Production-incident MTTR is 4–72 hours depending on how long the regression takes to surface in complaints. Rollback isn’t free either — user-visible.

Plus a 7th meta-pattern: instrument before launch. Per-turn telemetry + nightly eval are the only path to operational confidence on a system this complex.

Compliance · five regulatory frames every voice agent plans around

Voice agents touch more regulation than text agents.

Live audio is typically classified as biometric data. Conversations get recorded. Tool calls take actions on user systems. Five regulatory frames every production LiveKit deployment plans around. Click any zone for scope + the Fora Soft production pattern.

EU AI Act · 2 August 2026

General-purpose AI obligations land in 90 days.

If the agent serves EU users, three things become non-optional. Engineering effort to comply: 4–8 engineer-weeks for a standard-risk use case. Plan for it before the deadline.

Article 50 disclosure“You are speaking with an AI assistant.” Logged and consent-captured. Spoken in the first 5 seconds.

Logged interaction recordStructured logs with timestamps, model versions, tool calls, agent decisions. Retained per Article 50.

Foundation-model complianceDocumentation that the underlying model is compliant (vendor model card + EU AI Act conformance documentation)

Engineering cost4–8 weeks for standard-risk · longer for high-risk classifications (healthcare triage, employment, credit)

What breaks if missingInability to legally serve EU users post-August 2026. Fines up to 7% of global annual turnover.

Two-party recording-consent states (US)

California Florida Illinois Maryland Massachusetts Montana Nevada New Hampshire Pennsylvania Washington

Agent opening: “This call may be recorded for quality and AI training purposes. Press 1 or say ‘yes’ to continue, or stay on the line to opt out.”

Documentation as the audit primitive

AI disclosure script (verbatim, per language)
Recording retention policy (per jurisdiction)
Tool catalog the agent is allowed to call
Escalation rule for unhappy callers, suicidality, medical emergencies
Data residency map (which user audio lives where)

Document-then-enforce, not the other way around. That document is the regulator-ready audit trail.

Compliance is engineering, not a checklist. Build the BAA chain, the audit log, the consent capture, the disclosure script before the first customer call — then operate within them.

2026 trends · what is changing in LiveKit AI agents

Six forces reshaping the LiveKit agent stack through 2026.

Each trend lands at a different point on the quarterly horizon. Click any bar to see what changes, what to plan for, and which architectural decisions it forces. The chart reads left-to-right: today on the left, late 2026 on the right.

Q1 2026now Q2 2026 Q3 2026EU AI Act Q4 2026

Rising / maturing Becoming default Already standard Hard deadline

Trend 01 · rising

Speech-to-speech maturity for tool-calling workflows.

OpenAI Realtime (gpt-realtime) and Gemini Live tool-calling reliability is approaching cascaded-pipeline parity. Through 2025, S2S models lagged GPT-4o text on multi-tool function-calling; through 2026, the gap is closing.

Plan for:

S2S becomes the default for new voice agents below moderate compliance scope by late 2026
Cascaded stays the default where PII redaction at stage boundaries is mandatory (PCI, HIPAA)
Design AgentSession to swap llm between cascaded LLM and RealtimeModel via env config

The meta-pattern across all six: the LiveKit stack is consolidating around opinionated defaults (Inference, adaptive interruption, MCP, multi-agent). Where there used to be choice, there’s now a recommended path. Where there used to be optional, there’s now regulation.

Production examples

Four shipped AI agents across four very different shapes.

An AI sales-coach analyzing customer behavior in real time. An AI stylist running on-device on iPhone. Real-time speech-to-speech translation across 75+ languages. AI-powered video management at 99.5%+ facial-recognition accuracy. Four production builds running today — each a different shape of AI agent on top of real-time media.

Meetric · AI sales conversation intelligence

Real-time customer behavior + AI speech coaching

Swedish AI sales video platform that analyzes customer behavior during live presentations, identifies speech strengths and weaknesses, and generates post-meeting reports. Integrates with Google Meet, MS Teams, Zoom. Captures conversations across video, phone, email, chat. Outcomes: up to 25% lift in close rates, 30× coaching efficiency, 80–100% CRM data-entry automation. SEK 21M (~$2.25M) seed in 2025.

AI Wardrobe App · on-device AI stylist

YOLOv8m + CLIP recognition, fully on-device

Personalized AI stylist on iOS in the $4.3B AI-wardrobe market (projected $13.5B by 2033). Custom-trained YOLOv8m for garment object detection plus CLIP for semantic understanding — all running on-device via TensorFlowLite. Recognizes garment type, season, color, fabric, sleeve length, neckline. Generates outfit suggestions based on wardrobe + weather + occasion. No round-trip to the cloud.

TransLinguist · AI speech-to-speech translation

75+ languages · NHS UK national framework

Video conferencing platform for professional interpreters — marketplace of 30,000+ certified interpreters, estimated $4.2M annual revenue. AI speech-to-speech translation in 16+ languages with closed captioning in 22 languages. Won the NHS (NOE CPC) national framework across NHS, councils, schools, police, fire/rescue. Clients report 50% cost savings, 80% reduction in interpreting costs, 2× ROI in two years.

MindBox · AI video management system

99.5%+ facial recognition · 500K+ vehicles/day ANPR

AI-powered intelligent video management system — 50+ deployments across transport, pharma, and gated communities since 2020. Facial recognition at 99.5%+ accuracy beats Google + Facebook benchmarks, with anti-spoofing resistant to photo/video attacks. ANPR module captures license plates of 500,000+ vehicles daily across India at ~95% accuracy. Real-time anomaly alerts trigger automatic recording on intrusion, fire, or crowd buildup.

Decision framework

Build vs Buy vs Hybrid — when each one wins.

Three architectural paths for shipping a real-time AI voice or video agent. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether the platform is the product or supports the product.

Build on LiveKit

When you need control SDK won’t give you

Wins when: real-time video on the agent, LLM provider flexibility, self-host option for compliance, usage above 10K minutes/month, custom turn-taking / voice cloning / RAG, brand-embedded experience, multi-tenant SaaS plays.

Cost shape: $5K–$80K build over 1–4 months. $1K–$5K monthly operations. LiveKit Cloud $0.05–$0.18 per minute (or self-host above 100K min/mo for 30–50% savings). Archetypes: Mindwibe, Translinguist, VOLO.live, MindBox.

Buy managed

When speed matters more than customization

Wins when: under 10K minutes/month, standard call patterns, no compliance pressure, no in-house engineering capacity. Vapi for developer-first teams (~2 hours to first call). Retell for visual-builder non-technical teams. Bland for outbound-heavy TCPA workflows.

Cost shape: $0.05–$0.13 per minute (Vapi list) or $0.23–$0.50 BYOK. No upfront build cost. Vendor lock-in.

Hybrid

Managed pilot, framework scale-up

Wins when: validating voice UX before committing to a framework, running staging on Vapi / Retell while building production on LiveKit, or half the workload fits managed (the voicemail leg) while the other half needs LiveKit (the live agent leg).

Pattern: start on Vapi / Retell for 4–8 weeks to validate; build the LiveKit production version once usage clears 10K minutes/month; cut over with a portability layer so prompt and tools transfer.

Cost ranges are 2026-indicative. Implementation specifics — concurrency target, compliance scope, multimodal vs voice-only, tool-call surface, multi-region cascade — dominate the spread within each tier.

A custom LiveKit AI agent costs $5K–$80K to build over 1–4 months. The break-even point vs managed Vapi / Retell sits around 10,000 minutes per month — below, managed ships faster; above, LiveKit undercuts on per-minute economics by 60–80%.

FAQ

Twelve questions every LiveKit architecture review covers.

What is LiveKit and what are LiveKit agents?

LiveKit is an open-source WebRTC SFU paired with an agent framework (livekit-agents) for building real-time voice and video AI agents. The SFU handles media transport. The agent framework runs as a participant in the room, processes audio and video through STT, calls an LLM, and publishes responses via TTS. LiveKit Cloud is the managed hosting; self-host on Kubernetes is the alternative.

How do LiveKit AI agents work?

Five stages: (1) the agent worker joins the LiveKit room, (2) it captures audio (chunked at 20-40 ms) and video (1-4 fps) from participants, (3) it runs streaming STT, calls the LLM, and orchestrates function-calls, (4) it generates a response via streaming TTS and publishes back into the room, (5) it writes session state to memory. End-to-end latency is typically 1.2-2.5 seconds for voice-only, 1.8-4.5 seconds for multimodal.

LiveKit vs OpenAI Realtime API — which should I use?

OpenAI Realtime API is voice-only, GPT-4o-locked, and simpler to set up (2-4 weeks). LiveKit is multimodal-capable, LLM-provider-flexible, and can self-host (4-8 weeks setup). Use OpenAI direct for a single voice agent on a strong in-house team. Use LiveKit when you need real-time video, custom turn-taking, multiple LLM providers, or self-host for compliance.

Should I use LiveKit Cloud or self-host?

LiveKit Cloud below 5,000 concurrent users with no compliance constraints. Self-host above 5,000 users, or with HIPAA / SOC2 / GDPR / FedRAMP, or when you need custom server-side routing. Self-host wins on cost above ~$5K/month of Cloud spend. Design the integration layer to be portable so the agent code does not change between Cloud and self-host.

How much does it cost to build a LiveKit AI agent?

A custom LiveKit AI agent costs $30K-$200K to build over 8-20 weeks. A 2-week Architecture Sprint at $15K-$25K produces a fixed-bid quote for the build. LiveKit Cloud usage at production scale runs $0.40-$2.00 per agent-hour plus per-participant-minute. Ongoing operations run $8K-$25K per month.

What is the typical end-to-end latency for a LiveKit voice agent?

1.2-2.5 seconds median (P50) for a voice-only agent with streaming STT (Deepgram), GPT-4o, and ElevenLabs Flash for TTS. 1.8-4.5 seconds for a multimodal agent that adds vision-language model processing. Above 3.5 seconds, user engagement materially degrades. The largest single latency component is usually the LLM first-token latency (400-1,000 ms).

Can LiveKit handle real-time video AI agents?

Yes, natively. Unlike OpenAI Realtime API (voice-only) or Vapi/Retell (limited video), LiveKit’s agent runtime treats video as a first-class media stream. Video frames sample at 1-4 fps into a vision-language model (GPT-4o vision, Claude 3.5 Sonnet, Gemini). Used in production for surveillance, healthcare, education, and live events.

How do I integrate Vapi or Retell with LiveKit?

You usually do not. Vapi and Retell are managed agent platforms with their own runtime; LiveKit is a framework for building your own agent. The pattern that does work: use LiveKit as the media transport layer and call Vapi or Retell as a function-call destination from inside a LiveKit agent, when you want to delegate one specific use case to a vendor.

Is LiveKit production-ready in 2026?

Yes. LiveKit is shipped in production by hundreds of customers, including agent platforms, video-call apps, and developer tools. Fora Soft has shipped 30+ AI agents on LiveKit. The framework gets monthly releases; production deployments pin to specific versions and upgrade quarterly.

How do I scale LiveKit agents to 10,000+ concurrent sessions?

Three tactics. (1) Self-host LiveKit on Kubernetes with autoscaling on the SFU and agent-worker pools. (2) Maintain a warm pool of 2-3 idle agent workers per region to absorb cold-start latency. (3) Cascade SFUs across regions for users in different geographies. LiveKit’s open-source SFU scales linearly with bandwidth at the SFU node and with agent worker count for the AI side.

What stack do you recommend for a LiveKit voice agent?

Default 2026: LiveKit (Cloud or self-host) for media, Deepgram for streaming STT, GPT-4o for LLM, ElevenLabs Flash for TTS, Pinecone for vector memory, Postgres for structured memory, OpenTelemetry for observability, Kubernetes for deployment. Swap Whisper for Deepgram on compliance constraints. Swap Cartesia for ElevenLabs when first-token TTS latency below 200 ms is a hard requirement.

How do you handle HIPAA compliance with LiveKit?

Self-host LiveKit on BAA-able infrastructure (AWS, GCP, Azure all offer BAAs). Use a BAA-compliant LLM (OpenAI Enterprise, Azure OpenAI, AWS Bedrock). Use a BAA-compliant STT (Deepgram and AssemblyAI both offer HIPAA tiers). Use BAA-compliant TTS (ElevenLabs offers HIPAA on enterprise). Add full audit logging at the function-call layer, encrypted memory persistence, RBAC, and automatic session termination.

Where this guide goes deeper

Connected guides and references.

Each piece below extends one slice of this pillar — the WebRTC transport layer, the multimodal cross-cluster, the speech-translation specialization, the commercial path to commissioning a build, or the Fora Soft blog deep-dives.

Transport

WebRTC architecture for production systems

Read the guide →

Multimodal

Multimodal agentic AI for real-time systems

Read the multimodal guide →

Translation

Real-time speech translation architecture

Read the translation guide →

AI agents

AI video agent development guide

Read the AI agent guide →

LiveKit services

LiveKit AI agent development services

See the services page →

Blog deep-dive

LiveKit AI Voice Agents — the 2026 Playbook

Read the playbook →

Have a specific LiveKit architecture question?

Engineer-to-engineer review on the first call.

If you are scoping a LiveKit AI agent and want a second opinion on cascaded-vs-S2S, the plugin matrix, the Cloud-vs-self-host threshold, or the EU AI Act compliance approach — write us. A senior engineer with shipped LiveKit agents in production replies within 24 hours.

Book a discovery call WhatsApp the team Email eager2develop@forasoft.com

A practical guide to LiveKit for AI agents — production architecture for voice and video AI.

Voice agents, video agents, multimodal — and the LiveKit primitives that hold each one up.

Voice agents — sub-500 ms target

Video agents — vision-language on the loop

Multimodal agents — voice + video + screen + data

Phone agents — SIP-native since 2025

Where every millisecond of a LiveKit agent turn actually goes.

Worker accepts a job, joins the room, subscribes to tracks.

Audio chunked at 20–40 ms, video sampled at 1–4 fps.

STT → LLM → tool calls → (streaming response). The biggest single lever.

Streaming TTS, optional avatar, optional barge-in cancel.

Memory writes + observability fire after the turn, never on it.

Four layers, one AgentSession.

STT plugins

LLM plugins

TTS plugins

LiveKit Inference

Custom plugin path

AgentSession

Worker + Job dispatch

Turn detection

Tool calling

Multi-agent handoff

userdata pattern

Rooms + participants

Tracks

Data channel

Server SDKs

WebRTC SFU

LiveKit SIP

Egress

Ingress

One class, every voice agent.

AgentSession(stt, llm, tts, vad, turn_detection)

await session.start(agent, room)

await session.say(text)

await session.generate_reply(user_input)

session.update_agent(new_agent)

session.interrupt()

@session.on(“user_started_speaking”)

@session.on(“user_speech_committed”)

@session.on(“agent_started_speaking”)

@session.on(“function_calls_finished”)

session.userdata: T

def prewarm(proc)

Three signals stacked — the only way barge-in stops feeling robotic.

Acoustic VAD — the fastest, dumbest signal.

LiveKit’s SmolLM-v2 fine-tune holds the turn for unfinished sentences.

The flagship 1.5 feature — rejects 51% of false barge-ins.

Filter the LiveKit plugin ecosystem by what your build actually needs.

Streaming speech-to-text

Streaming reasoning + tool calling

Streaming text-to-speech

Weight your priorities — the architecture follows.

STT → LLM → TTS

OpenAI Realtime / Gemini Live

Cascaded hot-path + S2S rapport

Three patterns for tool calling in production.

Local function tools

MCP toolsets

Hybrid (the 2026 default)

Decorate, return, done.

Point at a server, get a tool catalog.

Local for hot-path, MCP for everything else.

A phone call is just another LiveKit participant.

Caller dials a phone number.

SIP provider sends INVITE to LiveKit’s SIP endpoint.

LiveKit SIP matches a dispatch rule.

Agent worker spawns an AgentSession for the room.

Conversation runs as a normal LiveKit session.

Your server initiates the outbound call.

LiveKit SIP creates an outbound SIP participant.

SIP trunk sends INVITE to the callee’s carrier.

Callee picks up — SIP participant joins room.

Agent opens with the consent script.

Same code. Different infrastructure. Plan for the switch.

LiveKit Cloud

Cloud + self-host hybrid

Self-hosted Kubernetes

The managed option — default for the first 12–18 months of most products.