
LiveKit AI voice agents are the difference between a “press 1 for support” IVR and a caller who thinks they’re talking to a person. In 2026, a production-grade agent can qualify a lead, book an appointment, or handle a tier-1 support call with sub-second feel — at roughly 5–10% of the cost of a human-staffed call centre.
This is the business-and-engineering guide we wish we’d had when Fora Soft shipped its first LiveKit agent in 2024. We cover the exact stack to pick, the latency numbers you have to hit, realistic per-minute cost ranges, the five pitfalls that kill projects, and how LiveKit compares to Vapi, Retell, Pipecat and Bland. If you’re sizing a rollout, skip to the 6-week mini-case or the buy-vs-build framework.
Thinking about a voice agent for your product?
Free 30-minute call with a Fora Soft engineer who has shipped LiveKit agents in production. We’ll size the rollout, pick a stack, and give you a realistic per-minute cost.
Why product owners pick LiveKit for voice agents in 2026
LiveKit is the open-source WebRTC stack that Meta, OpenAI (for ChatGPT Voice), Character.ai, and thousands of smaller vendors run their real-time audio on. LiveKit Agents, the framework we’re covering here, went 1.0 in April 2025; as of April 2026 the Python SDK is at 1.5.x, with adaptive interruption handling and native Model Context Protocol (MCP) tool support.
Three reasons product owners pick LiveKit over a managed platform like Vapi or Retell:
- Cost at scale. Above roughly 10,000 minutes/month, the framework path undercuts managed platforms by 60–80% per call. Below that, managed is cheaper once you count eng time.
- Vendor freedom. You bring your own STT (Deepgram, AssemblyAI, Whisper), LLM (Claude, GPT, Gemini, open models), and TTS (Cartesia, ElevenLabs, Azure). No lock-in.
- Telephony is first-party. LiveKit shipped native SIP and Phone Numbers in 2025, so inbound and outbound calling no longer needs a Twilio bridge.
We’ve shipped LiveKit agents for customer support, outbound qualification, in-app voice companions, and one regulated outbound-collections workflow. The patterns below are what held up under production traffic.
What a LiveKit AI voice agent actually is
A LiveKit agent is a process, not a chatbot. It joins a LiveKit “room” as a participant — exactly the way a human would if they dialed in — subscribes to the caller’s audio track, runs that audio through ASR, feeds the text to an LLM (optionally with tool-calling), and publishes the LLM’s response back as synthesized speech on its own audio track. It’s real-time, bi-directional, and fully parallel.
Why “process in a room” matters
Because the agent is just another participant, you can drop a human into the same room to take over, record the whole conversation, run a second supervisor agent that watches the first, or bridge to a SIP phone line — without changing your architecture. That flexibility is why LiveKit beats bespoke stacks on iteration speed.
The core primitive is AgentSession, which unified the older VoicePipelineAgent and MultimodalAgent abstractions into one orchestrator in the 1.0 release. You declare STT, LLM, and TTS as pluggable components, register tool functions, and the SDK wires up streaming, turn-detection, and interruption handling.
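In code, that declaration is compact. A minimal sketch following the shape of the 1.x Python quickstart (the plugin constructor arguments and instructions string are illustrative, not prescriptive):

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()  # join the room as a participant

    session = AgentSession(
        vad=silero.VAD.load(),                # acoustic voice activity detection
        stt=deepgram.STT(),                   # streaming ASR
        llm=openai.LLM(model="gpt-4o-mini"),  # low first-token latency
        tts=cartesia.TTS(),                   # streaming synthesis
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a concise tier-1 support agent."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Swapping a vendor means changing one constructor on the `AgentSession`; the streaming, turn-detection, and interruption plumbing stays identical.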
If you’re new to the underlying real-time stack, our WebRTC explainer covers the transport that makes all of this possible. LiveKit is a higher-level abstraction built on top of WebRTC that handles the server-side SFU you’d otherwise have to operate yourself.
The latency budget that separates useful from unusable
Every voice agent lives or dies on one number: time from the user finishing their turn to the first syllable of the agent’s response. Below 300ms it feels human. 300–600ms feels sluggish but acceptable. Above 600ms users revert to touch-tone mental models and start tapping keys. Above 1.5s they hang up.
Here’s the budget that hits sub-500ms perceived latency in production:
| Stage | Target p50 | Target p95 | Who owns it |
|---|---|---|---|
| Endpointing / VAD | 80ms | 160ms | Semantic VAD model |
| Final ASR transcript | 120ms | 250ms | Deepgram / Cartesia |
| LLM first token | 180ms | 400ms | OpenAI / Claude / Gemini |
| TTS first audio | 70ms | 140ms | Cartesia / ElevenLabs |
| Perceived total | ~450ms | ~950ms | The whole pipeline |
A hard truth from 2025–2026 production data: published industry medians across millions of real calls sit around 1.4–1.7s, and p99 runs 3–5s. The 450ms number above is achievable — but only with streaming at every stage, co-located regions, pre-warmed model contexts, and a disciplined observability practice. It won’t happen by default.
The big gotcha is first-token LLM latency. A model that feels fast in chat (1s to first token) eats two-thirds of the 1.5s hang-up threshold on its own. Pick a model that streams its first token in under 300ms, even if its reasoning is slightly weaker.
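Benchmark this yourself rather than trusting chat-app feel. A rough harness using the OpenAI Python client; any OpenAI-compatible endpoint works, and the model name is whatever you are evaluating:

```python
import asyncio
import time

from openai import AsyncOpenAI

async def first_token_ms(model: str, prompt: str = "Say hello.") -> float:
    """Time from request to first streamed content token, in milliseconds."""
    client = AsyncOpenAI()
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=32,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("inf")

print(asyncio.run(first_token_ms("gpt-4o-mini")))
```

Run it a few dozen times from the region your agent will deploy in; a single-shot number hides the p95 tail that callers actually feel.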
LiveKit Agents 1.x: the 2026 framework in one diagram
At runtime the pipeline is four layers, each running concurrently so the user never waits for a previous stage to finish.
```
CALLER AUDIO ──> LiveKit Room ──> AgentSession
                                      |
        +--------------+-------------+-------------+
        |              |             |             |
   VAD / Turn      Streaming     Streaming     Streaming
   Detection          ASR           LLM           TTS
        |              |             |             |
    endpoint       partial +    first token   first audio
    decision       final        + tool calls     frame
                   transcript
        |              |             |             |
        +--------------+-------------+-------------+
                                      |
                      LiveKit Room ──> CALLER AUDIO
```
The important pieces inside AgentSession:
- Worker. A long-lived process that dispatches one Job per incoming room. One worker can run dozens of concurrent sessions.
- Job. A single agent-on-a-call lifecycle. Each Job has its own LLM context, STT stream, and TTS buffer.
- Plugins. Drop-in implementations of STT, LLM, TTS, and VAD. Swapping Deepgram for AssemblyAI is a one-line change.
- Tool registration. Decorated Python functions become callable by the LLM mid-turn, via OpenAI tool-calling or MCP.
The streaming transport is the same WebRTC that LiveKit uses for human video calls. That means the agent can also join a video conference, watch a screen share, and answer questions about what’s on screen — the same pattern we use for AI features on top of existing video platforms, as covered in our ChatGPT streaming integration guide.
Speech-to-speech vs cascade: which pipeline to pick
There are two architectural options in 2026, and most production agents now combine them.
Cascade (STT → LLM → TTS). The traditional pipeline. Three separate vendors, three separate models, three separate logs. More moving parts, but you can pick the best model at each layer, redact PII between stages, and swap a vendor without a rewrite. This is what 90% of LiveKit production agents still use in 2026.
Native speech-to-speech (S2S). OpenAI’s Realtime API (gpt-realtime / gpt-4o Realtime) and Google’s Gemini 2.5 Live take audio in and emit audio out — no explicit text stage. End-to-end latency drops to 320–800ms, and the voice sounds more natural because pauses and prosody are preserved. The trade-offs: less predictable cost, harder to log and redact, single-vendor lock-in.
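Wired into LiveKit, the S2S path collapses the three-stage pipeline into a single realtime model. A minimal sketch using the OpenAI Realtime plugin (the voice name is illustrative):

```python
from livekit.agents import AgentSession
from livekit.plugins import openai

# One model handles audio in and audio out; no separate STT or TTS stages,
# so there is no text transcript to log or redact between them.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
)
```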
Hybrid is the winning 2026 pattern
Use S2S for the natural-feeling small-talk portions of a call (greetings, rapport, clarifications). Drop to cascade when the agent needs to call a tool (because tool-calling is more reliable on text-mode LLMs), then return to S2S for the response. LiveKit’s session manager handles the switch.
If your use case involves PII redaction (healthcare, finance), start with cascade. If it involves long conversational flows with no tool calls (coaching, companions, storytelling), start with S2S.
Turn detection, VAD and barge-in that feel natural
The single most common complaint about bad voice agents is interruption handling. Either the agent barrels on after the user starts speaking (robotic), or it stops on every breath (annoying). The 2026 answer is a two-signal model:
- Acoustic VAD (Silero). <1ms inference per audio chunk. Detects whether someone is speaking. Fast but naive — can’t tell “umm” from end-of-turn.
- Semantic turn detection. LiveKit ships a ~135M-parameter SmolLM2 fine-tune that runs locally and predicts whether the current transcript looks like a finished thought. Combines with acoustic VAD to give natural pacing.
Barge-in (user interrupts the agent mid-sentence) is handled by the runtime: when VAD fires on the user’s track while the agent is speaking, TTS is cancelled, the interrupted LLM turn is rolled back, and the user’s new input is processed.
One practical tip: tune the VAD silence threshold to your vertical. Sales calls want ~400ms silence before end-of-turn (users think out loud). IVR replacement wants ~250ms (users are purposeful). Healthcare intake wants ~600ms (users are older, pauses are longer). A single default will feel wrong for at least two of the three.
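In the Python SDK that threshold is one constructor argument. Illustrative values matching the guidance above, assuming the Silero plugin's `min_silence_duration` parameter (seconds):

```python
from livekit.plugins import silero

# Silence the caller must hold before the turn is treated as finished.
sales_vad = silero.VAD.load(min_silence_duration=0.40)   # callers think out loud
ivr_vad = silero.VAD.load(min_silence_duration=0.25)     # purposeful callers
intake_vad = silero.VAD.load(min_silence_duration=0.60)  # longer natural pauses
```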
Tool-use: letting the agent actually do the work
The difference between a voice chatbot and a voice agent is the ability to take action mid-conversation. In LiveKit Agents, tools are ordinary Python functions decorated to expose them to the LLM:
```python
from livekit.agents import function_tool

# 1.x tool decorator (the 0.x API was @llm.ai_callable); the docstring
# becomes the tool description the LLM sees.
@function_tool()
async def get_order_status(order_number: str) -> dict:
    """Look up order status by order number."""
    return await crm.orders.fetch(order_number)

@function_tool()
async def book_callback(phone: str, iso_time: str) -> str:
    """Schedule a follow-up call."""
    return await scheduler.book(phone, iso_time)
```
During a turn, the LLM can emit a tool call, the runtime executes the function, streams the result back into the LLM’s context, and the LLM continues with the updated information. The caller hears a brief filler (“one moment, let me check that”) so the pause feels intentional, not laggy.
Three production patterns we recommend:
- Read tools are cheap, write tools are expensive. Let the agent look up anything it wants. Put writes (sending email, charging a card, cancelling an appointment) behind explicit user confirmation.
- Tools fail. Design for it. Every tool wrapper should time out after ~2 seconds and return a graceful “system unavailable” message that the LLM can verbalize naturally (see the sketch after this list).
- Log every tool call. For debugging, evals, and regulatory audit trails. This becomes non-optional under the EU AI Act logging requirements.
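A minimal shape for the timeout-and-fallback pattern. `guarded_tool_call` is not a LiveKit API; it is a hypothetical wrapper you would write around your own tool bodies:

```python
import asyncio
import logging

logger = logging.getLogger("voice_agent.tools")

async def guarded_tool_call(name: str, coro, timeout: float = 2.0):
    """Run a tool coroutine with a hard timeout and a speakable fallback."""
    try:
        result = await asyncio.wait_for(coro, timeout=timeout)
        logger.info("tool=%s status=ok", name)  # audit trail: log every call
        return result
    except asyncio.TimeoutError:
        logger.warning("tool=%s status=timeout", name)
        return {"error": "system_unavailable",
                "speakable": "That system is taking too long right now."}
    except Exception:
        logger.exception("tool=%s status=error", name)
        return {"error": "system_unavailable",
                "speakable": "Something went wrong on my end."}
```

Returning a `speakable` field instead of raising lets the LLM turn the failure into a natural sentence rather than dead air.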
Our voice AI agents guide walks through a complete tool-using agent step by step, including confirmation patterns and the prompt shape that keeps tool-calls reliable.
Need a working prototype this month?
We routinely ship a LiveKit voice-agent pilot (tool-use, live telephony, observability) in 4–6 weeks. Free scoping call with a senior engineer.
SIP and telephony: putting the agent on a phone number
Until mid-2025, getting a LiveKit agent onto an actual PSTN phone line required bridging through Twilio or Telnyx with some fiddly SIP glue. LiveKit SIP went GA and LiveKit Phone Numbers shipped in 2025, which means a voice agent can now accept a call from any phone on Earth with roughly four lines of configuration.
For inbound calls, the pattern is: point a SIP trunk (LiveKit’s own, Telnyx, or any provider) at LiveKit’s SIP endpoint; the trunk forwards the call into a room; the agent worker spawns a job on that room. For outbound calls, the agent initiates the SIP INVITE via LiveKit’s server API. Both paths are documented in the LiveKit docs, and code samples are available in the agents-python repository.
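A hedged sketch of the outbound half using the livekit-api Python client; the request fields follow the CreateSIPParticipant API as we’ve used it, and the trunk ID is a placeholder for your provisioned trunk (check the current docs before shipping):

```python
from livekit import api

async def dial_out(phone_number: str, room_name: str) -> None:
    # Places a SIP INVITE through your outbound trunk and drops the callee
    # into the room, where the agent worker picks up the job.
    lkapi = api.LiveKitAPI()  # reads LIVEKIT_URL / key / secret from env
    await lkapi.sip.create_sip_participant(
        api.CreateSIPParticipantRequest(
            sip_trunk_id="ST_your_trunk_id",  # placeholder trunk ID
            sip_call_to=phone_number,
            room_name=room_name,
            participant_identity="callee",
        )
    )
    await lkapi.aclose()
```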
Pricing note: LiveKit Phone Numbers is competitive with Twilio per-minute but loses on per-number monthly fees in low-volume cases. If you’re under ~500 minutes/month per number, Twilio or a cheaper per-number provider + SIP trunk remains the better option. Above that, native LiveKit Phone Numbers is simpler.
Vendor matrix: ASR, LLM and TTS in 2026
Our April 2026 defaults for a production English-language agent, with alternatives for common edge cases.
| Layer | Default pick | Why | Swap to… |
|---|---|---|---|
| ASR | Deepgram Nova 3 | <150ms final, ~$0.01/min | AssemblyAI (multilingual), Whisper (self-host) |
| LLM | GPT-4o mini / Claude Haiku | Sub-200ms first token, strong tool-use | Sonnet (harder reasoning), Gemini 2.5 Flash (cheap scale) |
| TTS | Cartesia Sonic 3 | <100ms first audio, ~$0.03/min | ElevenLabs (quality), Azure Neural (price) |
| VAD | LiveKit semantic + Silero | Best turn detection available | Deepgram’s built-in endpointing for single-vendor |
| S2S | OpenAI Realtime (gpt-realtime) | Most mature, wide voice library | Gemini 2.5 Live (long context) |
Deploying: LiveKit Cloud vs self-hosted
Three deployment modes to choose from:
- LiveKit Cloud. Managed SFU, agent dispatch, observability dashboard, global PoPs. The “just works” option. Fastest to MVP.
- Self-hosted LiveKit server + cloud agents. You run the SFU on your own Kubernetes or ECS; agent workers can run wherever you like. Good fit if you already operate real-time video infrastructure.
- Fully self-hosted via a SIP partner. LiveKit + Telnyx or Wavix, no LiveKit Cloud. Reported savings of ~50% per call for high-volume deployments. Requires an ops team.
For almost every project below 100k minutes/month, LiveKit Cloud wins on total cost of ownership once engineering time is counted. Above that, the self-hosted path starts to pay for itself in 6–9 months.
Autoscaling in all three modes is driven by Worker concurrency: each worker holds N sessions, and you add workers linearly with call volume. Plan for bursts — a 5x spike from a marketing push is common — by keeping a pool of pre-warmed workers so cold-start doesn’t show up in p99 latency.
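In the Python SDK the pre-warm knobs live on WorkerOptions. A sketch under the assumption that the 1.x field names below are current; `prewarm_fnc` runs once per process before it accepts jobs:

```python
from livekit import agents
from livekit.plugins import silero

def prewarm(proc: agents.JobProcess):
    # Load heavy local models once per process, not once per call.
    proc.userdata["vad"] = silero.VAD.load()

opts = agents.WorkerOptions(
    entrypoint_fnc=entrypoint,  # your session entrypoint from earlier
    prewarm_fnc=prewarm,
    num_idle_processes=4,       # warm spares so bursts skip cold-start
)
agents.cli.run_app(opts)
```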
Cost model: what a production call actually costs
Here are real per-minute numbers from production deployments we’ve seen in 2025–2026. Your mileage will vary by voice verbosity and LLM token usage.
| Component | Budget stack | Balanced stack | Premium stack |
|---|---|---|---|
| LiveKit session | $0.010 | $0.010 | $0.010 |
| ASR | $0.010 (Deepgram) | $0.010 (Deepgram) | $0.015 (AssemblyAI) |
| LLM | $0.008 (Gemini Flash) | $0.020 (GPT-4o mini) | $0.050 (Claude Sonnet) |
| TTS | $0.015 (Azure Neural) | $0.030 (Cartesia) | $0.090 (ElevenLabs) |
| Telephony (if used) | $0.010 | $0.013 | $0.015 |
| Total / min | ~$0.05 | ~$0.08 | ~$0.18 |
Against a typical $7–12 human-BPO per-call cost, that’s roughly 10–40x cheaper depending on stack and call length: a four-minute call costs ~$0.32 on the balanced stack and ~$0.72 on premium. Reality check: expect your first production deployment to come in 20–30% higher than the table above, because real calls include retries, unused tokens, and ops overhead.
For a fuller view of how we price this kind of work end-to-end, the Fora Soft software estimating guide covers the three-number method we use for voice-agent builds.
Observability and evals: you can’t ship what you can’t measure
The single biggest cause of voice-agent rollback we’ve seen is a team shipping without traces. A bad turn is usually invisible in logs — you need the audio plus the transcripts plus the LLM response plus the tool result to understand what the agent did wrong.
The 2026 observability stack that works:
- Turn-level traces. LiveKit’s OpenTelemetry hooks emit one span per turn with ASR, LLM, TTS, and tool-call timings. Wire this to your existing APM (see the sketch after this list).
- Call recordings. Dual-track audio archived to S3 or equivalent, with retention aligned to your compliance (30 days for most, 7 years for finance).
- Eval harness. A nightly job replays 100–500 canned scenarios through the agent and grades the responses against a rubric. Prevents silent regressions when you swap a model version.
- Error taxonomy. Every failed turn gets a label: tool-timeout, hallucinated-fact, barge-in-misfire, etc. Track the trend over time.
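If you’re wiring traces by hand rather than through LiveKit’s hooks, the turn-level shape looks like this in plain OpenTelemetry (the attribute names are our convention, not a standard):

```python
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

def record_turn(asr_ms: float, llm_first_token_ms: float,
                tts_first_audio_ms: float, tool_calls: int) -> None:
    # One span per conversational turn, per-stage timings as attributes.
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("asr.final_ms", asr_ms)
        span.set_attribute("llm.first_token_ms", llm_first_token_ms)
        span.set_attribute("tts.first_audio_ms", tts_first_audio_ms)
        span.set_attribute("tools.count", tool_calls)
```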
Ship the eval before the first customer call
A voice agent without regression evals will silently degrade when any vendor updates a model. Build the eval harness in week 2, not month 6. 100 scenarios, 20 golden transcripts, one CI job.
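A skeleton of that CI job, assuming a hypothetical `run_agent_turn` helper that drives your agent’s text path and a scenarios file you maintain yourself:

```python
import json

import pytest

with open("evals/scenarios.json") as f:
    SCENARIOS = json.load(f)  # [{"name", "user_input", "expected_phrases"}, ...]

@pytest.mark.asyncio  # requires pytest-asyncio
@pytest.mark.parametrize("scenario", SCENARIOS, ids=lambda s: s["name"])
async def test_scenario(scenario):
    # run_agent_turn is your own harness entrypoint, not a LiveKit API.
    reply = await run_agent_turn(scenario["user_input"])
    for phrase in scenario["expected_phrases"]:
        assert phrase.lower() in reply.lower(), f"missing: {phrase}"
```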
Use cases shipping in 2026
Categories where LiveKit voice agents are in live production at scale as of Q2 2026:
- Tier-1 customer support. Refunds, returns, password resets, basic troubleshooting. Containment rates of 40–70% on narrow domains.
- Appointment booking. Dental, automotive, salon, veterinary. 24/7 intake with calendar integration and reminders.
- Outbound qualification. B2B lead follow-up, BANT scoring, and scheduling a human rep for promising leads. Replaces the SDR power-dialer.
- Collections (soft). Payment reminders, plan setup, account updates. Regulatory-heavy — requires TCPA compliance.
- Healthcare intake. Pre-visit forms, symptom capture, insurance verification. HIPAA scope requires a Business Associate Agreement with every vendor.
- In-app voice companions. Embedded in mobile and web apps for coaching, tutoring, and accessibility. Shorter calls, higher concurrency.
- In-car assistants. Increasingly LiveKit-powered after several 2025 OEM wins.
Use cases that still struggle in 2026: emotionally sensitive long-form dialogue (therapy, grief), multi-language round-trips with code-switching, and any write-action where a hallucination is expensive (wire transfers, policy cancellation).
LiveKit vs Vapi vs Retell vs Pipecat vs Bland
A 2026 comparison across the five frameworks/platforms most product teams evaluate.
| Option | Type | Time to 1st call | Cost / min | Best fit |
|---|---|---|---|---|
| LiveKit Agents | Open-source framework | 2–6 weeks | $0.05–0.18 | 10k+ min/mo, custom integrations |
| Vapi | Managed, code-first | 2–3 hours | $0.05–0.13 | <10k min/mo, fast MVP |
| Retell AI | Managed, visual builder | 3–6 hours | $0.06–0.15 | Non-technical owners, <20k min/mo |
| Pipecat | Open-source framework | 2–6 weeks | $0.04–0.17 | Custom orchestration, video+voice |
| Bland AI | Managed telephony | 1–2 days | $0.08–0.20 | Regulated outbound, TCPA-heavy |
Our short-form decision rule: pick Vapi or Retell for anything below ~10k minutes/month while you validate; switch to LiveKit Agents (or Pipecat if you need tighter video+voice coupling) once volume or customization crosses that threshold.
Compliance: EU AI Act, consent, and PCI
Three compliance areas that voice-agent projects repeatedly underscope:
- EU AI Act — 2 August 2026. General-purpose AI obligations land on that date. If your agent serves EU users, you need an AI-disclosure at the start of the call (“You are speaking with an AI assistant”), a logged record of the interaction, and documentation of the foundation-model provider’s compliance evidence.
- Call recording consent. US two-party states (California, Florida, Illinois, Massachusetts, Montana, Nevada, New Hampshire, Pennsylvania, Washington) require both parties to consent. GDPR in the EU requires a lawful basis plus a right-to-erasure path for audio and transcripts.
- PCI DSS. If your agent ever takes a card number by voice, you need in-call DTMF-or-audio redaction before the data hits the LLM. Several vendors (CrescentMedia, PCI Pal, Syntec) ship drop-in pause-resume patterns.
Practical note: document-then-enforce. Before the first customer call, write down (a) the AI-disclosure script, (b) the recording-retention policy, (c) the tools the agent is allowed to call, (d) the escalation rule. That document is your regulator-ready audit trail.
Mini-case: a 6-week LiveKit support agent rollout
One of our clients, a mid-market SaaS company running ~12,000 inbound support calls per month through a human team, asked us to pilot a LiveKit agent for the first-response layer. The goal: deflect 40% of calls without degrading CSAT. Here’s what the six weeks looked like.
| Week | Milestone | Outcome |
|---|---|---|
| 1 | Scope 20 intents, write eval transcripts | 100 scenarios, 20 golden turns |
| 2 | AgentSession + Deepgram + GPT-4o mini + Cartesia | p95 latency 780ms |
| 3 | Tool-use (CRM read, ticket create) + eval CI | Eval pass-rate 82% |
| 4 | SIP trunk, AI disclosure, recording pipeline | First live call with 5% traffic |
| 5 | Prompt tuning, escalation rules, barge-in calibration | 20% traffic, CSAT parity with humans |
| 6 | 50% traffic, dashboards, incident runbook | 47% containment, cost ~$0.09/min |
Total external build cost: ~$72,000 across six weeks (two senior engineers, one designer for the disclosure UX). Estimated annualized savings against the previous human-only ops: ~$420,000 after accounting for the ~$8,000/month ongoing vendor stack. Payback under 10 weeks.
Agent engineering inside Fora Soft shaves another ~25% off build time on projects like this because most of the scaffolding (eval harness, trace pipeline, disclosure UI, SIP wiring) is now reusable across clients.
Decision framework: buy, build, or hybrid
Five questions decide whether you should build on LiveKit, buy a managed platform, or run a hybrid:
- Volume. Above ~10k minutes/month, LiveKit wins on TCO. Below, managed wins.
- Integration depth. Does the agent need to call bespoke internal APIs, not standard CRMs? If yes, LiveKit.
- Latency ceiling. Is sub-500ms perceived latency the product? If yes, LiveKit gives you the most control over each stage.
- Compliance. Regulated verticals (health, finance, legal) need auditable logging, PII redaction, BAAs. LiveKit makes that easier.
- Team. Do you have a Python-shop engineering team that can ship and operate async streaming code? If not, managed is safer.
Three yeses out of five is usually enough to justify building on LiveKit. If you’re at one or two, start managed, measure volume for a quarter, then revisit.
Not sure which path fits your product?
30 minutes on a call with us and you’ll leave with a clear buy-vs-build recommendation, a stack shortlist, and a per-minute cost estimate for your expected volume.
Five pitfalls that kill LiveKit voice-agent projects
Every rollback we’ve diagnosed has one or more of these. If you can rule them out early, you’ll ship.
- No eval harness until after launch. A silent regression in a vendor model will break your agent overnight. Build evals in week 2.
- Over-engineering turn detection. Teams try to tune interruption heuristics manually instead of using LiveKit’s semantic VAD out of the box. Use the defaults, then tune only the silence threshold per vertical.
- LLM picked for reasoning, not latency. A model with 1.2s first-token latency blows the budget. Pick the lowest-latency model that clears the quality bar, not the smartest.
- Write tools without confirmation. An agent that can send an email or charge a card without a readback is a liability incident waiting to happen. Always confirm.
- Compliance deferred to legal. AI disclosure, recording consent, and PII logging are engineering work. If legal owns them, they ship after the rollback.
If you’re planning to ship a voice feature inside a mobile app, the App Store review layer adds another pitfall: Apple and Google now flag AI features in review. Our AI video analytics case study covers how we handled similar disclosure patterns for an e-learning product.
KPIs that tell you the agent is working
Watch these weekly. If the leading KPIs trend the right direction and the lagging KPIs don’t, the agent is a toy — not a product.
| Category | Leading KPI | Lagging KPI |
|---|---|---|
| Performance | ASR-to-first-audio p95 | Call abandonment rate |
| Quality | Eval pass-rate (100 scenarios) | CSAT delta vs human baseline |
| Containment | Tool calls per call | Human-handoff rate |
| Cost | Cost per call (rolling 7-day) | Cost per resolved incident |
| Safety | Hallucination flag rate | Customer complaint ratio |
When not to build a LiveKit voice agent
LiveKit voice agents are powerful, but they’re not the answer to every customer-interaction problem. Don’t build if:
- Your call volume is under 500 minutes/month. A managed platform or a better FAQ page will ship faster.
- Your users are primarily elderly or otherwise unfamiliar with voice AI — without clear signposting they’ll mistrust the interaction.
- Every call requires a human empathetic touch (grief counselling, crisis support). Voice AI can hurt more than help.
- Your product already has a well-adopted chat channel with sub-10-second human response. Voice agents solve a latency problem you don’t have.
- Your legal team has not signed off on AI disclosure, consent, or recording retention for your jurisdictions. Build that first.
Saying no to a voice-agent project with evidence is a better outcome than saying yes and rolling back six months later.
FAQ
How long does it take to build a first LiveKit voice agent?
For a well-scoped tier-1 support agent with CRM tool-use and telephony, 4–6 weeks from kickoff to first live call with a senior two-engineer team. Add 2–3 weeks for regulated verticals that need BAAs or PCI pauses.
Can LiveKit agents handle multiple languages?
Yes, with caveats. English, Spanish, French, German, and Mandarin have mature ASR and TTS support. Lower-resource languages work but with higher latency and weaker turn detection. Code-switching mid-call (Spanglish) is still fragile in 2026.
Does a LiveKit agent work for outbound calls too?
Yes. LiveKit SIP supports outbound INVITE from the agent worker. Just watch TCPA and GDPR rules for unsolicited outbound — the regulatory exposure is much higher than inbound.
What’s the cheapest way to experiment before committing?
Spin up a LiveKit Cloud free tier, run the agent quickstart in Python, and point your own phone at it via a cheap SIP provider. Two days of tinkering will tell you more than a month of PRD writing.
Do we need GPUs to run LiveKit agents?
No, unless you’re self-hosting the ASR or TTS models. The agent worker itself is CPU-bound. All the heavy AI inference happens at your ASR/LLM/TTS vendors. If you self-host Whisper or a TTS model, yes, GPU is required.
How do we prevent the agent from hallucinating about our product?
Three layers: (1) tight system prompt with explicit “say I don’t know” fallback, (2) RAG over product docs surfaced as a tool, and (3) an eval harness that specifically tests for fabrication. Don’t rely on any single one.
Is LiveKit open source? Can we self-host everything?
Yes. LiveKit Server and LiveKit Agents are Apache 2.0. LiveKit Cloud is the paid managed version. You can run the entire stack on your own Kubernetes cluster if you have the ops capacity, but budget 3–4 weeks of infra work just to get to parity with Cloud.
How does the EU AI Act affect our voice-agent rollout?
From 2 August 2026, if EU users are in scope, you must disclose that the caller is talking to an AI, log the interaction for audit, and have documentation of your underlying model provider’s compliance. Expect 4–8 engineer-weeks to stand up the audit trail and disclosure UI for a standard-risk use case.
What to read next
- Build voice AI that sounds human with LiveKit: the deeper-dive companion on prompt shapes, turn handling, and voice-agent UX.
- ChatGPT streaming integration guide: adding a conversational AI layer to existing WebRTC video pipelines.
- What is WebRTC: the transport layer under LiveKit, in one primer.
- How Fora Soft estimates software: the three-number method we use to size voice-agent projects.
- Mobile app development costs guide: realistic 2025–2026 numbers for AI-first and voice-first mobile builds.
- Scalable video streaming app challenges: lessons that transfer directly to scaling a real-time voice workload.
Sum-up: ship your first LiveKit voice agent in 2026
Key takeaways
- Target sub-500ms perceived latency. It’s the line between “useful” and “unusable.” Pick models and vendors accordingly.
- Default to cascade, reach for S2S only where naturalness is the product. Hybrid is the 2026 pattern.
- Use LiveKit’s semantic VAD, don’t reinvent turn detection. Tune one parameter per vertical.
- Ship evals in week 2, not month 6. Every voice-agent disaster traces back to missing observability.
- Below ~10k min/mo, buy Vapi or Retell. Above, build on LiveKit. Switch when volume justifies the engineering investment.
- EU AI Act deadline is 2 August 2026. Disclosure, logging, and provider-evidence need to exist before the first EU-served call that day.
- Expected per-minute cost: $0.05–0.18 blended, roughly 10–40x cheaper than human BPO depending on call length. Payback is typically under 3 months at meaningful volume.
- The five pitfalls that kill projects all trace to skipped evals, skipped compliance, or optimizing the wrong number. Avoid them by doing the unglamorous work first.
Voice AI in 2026 is no longer a research preview — it’s a deployable feature with known unit economics, proven vendors, and a clear regulatory deadline. The gap between teams shipping it and teams still evaluating is already measurable in cost-per-call, CSAT, and engineering-hiring leverage. The fastest way across that gap is a 6-week pilot on real traffic, not a six-month PRD.
If you want help scoping yours, Fora Soft has shipped LiveKit voice agents for customer support, outbound, in-app, and regulated verticals since the framework’s 0.x days. We’ll bring the scorecard, the stack choices, and the realistic estimates to your first meeting.
Ship your first LiveKit voice agent in 6 weeks
Free 30-minute scoping call with a Fora Soft senior engineer. You’ll leave with a stack, a cost estimate, and a realistic timeline.

