
Voice AI that actually sounds human is a latency problem first and a model problem second. Get the full round trip — microphone to speaker — under 800 ms on the pipeline stack or 300 ms on a speech-to-speech model, and the conversation feels natural. Miss that budget, and every caller notices. LiveKit Agents 1.x is the open-source framework most serious teams reach for to hit those numbers without locking into a single vendor.
This playbook is the condensed version of what Fora Soft teaches new engineers on day one of a voice-AI project. We ship LiveKit agents for support desks, sales dialers, clinical intake, and in-app assistants — and we’ve watched every latency trap, every cost blowout, every compliance surprise at least twice. The target reader is a founder, CTO, or senior engineer who needs a real answer to “should we build this, and on what?” by Friday.
Key takeaways
- Pipeline (STT → LLM → TTS) gives you control and costs $0.05–$0.15 per minute all-in; speech-to-speech gives you 300 ms latency but locks you to one vendor.
- LiveKit Agents is the right choice when you need custom flows, observability, or more than 10,000 minutes a month. Below that, Retell or Vapi ship faster.
- The provider sweet spot in 2026: Deepgram Nova-3 for STT, Claude Haiku 4.5 for the LLM, Cartesia Sonic-3 or Deepgram Aura-2 for TTS — total latency 550–700 ms.
- Krisp noise cancellation is now metered from 1 May 2026. Plan $0.002–$0.004 per minute into your budget or turn it off for clean-room audio.
- Compliance is on you, not the platform. TCPA, HIPAA, two-party recording consent, and GDPR all apply the moment you dial out.
Why teams pick LiveKit for voice AI in 2026
Every voice-AI team eventually hits the same fork. Either you bolt your agent onto someone else’s managed platform (Vapi, Retell, Bland, Synthflow) and accept their latency, their prompt templates, and their margin, or you build on a realtime transport you control. LiveKit is the second path — an Apache-2.0 realtime media stack with a production-grade Agents framework on top.
What you get out of the box: WebRTC between your users and the agent, HTTP/WebSocket between the agent and your backend, plug-in STT/LLM/TTS providers, a tuned turn-detection model, licensed Krisp noise cancellation, per-turn latency metrics, call recording, and PSTN bridging through Telnyx or Twilio. Open source, no vendor lock-in, community plug-ins for every major provider.
What you trade: you write Python or Node, you own the ops burden if you self-host, and you budget one to three engineering weeks for a production-ready agent instead of the three hours Retell advertises. For anything past a proof of concept, that trade is the right one — we’ll show you why in the cost breakdown below.
What “human-sounding” actually means
Humans expect a response gap of 200–300 ms in natural conversation. Cross 500 ms and callers consciously notice the delay. Cross a second and they start interrupting, hanging up, or pressing zero for a human. That is the real performance envelope for voice AI — not model quality, not voice realism, not tool coverage.
“Human-sounding” is four things stacked together:
- Fast first response. 300–800 ms from end-of-user-speech to start-of-agent-audio, depending on architecture.
- Graceful interruption. The agent stops speaking within 120–200 ms when the caller talks over it, and resumes the right context.
- Prosody that matches intent. Emphasis on the right words, breath groups that match semantic structure, expressive voices for empathy-heavy moments.
- Factual grounding. The agent knows when to say “let me check” and actually checks — via function calling, not hallucination.
Miss any of the four and the illusion breaks. Most teams focus on item three (“we need a better voice”) when their real gap is item one. ElevenLabs is not the bottleneck at 75 ms TTFB; your LLM at 1.2 s is.
The 800 ms latency budget — and where it leaks
The classic pipeline runs five stages. Here is a realistic best-case budget with 2026 providers, measured end-to-end on a US-region deployment:
| Stage | Best case | Typical | Where it leaks |
|---|---|---|---|
| VAD & end-of-turn detect | 50 ms | 80–150 ms | Slow/noisy speech, accents, VAD model tuning |
| STT transcription (streaming) | 150 ms | 200–300 ms | Non-streaming, cross-region, Whisper-grade batch |
| LLM TTFT | 400 ms | 600–1,200 ms | Long prompts, large context, cold provider, no streaming |
| TTS first-byte | 75 ms | 150–250 ms | No streaming TTS, expressive voices, low-traffic regions |
| Network & playback | 50 ms | 80–200 ms | Mobile radio, PSTN hop, far-away TURN server |
| Total (pipeline) | ~725 ms | 1.1–2.1 s | |
| Speech-to-speech (Realtime / Gemini Live) | 200 ms | 300–500 ms | Long context, PSTN hop, cold start |
Figure 1. Voice-AI latency budget, pipeline vs speech-to-speech (2026 benchmarks, US-region).
The two fastest leaks in every production deploy we audit are LLM TTFT and cross-region network hops. Both are fixable. Pick a low-TTFT model, pin the STT/LLM/TTS providers to the same cloud region, and stream everything. If you still miss 800 ms after that, the problem is prompt length, not infrastructure.
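Before chasing any single stage, measure where your deployment actually leaks. A minimal sketch — it assumes the `metrics_collected` event and the `metrics` helpers exposed by the LiveKit Agents 1.x SDK, wired in after the session is created:

```python
from livekit.agents import metrics, MetricsCollectedEvent

# inside entrypoint(), once the AgentSession exists
usage = metrics.UsageCollector()

@session.on("metrics_collected")
def _on_metrics(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)  # per-stage timings: STT latency, LLM TTFT, TTS TTFB
    usage.collect(ev.metrics)        # aggregates token/character usage for cost review
```

A week of these logs tells you whether to fix the model, the region, or the prompt.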
Reach for pipeline when…
You need custom tool calling, best-in-class STT accuracy, cost control at scale, swappable components per feature, or compliance that forbids sending audio to a single third-party model.
Pipeline vs speech-to-speech: which fits your product
Speech-to-speech models (OpenAI Realtime, Gemini 3.1 Flash Live) skip the transcription step entirely: audio goes in, audio comes out. Round trip can hit 200 ms. They sound uncannily natural and handle barge-in with zero extra code. The catch is that you buy everything — reasoning, voice, interruption, tool calling — as a bundle from one vendor.
The pipeline is the opposite trade. You stitch three to four providers, eat 500–700 ms of overhead, and in exchange you can swap the LLM when Claude ships a new model, point STT at a vertical-specific provider (Deepgram Medical, AssemblyAI for entities), use a cloned brand voice on TTS, and log every stage independently for audit.
Our practical rule: go speech-to-speech for consumer-facing assistants where personality beats precision. Go pipeline for anything that talks to a database, quotes a price, or could land you in court if it hallucinates. LiveKit supports both — the pipeline path is simply the one it was originally designed for, and the one it still optimizes best.
Reference architecture on LiveKit Agents 1.x
A production LiveKit agent has five moving parts and one traffic pattern. The user joins a LiveKit room. An agent worker picks up the job, subscribes to the user’s audio track, streams it through STT, passes partial transcripts to the LLM, streams LLM tokens into TTS, and publishes the TTS audio back into the room — all while a turn-detection model decides when the user has finished talking.
```python
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import deepgram, anthropic, cartesia, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # Wire the pipeline: VAD -> STT -> LLM -> TTS, plus the trained turn model
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3", language="en-US"),
        llm=anthropic.LLM(model="claude-haiku-4-5"),
        tts=cartesia.TTS(model="sonic-3", voice="professional-warm"),
        turn_detection="livekit",  # LiveKit's trained turn-detection model
    )

    agent = Agent(
        instructions="You are a polite scheduling assistant. "
        "Always call check_availability before suggesting a time.",
        tools=[check_availability, book_meeting],  # function tools defined elsewhere
    )

    await session.start(agent=agent, room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
That is the full shape. LiveKit’s dispatcher hands the job to an idle worker, the AgentSession wires the plug-ins together, and the livekit turn-detection model decides when the caller is done talking. Everything else is business logic — tools, prompts, state machines, handoffs to humans. For a deeper walkthrough of the same pattern, see our LiveKit AI agents guide.
Need a voice agent shipped, not a research project?
Fora Soft has built LiveKit agents for EdTech, healthcare intake, outbound sales, and B2B support desks. We hit sub-700 ms latency and production compliance in 3–6 weeks, not 3–6 months.
Book a 30-minute scoping call →
STT: Deepgram vs AssemblyAI vs Gladia vs Soniox
Speech-to-text is the one stage where pure latency is cheap and accuracy is everything. A 150 ms vs 300 ms STT makes little difference to the perceived conversation; a 5% vs 12% WER is the difference between “it worked” and “it called the wrong customer.”
| Provider | Latency | Accuracy (en) | Price | Best for |
|---|---|---|---|---|
| Deepgram Nova-3 | <300 ms | 5.26% WER | $0.0043/min | General purpose, cheapest |
| AssemblyAI Universal-3 Pro | P50 150 ms | 5.65% WER | $0.37/hour | Entity capture (emails, IDs) |
| Gladia Solaria-1 | 103 ms partial | −29% WER on conv. | Custom | Accented, noisy, code-switched |
| Soniox | streaming | Best multilingual | Custom | Legal, medical, multilingual |
| OpenAI Whisper | ~500 ms chunked | Good | $0.02–0.06/hr | Offline / batch / low budget |
Our default is Deepgram Nova-3 for English-first US traffic, AssemblyAI Universal-3 Pro when the agent has to read back account numbers or emails, and Gladia for call centers with heavy accents or code-switching. Whisper belongs in batch jobs and transcription exports, not live calls.
LLM: Claude Haiku, Gemini Flash, or GPT-5
The LLM is where most voice-AI budgets break. TTFT (time-to-first-token) matters far more than total throughput — the agent must start speaking quickly, not finish quickly. A 600 ms TTFT that streams out at 80 tokens/sec feels faster than a 300 ms TTFT that stalls at the first sentence.
| Model | TTFT | Tokens/sec | Price (in) | Best for |
|---|---|---|---|---|
| Claude Haiku 4.5 | 597 ms | 78.9 | $0.80 / 1M | Default — fastest TTFT at voice-grade quality |
| Gemini 2.5 Flash | ~800 ms | 146.5 | $0.075 / 1M | Long responses, cheap, multimodal |
| GPT-5 | 0.9–1.2 s | ~60 | Premium | Complex reasoning, multi-hop tool calling |
| Groq (Llama 3.3 70B) | ~250 ms | ~300 | $0.59 / 1M | Low latency on a budget, OSS models |
Haiku 4.5 is our default for voice. Groq sits on the shortlist for anything latency-obsessed (sales dialers, gaming NPCs) where Llama-grade reasoning is enough. Reach for GPT-5 only when the agent is doing genuine reasoning — multi-step tool chains, constrained policy, high-value outcomes. It is too slow for casual conversation.
TTS: ElevenLabs, Cartesia, Aura, Rime, or Grok
TTS is the part users actually judge. It is also the stage where prices vary 10× for quality differences most callers never notice.
| Provider | TTFB | Price | Strength |
|---|---|---|---|
| ElevenLabs Flash v2.5 | 75 ms | $0.050 / 1k chars | Naturalness, cloning, 4,000+ voices |
| Cartesia Sonic-3 | 40–90 ms | $0.006 / min | Cheapest, fastest, SSM-based |
| Deepgram Aura-2 | 90 ms | $0.030 / 1k chars | Best value, healthcare/IVR tuned |
| Rime Mist | <200 ms | $0.005 / min | Conversational emotion, US accents |
| Grok TTS (xAI) | streaming | $4.20 / 1M chars | Code-switching, LiveKit plug-in |
We reach for Cartesia Sonic-3 when latency and cost matter and brand voice does not; Deepgram Aura-2 for the best naturalness-per-dollar ratio; ElevenLabs Flash when the client licenses a cloned voice or a celebrity talent. Spend the extra only where it changes the business case. On a B2B scheduling agent, nobody books a meeting because the voice has better vibrato.
Turn detection, VAD, and graceful interruption
Voice activity detection answers “is someone speaking right now?” Turn detection answers the harder question: “did they finish speaking, or just pause?” The difference is the gap between a conversation that breathes and one that steps on every other word.
LiveKit ships a trained turn-detection model that outperforms fixed-threshold VAD by a wide margin on natural speech. It looks at prosody, filler words, and lexical cues — not just volume — and tunes the wait window from 0 to 2 seconds dynamically. Combined with Krisp’s Background Voice Cancellation, it also handles the “coworker talking across the office” case that breaks every naive VAD.
Interruption (barge-in) is the other half of this story. When the caller cuts in, the agent has to stop talking within 150 ms, flush its TTS buffer, and hand the new utterance back to STT. LiveKit handles it automatically. The only common bug is the agent interrupting itself — usually caused by a noisy TTS signal bleeding back through the user’s mic. Fix with acoustic echo cancellation on the client or BVC on the agent.
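Both halves are tunable on the session. A sketch of the knobs — the parameter names follow the 1.x `AgentSession` options, so verify them against your installed SDK version:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, anthropic, cartesia, silero

session = AgentSession(
    vad=silero.VAD.load(),
    stt=deepgram.STT(model="nova-3"),
    llm=anthropic.LLM(model="claude-haiku-4-5"),
    tts=cartesia.TTS(model="sonic-3"),
    turn_detection="livekit",
    allow_interruptions=True,       # caller can barge in at any time
    min_interruption_duration=0.5,  # ignore coughs and sub-500 ms blips
    min_endpointing_delay=0.5,      # shortest wait once the turn model says "done"
    max_endpointing_delay=2.0,      # longest wait when it suspects a mid-thought pause
)
```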
Noise cancellation & Krisp: now a line item
LiveKit Cloud ships licensed Krisp models for two jobs: removing ambient noise (fans, traffic, keyboard) and cancelling background voices (a spouse on the phone, an open-plan office). Both improve STT accuracy by 10–20% on noisy channels and dramatically reduce false interruptions.
Starting 1 May 2026, Krisp usage is metered on top of the base agent minute. Budget $0.002–$0.004 per minute extra if you leave it on. Turn it off for clean-room audio (in-app desktop assistants, studio-quality microphones) to save the line item; leave it on for PSTN and mobile traffic where it pays for itself in STT accuracy.
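When you leave it on, enabling it is a single option at session start. A sketch assuming the `livekit-plugins-noise-cancellation` package and the reference entrypoint above (BVC runs on LiveKit Cloud only):

```python
from livekit.agents import RoomInputOptions
from livekit.plugins import noise_cancellation

# extends the session.start() call from the reference architecture
await session.start(
    agent=agent,
    room=ctx.room,
    room_input_options=RoomInputOptions(
        # Background Voice Cancellation; swap in noise_cancellation.NC() for
        # ambient-only, or drop the option entirely for clean-room audio
        noise_cancellation=noise_cancellation.BVC(),
    ),
)
```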
Turn Krisp off when…
Your users wear headsets in controlled environments (call-center desks, studio mics, in-app WebRTC from laptops), or a separate noise-suppression stage already runs client-side.
Function calling that actually ships
A voice agent that cannot do anything is a chatbot that happens to talk. Function calling is the difference between “thanks for calling” and “I’ve booked you for Tuesday at 2.” Three rules keep it honest in production:
- Echo back every entity the agent captured. “I heard four-one-five, two-two-five… is that right?” STT mishears digits and spelled-out strings more than anything else; AssemblyAI Universal-3 Pro is meaningfully better at entities than the rest, but nothing is perfect.
- Call synchronously, narrate asynchronously. Say “let me pull that up” while the tool executes. Do not leave a 2-second silence. Users assume the line dropped.
- Cap retries and budget. One hallucinating model with one broken tool can redial the same API 40 times in 10 seconds. Add hard caps on tool calls per turn and cost per session.
Claude and GPT-5 both support parallel tool calling, so a well-designed schema lets the agent fire “check calendar”, “lookup customer”, and “pull policy” concurrently and synthesize the answer in one turn. This is the single biggest hidden latency win we see teams miss.
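In LiveKit Agents, a tool is a decorated async function whose docstring becomes the schema. A minimal sketch of rules one and three — `@function_tool` and `RunContext` are the 1.x SDK names, while the dict-style userdata and the `calendar_api` client are illustrative assumptions:

```python
from livekit.agents import RunContext, function_tool

MAX_TOOL_CALLS_PER_SESSION = 20  # hard cap: trip it and offer a human handoff

@function_tool
async def check_availability(context: RunContext, date: str, time: str) -> str:
    """Check whether a slot is free. date as YYYY-MM-DD, time as HH:MM (24-hour)."""
    # assumes the AgentSession was constructed with a plain-dict userdata
    calls = context.userdata.get("tool_calls", 0) + 1
    context.userdata["tool_calls"] = calls
    if calls > MAX_TOOL_CALLS_PER_SESSION:
        return "BUDGET_EXCEEDED: apologize and offer to transfer to a human."
    free = await calendar_api.is_free(date, time)  # hypothetical backend client
    return "available — read the slot back to confirm" if free else "taken"
```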
True cost per minute: all-in, no surprises
Every managed platform advertises a headline rate — Vapi at $0.05, Retell at $0.07, Bland at $0.09 — and every one of them is the platform fee only. The real all-in cost once you add STT, LLM, TTS, and telephony lands between $0.11 and $0.25 per minute. Here is what three representative LiveKit stacks look like in 2026:
| Stack | STT | LLM | TTS | Platform | Total/min |
|---|---|---|---|---|---|
| Budget | Deepgram Nova-3 | Groq Llama 3.3 | Cartesia Sonic-3 | LiveKit Cloud | ~$0.05 |
| Production | Deepgram Nova-3 | Claude Haiku 4.5 | Deepgram Aura-2 | LiveKit Cloud | ~$0.10 |
| Premium | AssemblyAI Pro | GPT-5 / Claude 4.5 | ElevenLabs Flash | LiveKit Cloud | ~$0.18–0.22 |
Multiply by your call volume. A sales dialer running 50,000 minutes a month on the Production stack costs $5,000 in infrastructure — roughly one human SDR’s fully loaded cost, and the agent works 24/7 in fifteen languages.
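The arithmetic is worth scripting once, so a pricing change takes seconds to re-run. A trivial sketch; every per-minute rate below is an illustrative 2026 figure, not a quote:

```python
def all_in_per_minute(stt: float, llm: float, tts: float, platform: float,
                      krisp: float = 0.0, telephony: float = 0.0) -> float:
    """Sum of per-minute stage costs for one agent minute."""
    return stt + llm + tts + platform + krisp + telephony

# Production stack, illustrative rates
rate = all_in_per_minute(stt=0.0043, llm=0.045, tts=0.030, platform=0.010,
                         krisp=0.003, telephony=0.007)
print(f"${rate:.3f}/min -> ${rate * 50_000:,.0f}/month at 50k minutes")  # ~$0.099 -> ~$4,965
```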
LiveKit vs Vapi, Retell, Bland, Synthflow
At 10,000 minutes a month (a busy mid-market use case) the platforms sort cleanly by total cost of ownership:
| Platform | All-in cost | Time to first call | Ceiling |
|---|---|---|---|
| LiveKit (build) | $500–$600 | 2–4 weeks | None, open source |
| Retell AI | ~$1,150 | 3 hours | Custom tools limited |
| Bland AI | ~$1,200 | 1 day | Outbound dialer focus |
| Synthflow | ~$1,300 | 1–2 days | No-code, platform lock-in |
| Vapi | ~$1,400 | 1 day | Flexible API, highest cost |
Figure 2. 10,000-minute/month TCO comparison, 2026 rates.
The numbers say the obvious thing: Retell or Vapi for validation and sub-5K-minute prototypes, LiveKit the moment the use case is real. The breakpoint is somewhere between 10,000 and 20,000 minutes per month — past that, LiveKit’s margin pays back the engineering investment within six months.
LiveKit Cloud vs self-hosted vs Telnyx
Three ways to run LiveKit Agents in production, each with a different trade-off:
- LiveKit Cloud. $0.005/min audio-only, $0.01/min with an agent. Managed dispatch, observability, regional edges, SOC 2, BAA. Zero ops work. Default choice.
- Self-hosted. Zero platform fee, you pay STT/LLM/TTS directly. Break-even above 50,000 min/month if your DevOps is already in place; below that, managed wins. Pick this for regulated workloads where the platform cannot touch the audio.
- LiveKit on Telnyx (April 2026). Telnyx hosts LiveKit infrastructure and bundles telephony. Advertises 50% lower STT/TTS costs than LiveKit Cloud for the same stack — worth pricing if you need PSTN at scale.
One gotcha since February 2026: LiveKit observability data is processed in the US regardless of your media region. If GDPR residency forbids US processing of any call metadata, disable project-level observability and log to your own stack. Media itself (audio/video) stays in the region you pick.
Compliance: HIPAA, SOC 2, TCPA, GDPR
Voice touches three compliance surfaces at once — PHI (if your users discuss health), PII (every customer record), and recording consent (every state, many with teeth). Platforms help; they do not save you.
- HIPAA. LiveKit signs BAAs. So do OpenAI, Deepgram, and ElevenLabs at the enterprise tier. Verify each vendor in your stack and keep a countersigned PDF per contract review cycle.
- SOC 2 Type II. LiveKit, OpenAI, Deepgram, AssemblyAI, ElevenLabs, Cartesia all certified. Pull their reports for your own audit.
- TCPA. Since the FCC ruling of 8 Feb 2024, AI-generated voice on outbound calls requires documented prior express written consent. Store it alongside the phone number and surface it on every dial.
- Two-party recording consent. California, Florida, Pennsylvania, Illinois and seven other states require all-party consent. Play a disclosure at the start of every recorded call — and actually record the disclosure.
- GDPR. Recording consent under Article 6 must be explicit, not “legitimate interest.” Ensure data-processing agreements with every provider in the stack and pin media to an EU region.
Our production defaults: real-time PII redaction on all logs (scrub emails, phone numbers, card numbers before they hit observability), per-call consent metadata, audit trails with 90-day retention minimum, and an annual red-team pass on the agent’s prompt boundaries. Cutting corners here is what ends careers; see our video streaming security playbook for a deeper walkthrough of the same thinking applied to video.
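For the redaction default, a regex pass in front of every log sink is the right first step. A minimal sketch — the patterns are deliberately broad and need tuning per locale:

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scrub(text: str) -> str:
    """Replace PII with typed placeholders before a transcript reaches observability."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(scrub("Reach me at +1 415 555 0100 or jane@acme.com"))
# -> "Reach me at [PHONE] or [EMAIL]"
```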
HIPAA or GDPR in scope?
We’ve shipped voice agents with full BAA stacks for clinical intake and telehealth. If your call has PHI in it, you need more than a provider checkbox — you need the pipeline, the logs, the retention, and the red-team to hold up under audit.
Talk to a compliance-aware engineer →
Use cases that actually pay for themselves
Four patterns return investment in under six months. Everything else is experimentation:
Tier-1 support deflection
Password resets, order status, balance checks, policy FAQs. A voice agent deflects 40–60% of call volume at $0.08–$0.15 per call versus $2–$5 with a human agent. For a 100,000-call/month support line, that is $150,000–$300,000 a month returned to the P&L.
Outbound sales & lead qualification
Inbound-lead callbacks go from “4-hour average response” to “30-second average.” Meeting yield typically doubles or triples. One enterprise we worked with saw outbound qualification cost fall from $48 per meeting booked to $9.
Clinical intake & appointment confirmation
Large clinic networks use voice agents for pre-visit history taking and 24-hour confirmation calls. 70–80% of confirmations complete fully automated, freeing front-desk staff for walk-ins. HIPAA stack required — see above.
In-app voice tutors & assistants
This one sits closest to our Fora Soft portfolio. We’ve built voice-first learning platforms, real-time coaching apps, and in-browser WebRTC agents for education clients like Career Point, and the playbook is the same: LiveKit client SDK in the browser, an agent worker with a Haiku + Cartesia stack, custom tools for the lesson state, and a feedback loop that records interactions for pedagogy review.
Seven failure modes we see in production
In order of how often they ruin demos:
- Cold-start latency. First call after deploy hits 3–5 seconds. Fix with idle warm-up calls and connection-pool preload on boot — see the prewarm sketch after this list.
- Interruption misfires. Agent keeps talking over the user, or cuts them off mid-word. Tune VAD, enable BVC, and test with 20 real users before launch.
- Hallucinated facts. Agent invents a confirmation number. Enforce function-call grounding for anything quotable; ban free-text answers to factual queries.
- Entity-capture errors. “My email is john.doe” becomes “johndough”. Switch to AssemblyAI Universal-3 Pro for entity-heavy flows.
- Cascading region latency. STT in us-east, LLM in us-west, TTS in Frankfurt. Pin every provider to one region; pay the egress if you must.
- Cost runaways. Retry loop fires 40 LLM calls in ten seconds after a failed tool. Hard cap tool calls per turn and dollars per session.
- PII in logs. Full transcripts with SSNs end up in CloudWatch. Scrub at the edge before anything touches observability.
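For the cold-start item, LiveKit workers expose a prewarm hook that runs once per process before any job is dispatched — the standard place to load the VAD model. A sketch following the 1.x worker pattern:

```python
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.plugins import silero

def prewarm(proc: JobProcess):
    # runs once per worker process, before any call arrives
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    vad = ctx.proc.userdata["vad"]  # reuse the preloaded model; no per-call load
    # ... build the AgentSession with vad=vad, as in the reference architecture

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
```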
KPIs: what to measure from day one
A voice agent without instrumentation is a demo. Track these from the first production call:
- Turn latency P50 / P90 / P99. Averages lie; the P99 tail is what makes users hang up. A percentile sketch follows this list.
- Task completion rate. Did the agent finish what the caller wanted without escalation? 50–70% is the realistic benchmark for well-tuned support bots.
- Handoff rate. Escalations to humans per call. If it climbs, a tool is broken or the prompt is drifting.
- Cost per successful outcome. Not cost per minute — cost per meeting booked, per password reset, per appointment confirmed.
- Interruption rate. How often did the caller talk over the agent? Rising means the voice is too long-winded.
- Transcription WER. Sample 100 calls a week and hand-score. Drift here is the canary for every other metric.
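Percentiles are cheap to compute from whatever turn-latency log you already keep. A plain-Python sketch, assuming one end-to-end latency sample per turn, in milliseconds:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P90/P99 from raw turn latencies; stable from a few hundred samples up."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p90": qs[89], "p99": qs[98]}

turns = [612, 655, 701, 688, 1430, 640, 598, 2210, 677, 703]  # illustrative samples
print(latency_percentiles(turns))
```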
When NOT to build on LiveKit
Four cases where a managed platform beats LiveKit honestly:
- Deterministic IVR. “Press 1 for sales, press 2 for support.” Use Twilio Studio and move on.
- Proof of concept on a tight deadline. Retell AI ships in 3 hours, Vapi in a day. If you need a demo tomorrow, do not build. Buy.
- Sub-5K minutes/month forever. The engineering payback period stretches past a year. The managed fee is cheap insurance.
- Non-technical team, no DevOps. Synthflow’s no-code builder beats a broken Python agent that nobody can maintain.
The honest question is “will this matter enough to justify owning it?” If the voice agent is core to the product or the savings are meaningful past ten thousand minutes a month, the answer is yes — and LiveKit is the right bet. Otherwise, start managed, migrate later.
FAQ
How long does it take Fora Soft to ship a production LiveKit voice agent?
Three to six weeks for a scoped use case (one workflow, one language, one compliance regime). Our Agent Engineering pipeline compresses what used to be a three-month delivery into half that, so typical costs land below market even on the premium stack.
Can LiveKit handle PSTN calls or is it only WebRTC?
Both. LiveKit bridges to PSTN via Twilio or Telnyx SIP. Since April 2026, Telnyx offers a native “LiveKit on Telnyx” product that bundles the two with roughly 50% lower STT/TTS costs than LiveKit Cloud alone.
What’s the realistic latency for a sub-700 ms agent?
With Deepgram Nova-3 streaming STT, Claude Haiku 4.5 streaming LLM, and Cartesia Sonic-3 streaming TTS, all pinned to the same region and with short system prompts, we routinely measure 550–700 ms P50 end-to-end on WebRTC clients. PSTN hops add 80–150 ms.
Should I use OpenAI Realtime API instead?
If latency is the single most important thing and you are fine locking to GPT-5 as your LLM, yes — 200–300 ms end-to-end is unbeatable. LiveKit supports Realtime as a plug-in, so you can try both without rewriting the agent. For compliance, tool-calling reliability, or custom voices, pipeline still wins.
How do we stop cost blowouts from broken tool calls?
Three hard caps per session: maximum tool invocations, maximum LLM tokens, maximum wall-clock duration. Trip any of them and the agent says “let me transfer you to a human” and hangs up to a fallback. We also enforce monthly cost alarms per agent tenant so runaway behaviour gets noticed within minutes, not days.
Can the agent hand off to a human mid-call?
Yes. A transfer_to_agent tool is the standard pattern — the LLM decides (or the user requests) handoff, the LiveKit room adds a human participant, and the AI agent steps out. Transcript and context follow so the human picks up the thread instead of starting over.
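The shape of that tool, heavily simplified — `@function_tool` is the 1.x SDK decorator, while `invite_human_agent` is a purely hypothetical helper standing in for your own transfer mechanics (SIP transfer, warm-room join, ops alert):

```python
from livekit.agents import RunContext, function_tool

@function_tool
async def transfer_to_agent(context: RunContext, reason: str) -> str:
    """Hand the call to a human. Use when the caller asks for a person
    or when a hard budget cap has been tripped."""
    # hypothetical helper: SIP transfer via your telephony provider, or
    # adding a human participant to the current LiveKit room
    await invite_human_agent(reason=reason)
    return "A colleague is joining now. Summarize the call in one sentence, then stop."
```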
Is LiveKit a good fit for multilingual calling?
Very. Gladia Solaria-1 handles code-switching in a single stream (caller switches between Spanish and English mid-sentence), Soniox leads on 30+ languages, and Grok TTS recently shipped LiveKit-native multilingual voices. The pipeline stays the same; only the STT/TTS plug-ins change.
How do we benchmark our agent before launch?
Run 50–100 scripted conversations covering the top task types, measure turn-latency P99, task-completion rate, and hallucination rate. Then add 20 real human evaluators doing unscripted calls. The gap between the two tells you whether the agent is overfit to the prompt. We run this loop on every release at Fora Soft.
What to read next
Framework deep-dive
LiveKit AI Agents: The Engineer’s Guide
The bolts-and-wires companion to this playbook — SDK internals, worker patterns, deployment.
Video stack
Building a Video Streaming App in 2026
VOD, live, and conferencing on the same realtime transport LiveKit runs on.
Vendor alternatives
Agora.io Alternatives for Realtime Voice & Video
The comparison that made many teams consider LiveKit in the first place.
Case study
Career Point: AI Coaching Built with LiveKit + Oxford
How Fora Soft shipped a voice-first coaching platform with sub-800 ms latency.
Compliance
Security Features for Realtime Media Apps
Encryption, recording consent, and audit logging patterns that generalize to voice AI.
How we build
Agent Engineering at Fora Soft
Why our delivery times on AI-heavy projects are faster and cheaper than the market.
Ready to ship voice AI that sounds human?
Tell us your use case. We’ll tell you the fastest production path and the honest number.
30-minute scoping call with a senior engineer who has actually shipped LiveKit agents — no sales pitch, no NDA required.
Book a 30-minute call →
Last updated April 2026 with current LiveKit Agents, model, and pricing data. Sources include LiveKit documentation, Deepgram, ElevenLabs, Cartesia, AssemblyAI benchmarks, and Fora Soft production deployments.

