Blog: AI + WebRTC: How Smart Agents Are Changing Real-Time Communication

Key takeaways

AI agents on WebRTC now respond at conversational speed. Speech-to-speech models like OpenAI gpt-realtime and Gemini Live hit 150–300ms end-to-end; cascading STT→LLM→TTS pipelines land at 500–800ms — both feel natural in conversation.

Two architectures dominate. Speech-to-speech is fastest and simplest but locks you into one vendor; cascading pipelines (Deepgram + open LLM + ElevenLabs/Cartesia) cost 3–5× less and let you swap models per call.

Cost spread is huge. OpenAI Realtime burns ~$0.20/min; a tuned cascade on Deepgram Nova-3 + Llama 8B + Cartesia Sonic runs $0.04–0.09/min. Pick the architecture that matches the unit economics of your call type.

Turn-taking, not raw latency, makes it feel human. VAD alone fails on noise and back-channels; production agents layer VAD, end-of-turn ML models, acoustic echo cancellation (AEC) and instant TTS pre-emption to handle barge-in inside ~300ms.

Compliance is the gate. If you touch healthcare, finance or EU users, the stack has to ship with HIPAA, SOC 2 Type II and GDPR data residency from day one — bolting it on after pilot doubles the budget.

Why Fora Soft wrote this playbook

We have shipped real-time video and voice products for 21 years, across 625+ launches. WebRTC is the layer we touch every day — from BrainCert’s virtual classroom (1M+ learners, 500M+ minutes delivered across 10 datacenters) to VOLO.live, our real-time AI translation platform that ran for 22,000+ attendees at Black Hat Briefings 2025 and HIMSS.

When we say “AI agents on WebRTC” we mean the production reality: a peer joining the SFU as a participant, listening to a live RTP stream, calling a model in 200ms, streaming TTS back before the user finishes their next breath. That is harder than it looks. This guide is the synthesis we hand our own architects when a client asks “what should I actually build?” — an opinionated map of the four architectures that work, the cost math behind each, and the pitfalls that quietly burn pilots.

Read it as a CTO would: skip to the comparison matrix, the cost model, or the decision framework if that is what you need first. Or book a 30-min architecture review and we will walk through your stack live.

Choosing between OpenAI Realtime and a cascade?

We will benchmark both on your traffic profile and pick the one that hits your latency budget for the lowest unit cost. No deck, just numbers.

Book a 30-min call →

What actually changed in 2024–2026

Three things moved AI + WebRTC from demo to deployment. First, native speech-to-speech models — OpenAI’s gpt-realtime and Google’s Gemini Live ingest audio and emit audio without intermediate transcription, cutting two layers of latency and preserving prosody. Second, ultra-fast TTS: ElevenLabs Flash v2.5 ships at ~75ms time-to-first-audio, Cartesia Sonic at ~40ms, which makes the “I heard you” signal land before users get impatient. Third, agent frameworks built on top of SFUs: LiveKit Agents, Pipecat and Daily/Vapi turned a six-week integration into a one-week scaffold.

The result: a conversational AI that joins a WebRTC room as a participant, hears the audio track, holds a 30-turn conversation with state, hands off to a human with full context, and costs less per call than a human seat. In 2023 that was a demo. In 2026 it is the new baseline expectation in customer support, sales qualification, telehealth intake and live tutoring.

WebRTC in 90 seconds, for product owners

WebRTC is the open standard browsers and mobile apps use to exchange audio, video and data with sub-200ms latency. The full primer is in our WebRTC architecture guide; the short version is four pieces:

  • Media tracks — audio and video carried over UDP/SRTP with built-in DTLS encryption.
  • Signaling — the side-channel (your own server) that exchanges SDP offers/answers so peers can find each other.
  • ICE / STUN / TURN — NAT traversal so two peers behind firewalls actually connect.
  • SFU or P2P topology — for two participants peer-to-peer is enough; from three up you almost always want a Selective Forwarding Unit (LiveKit, mediasoup, Janus) routing media.

For an AI agent, the SFU is the connection point: the agent dials in as a normal participant, subscribes to the audio track of every human in the room, and publishes its own synthesized audio back. No special protocol — the same WebRTC the browser uses. That symmetry is what made the integration tractable.
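
To make “dials in as a normal participant” concrete, here is a minimal sketch using the livekit Python SDK. Class and event names follow the SDK at the time of writing and may drift between versions; the URL, token and handler bodies are placeholders.

  import asyncio
  from livekit import rtc  # pip install livekit

  async def consume_audio(track: rtc.Track):
      # AudioStream yields PCM frames we can hand to VAD / STT.
      async for event in rtc.AudioStream(track):
          pass  # feed event.frame into the STT pipeline here

  async def main():
      room = rtc.Room()

      @room.on("track_subscribed")
      def on_track_subscribed(track, publication, participant):
          # Subscribe to every human audio track the SFU exposes to us.
          if track.kind == rtc.TrackKind.KIND_AUDIO:
              asyncio.create_task(consume_audio(track))

      # Join exactly like a browser client would: SFU URL plus an access token.
      await room.connect("wss://your-sfu.example.com", "<agent-access-token>")

      # Publish the agent's synthesized voice back as a normal audio track.
      source = rtc.AudioSource(48000, 1)  # 48kHz mono
      voice_track = rtc.LocalAudioTrack.create_audio_track("agent-voice", source)
      await room.local_participant.publish_track(voice_track, rtc.TrackPublishOptions())
      # ...push TTS frames into `source` via source.capture_frame(...)

      await asyncio.Future()  # keep the worker alive

  asyncio.run(main())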

The four architectures that actually ship

Almost every production AI + WebRTC system collapses into one of four patterns. Pick by your latency budget, your willingness to be vendor-locked, and the call volume you need to serve.

1. Speech-to-speech (OpenAI Realtime, Gemini Live)

A single multimodal model takes audio in and emits audio out. The agent connects via WebRTC (browser) or WebSocket (server) directly to OpenAI or Google. Lowest end-to-end latency (150–300ms), best prosody, simplest code — but you cannot swap the LLM, you cannot inspect the “reasoning,” and the per-minute cost is the highest on the market.

Reach for speech-to-speech when: latency must be under 400ms, the conversation is short (under ~3 minutes) and you can absorb $0.15–0.20/minute of audio cost. Sales bots, voice-first onboarding, premium IVR.
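
For a sense of how little transport code is involved, here is a minimal server-side sketch against the Realtime WebSocket API using aiohttp. The model name and event payloads are assumptions to verify against OpenAI’s current docs, and a real agent would stream audio continuously with server-side VAD rather than buffering one turn at a time.

  import asyncio, base64, json, os
  import aiohttp  # pip install aiohttp

  # Model name is an assumption; check OpenAI's current Realtime docs.
  URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
  HEADERS = {
      "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
      "OpenAI-Beta": "realtime=v1",
  }

  async def one_turn(pcm16_audio: bytes) -> bytes:
      """Send one buffered user turn, collect the agent's audio reply."""
      reply = b""
      async with aiohttp.ClientSession() as http:
          async with http.ws_connect(URL, headers=HEADERS) as ws:
              # Manual turn flow: append audio, commit the buffer, ask for a response.
              await ws.send_json({"type": "input_audio_buffer.append",
                                  "audio": base64.b64encode(pcm16_audio).decode()})
              await ws.send_json({"type": "input_audio_buffer.commit"})
              await ws.send_json({"type": "response.create"})
              async for msg in ws:
                  event = json.loads(msg.data)
                  if event["type"] == "response.audio.delta":
                      reply += base64.b64decode(event["delta"])  # stream this back over WebRTC
                  elif event["type"] == "response.done":
                      break
      return reply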

2. Cascading STT → LLM → TTS pipeline

The classic stack: Deepgram or AssemblyAI streams transcripts; an LLM (GPT-4o, Claude, Llama 3 8B/70B) generates a response; ElevenLabs or Cartesia synthesizes speech back. Orchestrated by LiveKit Agents or Pipecat. Total latency 500–800ms, but every layer is swappable, observable and cheap. Most enterprise deployments live here.

Reach for a cascade when: you need to A/B different LLMs, log every transcript for compliance, or push unit cost under $0.10/min at scale. Contact centers, healthcare intake, customer support.
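
To show how thin the glue code can be, below is a minimal sketch of this exact cascade on LiveKit Agents. Plugin and class names follow the LiveKit Agents Python SDK and may shift between versions; the instructions string is a placeholder.

  from livekit import agents
  from livekit.agents import Agent, AgentSession
  from livekit.plugins import cartesia, deepgram, openai, silero

  async def entrypoint(ctx: agents.JobContext):
      await ctx.connect()  # join the room as a normal participant

      session = AgentSession(
          vad=silero.VAD.load(),                 # voice activity detection
          stt=deepgram.STT(model="nova-3"),      # streaming transcripts
          llm=openai.LLM(model="gpt-4o-mini"),   # swappable: Groq, Anthropic, a fine-tune
          tts=cartesia.TTS(),                    # low time-to-first-audio
      )
      await session.start(
          room=ctx.room,
          agent=Agent(instructions="You are a concise support agent."),
      )

  if __name__ == "__main__":
      agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))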

3. SIP / PSTN bridge to WebRTC agent

For inbound or outbound phone calls. A SIP trunk (Twilio, Telnyx) terminates into a WebRTC SFU; the agent runs the same as in pattern 2. Telephony adds 100–300ms of jitter, but the upside is PSTN reach and recognized regulatory paths. Vapi and Retell AI are the managed wrappers; LiveKit, Pipecat and Plivo are the build-it-yourself routes.

Reach for a SIP bridge when: the call originates on a phone number, not in your app. Outbound sales, appointment reminders, insurance claims, automotive service scheduling.

4. Multimodal vision agent on a live video track

The agent subscribes to both audio and video tracks, sends sampled frames to a vision model (GPT-4o, Gemini 2.5, Claude 4 multimodal), correlates with speech, and replies. Per-frame inference adds 400–800ms, so frame rates are dropped to 1–3 fps for reasoning while audio stays full-rate. Used for KYC verification, in-app guided onboarding, manufacturing QA, and remote field support. See our deep dive on multimodal agents with LiveKit for the production wiring.

Reach for multimodal vision when: the agent needs to see what the user sees — ID documents, a damaged product, a UI screen, a physical workspace. Telehealth triage, insurance claims with damage photos, AR-assisted field service.
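
A sketch of the frame-sampling side, assuming frames arrive as JPEG bytes from the video track and using OpenAI’s chat completions image input; the 1 fps cadence and helper name are illustrative, not a fixed recipe.

  import base64, time
  from openai import OpenAI

  client = OpenAI()
  SAMPLE_INTERVAL = 1.0   # reason over ~1 fps while audio stays full-rate
  _last_sample = 0.0

  def maybe_describe(jpeg_bytes: bytes, question: str) -> str | None:
      """Send at most one frame per SAMPLE_INTERVAL to the vision model."""
      global _last_sample
      now = time.monotonic()
      if now - _last_sample < SAMPLE_INTERVAL:
          return None  # drop this frame; vision inference is the expensive path
      _last_sample = now

      data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": question},
                  {"type": "image_url", "image_url": {"url": data_url}},
              ],
          }],
          max_tokens=150,
      )
      return response.choices[0].message.content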

Comparison matrix — the four architectures, side by side

Pattern | E2E latency | Cost / min | Vendor lock | Setup time | Best for
Speech-to-speech (OpenAI / Gemini) | 150–300ms | $0.15–0.20 | High | ~1 day | Voice-first onboarding, sales bots
Cascade STT→LLM→TTS | 500–800ms | $0.04–0.09 | Low | 2–3 weeks | Contact centers, healthcare intake
SIP / PSTN bridge | 600ms–1.2s | $0.06–0.16 | Medium | 2–4 weeks | Inbound / outbound phone calls
Multimodal vision agent | 700ms–2s | $0.10–0.30 | Medium | 4–6 weeks | KYC, telehealth triage, field service
Open-source on-prem (Whisper + Llama + Coqui) | 600–900ms | ~$0 marginal (CapEx) | None | 6–10 weeks | Privacy-critical, regulated, edge

The latency budget — where every millisecond goes

Humans start to perceive a pause around 250–300ms of silence. By 500ms it feels “a beat slow,” by 800ms users believe the line dropped, by 1.5s they hang up. To hold conversation feel, the round trip from “user stops talking” to “agent first audio plays” needs to land under 800ms — and ideally under 500ms.

In a cascade, the budget breaks down roughly as:

Stage | Typical (ms) | Aggressive (ms)
User audio → SFU (network) | 40–100 | 30
End-of-turn detection (VAD + endpointing) | 400–600 | 200–300
Streaming STT (Deepgram Nova-3 P50) | 200–300 | ~150
LLM time-to-first-token | 200–400 | 100–150 (Llama 8B / Groq)
TTS time-to-first-audio | 100–200 | 40–75 (Cartesia / EL Flash)
SFU → user audio | 40–100 | 30

Two stages dominate: end-of-turn detection and LLM TTFT. Cut endpointing aggressively and the agent talks over users; pick a slow LLM and the entire pipeline stalls. The art is moving the LLM’s first sentence into TTS as soon as the first 8–15 tokens stream out (“speculative speaking”), so audio starts before the model finishes thinking.
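
A minimal sketch of that hand-off, with hypothetical llm_stream and tts_speak callables standing in for whichever providers you wire up: flush the first complete sentence to TTS as soon as it closes, then keep flushing sentence by sentence while the model generates.

  import re

  SENTENCE_END = re.compile(r"[.!?]\s")
  MIN_TOKENS = 12  # don't speak on a fragment shorter than this

  async def speak_while_generating(llm_stream, tts_speak):
      """llm_stream: async iterator of text tokens (hypothetical).
      tts_speak: async fn that streams a text chunk to TTS (hypothetical)."""
      buffer, tokens, first_chunk_sent = "", 0, False
      async for token in llm_stream:
          buffer += token
          tokens += 1
          match = SENTENCE_END.search(buffer)
          if match and (first_chunk_sent or tokens >= MIN_TOKENS):
              # Speculative speaking: ship each sentence the moment it closes;
              # the first one goes out before the model has finished thinking.
              await tts_speak(buffer[: match.end()])
              buffer = buffer[match.end():]
              first_chunk_sent = True
      if buffer.strip():
          await tts_speak(buffer)  # whatever is left when the model stops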

Turn-taking, barge-in and why naive VAD breaks the demo

Latency makes the agent fast. Turn-taking makes it polite. Most failed pilots we have audited had clean STT and reasonable LLMs but felt unnatural because the agent stepped on the user, paused too long after backchannels, or could not be interrupted mid-sentence.

A production turn manager needs four layers:

1. Continuous VAD — a 30–50ms window classifier (Silero VAD, WebRTC VAD) that flags presence of voice. Cheap, runs on CPU, fires on any sound including coughs and keyboard clicks.

2. End-of-turn model — a small ML head (LiveKit ships one, Pipecat has SmartTurn) that scores “is this user actually done?” from the audio + last partial transcript. Cuts ~300ms off naive 800ms silence timeouts without choppy interruptions.

3. Acoustic Echo Cancellation — mandatory if the agent’s TTS plays through the user’s speaker and the mic re-captures it. Without AEC the agent will “hear itself” and respond to its own voice. Browsers ship usable AEC; native apps and headless agents need to wire it explicitly.

4. Instant TTS pre-emption — the moment a real interruption is detected (not just a “mm-hmm”), the audio buffer is flushed, the LLM generation cancelled, the conversation state rewound to the partial response, and STT picks up the new user turn. Target ~300ms from interruption start to silence on the agent side.
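
To make the pre-emption layer concrete, here is a minimal asyncio sketch; vad_events, partial_transcripts, tts, llm_task and state are hypothetical stand-ins for the components described above.

  BACKCHANNELS = {"mm-hmm", "uh-huh", "yeah", "right", "ok"}

  def is_backchannel(partial_transcript: str) -> bool:
      """Crude filter: short acknowledgements are not real interruptions."""
      return partial_transcript.strip().lower() in BACKCHANNELS

  async def barge_in_controller(vad_events, partial_transcripts, tts, llm_task, state):
      """vad_events / partial_transcripts: async iterators (hypothetical).
      tts.flush() must silence the agent within ~300ms of a real interruption."""
      async for _ in vad_events:                       # user started making sound
          partial = await anext(partial_transcripts)   # what did they actually say? (Python 3.10+)
          if is_backchannel(partial):
              continue                                  # keep talking through "mm-hmm"
          tts.flush()                                   # 1. stop audio immediately
          llm_task.cancel()                             # 2. stop generating the stale answer
          state.rewind_to_partial_response()            # 3. keep only what was actually spoken
          state.start_user_turn(partial)                # 4. the new user turn begins here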

The tooling layer — pick your framework, then your providers

Below is the shortlist of frameworks our team actually uses in production. None of them are wrong; they just optimize for different things.

Framework | Strength | Trade-off | Best fit
LiveKit Agents | Best SFU; native VoicePipelineAgent and MultimodalAgent; SOC 2 + HIPAA cloud | Higher base price than self-hosted | Enterprise teams that want managed scale
Pipecat (Daily) | MIT-licensed orchestration; modular, easy to swap providers | You own the deploy and observability | Cost-conscious teams with DevOps
Vapi / Retell AI | Phone-first; outbound dialing in days; managed compliance | Less control; markup over raw provider costs | Outbound sales, appointment reminders
OpenAI Agents SDK + Realtime | Fewest moving parts; best-in-class prosody | Single-vendor; highest per-minute | Premium voice products, MVPs
mediasoup + custom orchestrator | Full source control; on-prem and edge deployable | Months of build, not weeks | Regulated, on-prem, sovereign cloud

For STT we usually pick Deepgram Nova-3 (lowest latency, $0.0043/min) or AssemblyAI Universal-2 (better accuracy on accented English at $0.0019/min). For TTS, Cartesia Sonic when latency is the bottleneck and ElevenLabs Flash v2.5 when voice quality must convince. For LLMs the picture is fluid — Llama 3.1 8B on Groq for cost, GPT-4o or Claude 4 for tool-heavy workflows, and an in-house fine-tune for any domain you cannot afford to ship to a third party.

Stuck choosing between LiveKit, Pipecat and Vapi?

We have shipped on all three. Bring your call profile and budget; we will recommend the framework that hits both, plus the providers under it.

Book a 30-min call →

A reference architecture you can copy

A clean cascade for a B2B SaaS support voice agent looks like this:

  Browser / Mobile (WebRTC client)
        |  audio + video tracks
        v
  LiveKit / mediasoup SFU  ────────►  Recording & transcript store (S3 + Postgres)
        |  audio track subscribed
        v
  Agent worker (Python / Node)
   ├─ VAD (Silero)
   ├─ End-of-turn model
   ├─ Streaming STT (Deepgram Nova-3)
   ├─ LLM orchestrator (LangGraph / Pipecat)
   │   ├─ tool calls → CRM / KB / payments
   │   └─ guardrails + PII redaction
   ├─ Streaming TTS (Cartesia Sonic)
   └─ Barge-in / pre-emption controller
        |  publishes synthesized audio
        v
  SFU → user
        |
        └─► Observability (OpenTelemetry → Datadog / Grafana)

Every box on the right is optional in a pilot but mandatory at production: recording for QA, observability for latency SLOs, guardrails for PII redaction, tool calls for the agent to actually do something useful (book a meeting, refund an order, check a KB). The interesting design decision is where state lives — we keep conversation history in Redis with a 24-hour TTL and dump full session transcripts to Postgres for long-term analytics.
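
A sketch of that split, assuming a standard redis-py client; the key naming and 24-hour TTL mirror the description above, and archive_fn stands in for your own Postgres insert.

  import json, time
  import redis  # pip install redis

  r = redis.Redis(host="localhost", port=6379, decode_responses=True)
  TTL_SECONDS = 24 * 60 * 60  # conversation state lives for 24 hours

  def save_turn(session_id: str, role: str, text: str) -> None:
      key = f"conv:{session_id}"
      history = json.loads(r.get(key) or "[]")
      history.append({"role": role, "text": text, "ts": time.time()})
      r.setex(key, TTL_SECONDS, json.dumps(history))  # reset TTL on every turn

  def close_session(session_id: str, archive_fn) -> None:
      """archive_fn is your own Postgres insert (schema-specific, not shown)."""
      key = f"conv:{session_id}"
      history = json.loads(r.get(key) or "[]")
      archive_fn(session_id, history)   # long-term analytics / QA store
      r.delete(key)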

A worked cost model — 100,000 call-minutes a month

Numbers cut through architectural arguments faster than diagrams. Take a typical mid-market customer support deployment: 100,000 user-minutes per month, average call length 3.5 minutes, agent talks ~40% of the time. Below is the monthly bill for the three architectures we ship most often.

Cost line | OpenAI Realtime | Cascade (managed) | Cascade (self-hosted)
SFU media routing | $400 (LiveKit Cloud) | $400 (LiveKit Cloud) | $300 (Hetzner AX52 cluster)
STT | included | $430 (Deepgram Nova-3) | $190 (Whisper.cpp on-prem)
LLM | included (in $/min) | $1,800 (GPT-4o-mini) | $700 (Llama 3.1 8B / Groq)
TTS | included | $2,400 (ElevenLabs Flash) | $1,200 (Cartesia Sonic API)
Realtime model audio | $18,000 ($0.18/min avg) | n/a | n/a
Total / month (approx.) | ~$18,400 | ~$5,030 | ~$2,390

A 3.7× cost gap between the simplest path and the managed cascade (and roughly 7.7× against the self-hosted one) is enough to fund a small engineering team. We typically recommend the managed cascade for the first 6–12 months (it ships fastest and the team builds operational muscle), then migrate the LLM and TTS layers to self-hosted once volumes justify it. Numbers are illustrative; real bills depend on call mix, voice cloning costs, and reserved-instance discounts.
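
The same arithmetic as a few lines of Python, so you can plug in your own volumes and line items; the figures are the illustrative ones from the table, not vendor quotes.

  MINUTES_PER_MONTH = 100_000

  # Illustrative monthly line items (USD), matching the table above.
  STACKS = {
      "openai_realtime":     {"sfu": 400, "realtime_audio": 0.18 * MINUTES_PER_MONTH},
      "cascade_managed":     {"sfu": 400, "stt": 430, "llm": 1_800, "tts": 2_400},
      "cascade_self_hosted": {"sfu": 300, "stt": 190, "llm": 700,  "tts": 1_200},
  }

  for name, lines in STACKS.items():
      total = sum(lines.values())
      print(f"{name}: ${total:,.0f}/month (${total / MINUTES_PER_MONTH:.3f}/min)")

  # openai_realtime: $18,400/month ($0.184/min)
  # cascade_managed: $5,030/month ($0.050/min)
  # cascade_self_hosted: $2,390/month ($0.024/min)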

Mini-case — how VOLO.live runs real-time AI for 22,000 attendees

Situation. A live-event translation product needed to deliver real-time speech-to-text and translated voiceover for global conferences — HIMSS, Black Hat Briefings, GDC. Latency had to feel simultaneous, audio had to sync to the speaker on stage, and a single technical hiccup would be visible to thousands of paying attendees.

What we built. A WebRTC ingest from the venue, audio piped into Speechmatics and Google Cloud Speech for streaming STT, an AI translation layer producing both subtitles and voiceover in 25+ target languages, a NestJS backend orchestrating language switching, and a Next.js attendee app reachable through a venue QR code. Speakers and organizers got admin panels for live language enable/disable; attendees picked their language in two taps.

Outcome. Deployed for Black Hat Briefings 2025 (22,000+ attendees), HIMSS, GDC and other top-tier conferences. Real-time translation with sub-second perceived lag for subtitles and natural-sounding voiceover, plus a license-friendly multi-vendor STT/translation stack the customer can scale. The full case is on the VOLO project page. Want a similar walkthrough of your own real-time AI stack? Book a 30-min architecture review.

Security and compliance — what to design for from day one

WebRTC encrypts media in transit by default (DTLS-SRTP). The compliance work is everywhere else: at the SFU (which terminates encryption to route), at the LLM provider (which sees transcripts), at storage (recordings and transcripts), and at the consent layer (you must disclose AI participation in most jurisdictions).

1. HIPAA (US healthcare). You need a signed BAA with every vendor that touches Protected Health Information — SFU, STT, LLM, TTS, storage. LiveKit Cloud, Deepgram and OpenAI all offer BAAs on enterprise tiers; ElevenLabs only on its enterprise plan. PHI in transcripts must be encrypted at rest and access-logged.

2. GDPR (EU users). Data residency is the trap — the SFU and the LLM must run in the EU if any user data touches them. OpenAI offers EU data residency on enterprise; many open-source self-hosted stacks are simpler than the paperwork.

3. SOC 2 Type II. Required by most enterprise procurement. Audit covers security, availability and confidentiality of the whole stack. Pick vendors that already carry it (LiveKit, Deepgram, AssemblyAI, OpenAI, Cartesia) so the chain is intact.

4. PCI-DSS. If the agent ever needs a card number, route the audio segment through a tokenization vendor (Cresta, AudioCodes) so the LLM never sees the raw PAN. Never let GPT-4 transcribe a credit card.

5. Consent & AI disclosure. California, Illinois, Colorado and the EU AI Act all require clear disclosure that the user is talking to an AI. Bake the disclosure into the first 5 seconds of every call and log the user’s acknowledgement.
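
To make the redaction guardrail in item 4 concrete: transcripts can be scrubbed before they ever reach the LLM. The patterns below are a minimal illustration (card numbers, US SSNs, emails); real deployments use a dedicated redaction or tokenization service.

  import re

  # Minimal illustration only: production systems use dedicated redaction/tokenization.
  PATTERNS = {
      "CARD":  re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),   # 13–19 digits, a likely PAN
      "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US social security number
      "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
  }

  def redact(transcript: str) -> str:
      """Replace sensitive spans with typed placeholders before prompting the LLM."""
      for label, pattern in PATTERNS.items():
          transcript = pattern.sub(f"[{label} REDACTED]", transcript)
      return transcript

  print(redact("My card is 4242 4242 4242 4242 and my email is jane@example.com"))
  # → "My card is [CARD REDACTED] and my email is [EMAIL REDACTED]"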

Five pitfalls that quietly kill production agents

1. Cold-start TTFT. If the LLM container scales to zero, the first call after a quiet period waits 2–5 seconds for model load. Keep at least two warm replicas with a synthetic ping every 30 seconds, or use a managed endpoint that holds memory for you (Groq, OpenAI, Anthropic).

2. Context bloat. A 30-turn conversation can balloon past 10K tokens; LLM cost grows linearly, latency super-linearly. Summarize every 8–10 turns into a rolling state object and drop the raw transcript from the prompt; keep the full transcript in storage for QA. A minimal sketch follows this list.

3. Verbose TTS. Per-character TTS pricing means a chatty model is an expensive one. Cap responses at ~80 words, instruct the LLM to be terse, and prefer extractive answers (“your last invoice was $87”) over generative ones (“great question, let me explain…”).

4. No human handoff path. The agent will fail. Your stack must transfer the call to a human with the full transcript, a one-paragraph summary, and the open intent — not a cold transfer that makes the user repeat themselves. Measure handoff success rate as a top-line KPI.

5. Treating the SFU as “just plumbing.” SFU placement decides geography of latency: a US-East SFU serving an APAC user adds 200–300ms before the agent even hears them. Pick an SFU vendor with edge POPs that match your user map, or self-host on Hetzner / OVH / Equinix in your top regions.
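
To make pitfall #2 concrete, a rolling summary keeps the prompt flat no matter how long the call runs. A minimal sketch, with summarize_with_llm standing in for a cheap LLM call and the 8-turn window taken from the guidance above.

  SUMMARY_EVERY = 8   # fold raw turns into the rolling summary every N turns

  class RollingContext:
      """Keep the prompt small: a rolling summary plus only the most recent turns."""
      def __init__(self, summarize_with_llm):
          self.summarize_with_llm = summarize_with_llm  # hypothetical cheap LLM call
          self.summary = ""
          self.recent_turns = []       # what actually goes into the prompt
          self.full_transcript = []    # kept whole for QA / storage, never prompted

      def add_turn(self, role: str, text: str) -> None:
          turn = {"role": role, "text": text}
          self.recent_turns.append(turn)
          self.full_transcript.append(turn)
          if len(self.recent_turns) >= SUMMARY_EVERY:
              # Compress the window into the summary and drop the raw turns.
              self.summary = self.summarize_with_llm(self.summary, self.recent_turns)
              self.recent_turns = []

      def prompt_context(self) -> str:
          recent = "\n".join(f"{t['role']}: {t['text']}" for t in self.recent_turns)
          return f"Conversation so far (summary): {self.summary}\n\nRecent turns:\n{recent}"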

KPIs — what to measure once you are live

Quality KPIs. P50 and P95 round-trip latency (target P50 < 600ms, P95 < 1s), word error rate on STT (< 8% for support workloads), interruption false-positive rate (< 5%), AEC self-trigger rate (~0).

Business KPIs. Containment rate / call deflection (% of calls fully handled by AI, target 50–75% in support, 80%+ in self-service queries), CSAT on AI-handled calls (within 0.3 of human-baseline), cost per resolved interaction (target < 25% of human cost), upsell or qualified-lead rate on outbound.

Reliability KPIs. Agent uptime (≥ 99.9%), call drop rate (< 0.5%), handoff success rate to human (> 98% with full context), guardrail catch rate on PII / unsafe content (track precision and recall separately).

When not to ship an AI agent on WebRTC

Three scenarios where the answer is no, or not yet:

  • Highly emotional or safety-of-life calls — suicide hotlines, post-trauma intake, end-of-life decisions. AI is not the right first responder; route to humans, use the AI for backend support and post-call summarization.
  • Domains where hallucination cost outweighs the latency benefit — legal advice, controlled-substance prescribing, regulated financial advice. The cost of one wrong sentence outweighs three months of saved minutes.
  • Volumes below ~5,000 minutes a month — the integration, monitoring and compliance overhead does not amortize. Use a chatbot or a human until you cross the threshold.

A decision framework — pick your architecture in five questions

Q1. What is the strict latency ceiling? Under 400ms means speech-to-speech (OpenAI Realtime, Gemini Live). 500–800ms means a managed cascade. Above that, anything works.

Q2. What is the unit-cost ceiling per call? Below $0.10/min forces a cascade with cost-tuned providers (Deepgram + Llama 8B + Cartesia). $0.20+/min budget unlocks the simpler vendor stacks.

Q3. Where do calls originate? Inside your app → pure WebRTC. From phone numbers → SIP bridge through Vapi, Retell or LiveKit Telephony. Both → build the cascade once, plug both transports.

Q4. What compliance regime applies? HIPAA + EU GDPR + PCI together push you toward LiveKit Cloud + AssemblyAI + Anthropic Claude on AWS Bedrock with EU data residency, or fully self-hosted open-source.

Q5. Will the agent need vision? If yes (KYC, telehealth, field service), build on LiveKit Agents’ MultimodalAgent or Pipecat with a vision model running at 1–3 fps. Otherwise, stick to audio-only — vision triples cost and latency.

Want a second opinion on your AI voice stack?

A 30-minute review with our voice-AI architects: latency budget, cost per call, compliance gaps and a 12-week roadmap to production.

Book a 30-min call →

Use cases that already pay back in 2026

Tier-0 customer support. Voice agent fields password resets, order status, billing FAQs and routes everything else with full context. Industry deployments report 50–80% containment on these queries; cost per resolved interaction drops to roughly a quarter of a human seat.

Sales discovery and qualification. The agent runs the first call, asks ICP questions, scores the lead, books a meeting on the AE’s calendar. Useful for high-volume inbound where SDRs cannot keep up. See our deeper write-up on AI call assistants for vendor specifics.

Telehealth intake and follow-up. AI gathers symptoms, medication list and consent before the clinician joins, then handles routine post-visit follow-ups. Pairs naturally with our telemedicine platform work — HIPAA-grade WebRTC plus AI agent with PHI guardrails.

Live tutoring and onboarding. The agent joins a class, follows the lesson, answers learner questions, summarizes for the teacher. Built well, it lifts engagement without replacing the instructor — the same pattern BrainCert uses across 1M+ learners.

Real-time meeting copilots. Live transcription, action items, summaries delivered to email in minutes. The infrastructure is the same as a voice agent — WebRTC SFU + STT + LLM — minus the synthesized voice. Our meeting translation comparison covers the adjacent space.

A 12-week roadmap from zero to live agent

Weeks 1–2 — discovery and call analysis. Pull 200 representative recorded calls. Tag intents, escalation triggers, PII patterns, and the failure modes the agent must handle. Pick the architecture using the framework above.

Weeks 3–6 — pilot build. Wire SFU + STT + LLM + TTS for one intent (e.g. order status). Ship to internal users first; instrument latency and CSAT.

Weeks 7–9 — closed beta. 5–10% of real traffic, A/B against human baseline. Wire human handoff with context. Tune endpointing, AEC, and prompts.

Weeks 10–12 — production rollout. Expand to 100% of the chosen intent, add observability dashboards, set SLOs, plan the second intent. By end of week 12 you should have a defensible cost-per-resolution number and a clean migration plan to add the next workflow.

Where AI + WebRTC is heading next

Native multimodal models replace cascades. Single models that ingest audio + video + text and emit speech (and soon, video) will collapse three-vendor stacks into one API call. Expect lower latency and tighter cross-modal reasoning, at the cost of even more vendor concentration.

Edge inference normalizes. Open-source 7–13B LLMs running on NVIDIA Jetson or local AMD GPUs bring sub-200ms latency and zero data egress. Regulated industries (defense, healthcare, public sector) are the early adopters; expect retail and field-service to follow.

Agent-to-agent calls. Two AI agents negotiating a refund, scheduling between calendars, or completing supplier onboarding. Still experimental; the loop-prevention and authority models are not solved, but the protocol work is happening at LiveKit, Daily and the new W3C agent task forces.

Voice biometrics inside WebRTC. Identity confirmed by the user’s voice during the call rather than a separate step. Reduces friction in banking and healthcare; brings new privacy regulations.

FAQ

Is WebRTC secure enough for AI agents handling sensitive conversations?

WebRTC media is encrypted with DTLS-SRTP by default, which covers the wire. The exposure points are the SFU (it terminates encryption to route packets), the LLM provider (it sees transcripts), and your storage. For HIPAA, GDPR or SOC 2 you need BAAs / DPAs with every vendor on that path and access controls plus audit logs on storage.

Do I have to rebuild my WebRTC product to add an AI agent?

Almost never. The agent joins your existing SFU as a normal participant: it subscribes to the audio tracks of the humans in the room and publishes its own synthesized audio. If your SFU is LiveKit, mediasoup or Janus the integration is days, not weeks. The work that takes time is the prompt design, guardrails, observability and human-handoff path — not the WebRTC layer itself.

How long does a production deployment really take?

A focused pilot for one intent ships in 4–6 weeks. A production rollout with monitoring, compliance, human handoff and at least two intents typically takes 12 weeks. Multimodal vision agents add another 2–4 weeks for vision pipeline tuning.

What is the cheapest sane stack for a production voice agent?

A self-hosted Pipecat orchestrator on a Hetzner box, mediasoup as the SFU, Whisper.cpp for STT, Llama 3.1 8B served on Groq or local GPU, and Cartesia Sonic for TTS. Marginal cost per call drops below $0.05/min at moderate volume. The trade-off is you own the operations, observability and security work.

Will the agent replace our human support team?

Almost no successful deployment we have seen replaces humans wholesale. The pattern that works is AI handles the bulk of routine queries (50–80% containment) so humans focus on complex, emotional and revenue-critical conversations. Headcount usually stays flat while volume per agent doubles.

How do you measure AI agent success in production?

Three buckets: quality (P95 latency, WER, interruption accuracy), business (containment rate, CSAT vs. human baseline, cost per resolution), reliability (uptime, drop rate, handoff success with full context). Pick a primary KPI per call type — usually containment or conversion — and track everything else as guardrails.

Speech-to-speech (OpenAI Realtime) or cascade — which should I start with?

If you need to ship in days and call duration is short, start with OpenAI Realtime; you will validate the product before you optimize the stack. If unit economics matter from day one or compliance forces you to inspect every layer, start with a cascade on LiveKit Agents or Pipecat. Many of our clients prototype on Realtime, then migrate the chatty intents to a cascade once volume justifies it.

Can the agent handle non-English languages and accents?

Yes, but choose providers carefully. AssemblyAI Universal-2 and Deepgram Nova-3 cover 30–100 languages and hold up well on accented English. Cartesia and ElevenLabs ship multilingual voices; for languages outside the top 30, expect to fine-tune. Latency and accuracy both degrade somewhat outside English — budget extra QA time.

Related reading

  • Build and deploy LiveKit AI voice agents (build guide): step-by-step business guide to shipping a LiveKit-based agent.
  • Multimodal AI agents with LiveKit (multimodal): voice + vision, production wiring for camera-aware agents.
  • WebRTC architecture guide for 2026 (architecture): P2P, SFU, MCU and hybrid topologies explained for product owners.
  • AI voice assistant development in 2026 (voice AI): a complete guide for product owners commissioning voice AI.
  • Agora.io alternatives for custom WebRTC (vendor choice): LiveKit, mediasoup, Jitsi and Janus compared in production.

Ready to put an AI agent on your WebRTC stack?

AI agents on WebRTC moved from showpiece to production baseline in two years. Speech-to-speech models give you sub-300ms latency for premium voice experiences. Cascading pipelines built on LiveKit or Pipecat give you 3–5× lower unit costs and full observability. Either way, the SFU layer your team already runs is the connection point — the integration is days of code on top of months of tuning, compliance and QA.

If you have a real-time product and you are not at least piloting an AI agent, your competitors are. The opportunity is to ship the right architecture for your unit economics, not the trendiest one — and to engineer the unglamorous parts (turn-taking, AEC, handoff, compliance) that decide whether the agent feels human or not.

Ship a real-time AI agent in 12 weeks?

Bring your call profile, latency budget and compliance requirements. We will leave the call with a recommended stack, a 12-week plan and a defensible cost-per-call number.

Book a 30-min call →
