
Key takeaways
• gpt-realtime is GA, priced and stable. Audio in $32/M tokens (~$0.06/min), audio out $64/M (~$0.24/min), cached input drops to $0.40/M — an 80× saving when prompt caching is wired correctly. About $0.30/min all-in on a typical conversation.
• 800 ms voice-to-voice is achievable, not free. Time-to-first-byte from the API is ~500 ms in US regions, leaving ~300 ms for capture, VAD, network and rendering. Long sessions drift higher; you must rotate sessions or trim history aggressively.
• The transport choice is the architecture. WebRTC for browsers and apps, WebSocket when your server orchestrates compliance and tools, SIP for telephony. Pick wrong and you fight latency for the rest of the project.
• Realtime API is NOT yet HIPAA-eligible under the OpenAI BAA (May 2026). Azure OpenAI text endpoints are; the audio modality is not. If you build a healthcare voice agent, you either route audio through HIPAA-eligible STT/TTS plus text reasoning, or you wait.
• Below ~10k minutes/month, Vapi or Retell beats DIY. Above that, OpenAI Realtime + LiveKit Agents wins on cost and control. The cross-over math is straightforward, and we walk through it below.
Why Fora Soft wrote this playbook
Fora Soft has built voice and video products since 2005 — over 200 of our 625+ shipped projects involve real-time audio. That includes TransLinguist (real-time multilingual interpretation in production with the NHS UK), Hospital Phone Interpreter, two AI receptionist deployments under NDA, and a sales-call coach embedded inside a meeting platform.
In the last twelve months we have shipped four production voice agents on the OpenAI Realtime API and a further five on LiveKit Agents talking to OpenAI’s text models. We have also done two migrations: one off Vapi onto a self-managed OpenAI Realtime + LiveKit stack (cost driver), and one in the other direction (a healthcare client where the HIPAA gap forced a retreat to a chained pipeline). The numbers, latencies and design decisions in this article come from those engagements, not from a cookbook.
If you are scoping a voice agent in 2026, this guide tells you when gpt-realtime is the right pick, what production architecture beats the latency target, where the costs really land, and the three places where the easy answer is “use something else.”
Need a voice agent that ships in 8 weeks, not 8 months?
Send us your use case, expected call volume and compliance constraints. We will return a one-page architecture, cost forecast and 8-week plan within 48 hours, free.
What changed when gpt-realtime went GA
OpenAI’s Realtime API spent most of 2024 in preview as gpt-4o-realtime-preview. Pricing was high, latency was uneven, and the model had a habit of running over the user’s turn. In late 2025 OpenAI released gpt-realtime as the production model and added a smaller gpt-realtime-mini for cost-sensitive workloads. The 2026 status quo:
1. Pricing dropped 20 %. Audio input now $32 per million tokens (about $0.06 per minute), audio output $64 per million (~$0.24 per minute). Cached input is $0.40 per million tokens — the prompt-cache discount is roughly 80×, and using it correctly is the single biggest cost lever in production.
2. Function calling is solid. The model can call your tools mid-conversation, get results back as messages, and resume the audio response coherently. In our tests, function-call latency adds 400–800 ms when the tool itself responds in under 200 ms; tool latency above that is felt by the user as a pause.
3. Turn detection is configurable. Server-side VAD is the default. You can switch to none mode and drive turns from your client (push-to-talk style), or to a semantic VAD that uses the model’s own listening to decide when the user finished. Each has a UX consequence; we cover the choice in §8, and a configuration sketch follows this list.
4. Native WebRTC support. The Realtime API exposes both WebRTC and WebSocket transports. WebRTC is the right pick for browser and mobile clients — you skip the WebSocket buffering layer and get sub-second perceived latency. WebSocket is the right pick when your server has to mediate the connection for compliance, observability, or business logic.
5. SIP is in beta. You can connect a phone number through OpenAI’s SIP integration and get a voice agent on a PSTN line without bringing your own telephony layer. In our experience this works for low-volume use cases (small dental practice receptionist) but lacks the routing, recording and DID-pool features you need for a contact-centre product.
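To make items 2–4 concrete, here is a minimal configuration sketch in TypeScript (Node, the ws package). The flat session.update shape, the server_vad fields and the OpenAI-Beta header are how we shipped on the beta surface; OpenAI has been reshaping the GA session object, so treat the exact field names as assumptions and check the current reference. The find_appointment_slots tool is a hypothetical example.

```typescript
import WebSocket from "ws";

// Open the Realtime WebSocket and configure voice, turn detection and tools once.
const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1", // required on the beta surface; may no longer be needed post-GA
  },
});

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "marin",
      instructions: "You are the after-hours receptionist for a dental practice. Be brief.",
      turn_detection: { type: "server_vad", threshold: 0.5, silence_duration_ms: 500 },
      tools: [{
        type: "function",
        name: "find_appointment_slots", // hypothetical tool, mirrors the case study below
        description: "Return open appointment slots for a given date",
        parameters: {
          type: "object",
          properties: { date: { type: "string", description: "ISO date" } },
          required: ["date"],
        },
      }],
    },
  }));
});
```

Switching turn_detection.type to "none" gives you the push-to-talk behaviour described in item 3; semantic turn detection swaps in the same place.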
Speech-to-speech vs the chained pipeline
Voice agents historically chained three models — speech-to-text (STT) like Deepgram or Whisper, a text-only LLM, and a text-to-speech (TTS) model like ElevenLabs or Cartesia. Each hop adds latency, each hop has its own failure modes, and prosody is lost between the user’s tone and the LLM’s response.
The Realtime API replaces all three with a single speech-to-speech model. The model takes raw audio in, produces raw audio out, and reasons in between. The latency floor drops because there is no STT-finalise wait and no TTS startup delay; the prosody floor rises because the model hears emotion and produces emotion.
It is not a free trade. Speech-to-speech models are harder to debug because the “text” intermediate representation is implicit; tool-call accuracy is slightly lower than a chained pipeline using a tool-tuned text model; and you have less control over voice (the gpt-realtime voices are good but limited to OpenAI’s catalogue). When voice cloning, custom prosody, or strict tool reliability matter more than the last 200 ms of latency, the chained pipeline still wins.
Reach for speech-to-speech (gpt-realtime) when: the conversation is open-ended, prosody matters (sales, coaching, customer empathy), and your tool calls are simple lookups. Sub-second voice-to-voice is your selling point.
Reach for chained (Deepgram + GPT-4 + ElevenLabs) when: compliance demands HIPAA-eligible STT/TTS that the Realtime audio modality does not yet have, or you need a custom-cloned voice, or your tool-call reliability has to be near-deterministic.
WebRTC vs WebSocket vs SIP — pick your transport
The Realtime API has three transport modes. Each maps cleanly to a use case, and choosing wrong forces you into latency or compliance trouble that no amount of prompt engineering will fix.
| Transport | Where the connection lives | Best for | Latency cost | Server-side control |
|---|---|---|---|---|
| WebRTC | Browser/app ↔ OpenAI direct | Web/mobile UX, demos, internal tools | Lowest (no server hop) | Limited — ephemeral keys, edge tokens |
| WebSocket | Your server ↔ OpenAI | Compliance logging, multi-tool orchestration, business-policy injection | + ~80–150 ms via your server | Full — you mediate every event |
| SIP (beta) | Phone number ↔ OpenAI | PSTN voice agents (receptionist, after-hours line, simple IVR) | + telephony hop (~50–100 ms) | Limited routing & recording features |
| LiveKit/SFU + WebRTC | Browser ↔ SFU ↔ agent server ↔ OpenAI | Multi-party calls, voice in video conferences, server-side tools and audit | + SFU hop (~30 ms) & agent hop (~50 ms) | Full — agent runs in your VPC |
For most production deployments we end up on the LiveKit + WebRTC pattern. It gives the user-perceived latency of WebRTC plus the server-side control of WebSocket: the agent runs on your infrastructure, calls into OpenAI’s WebSocket, mediates tool calls, redacts PII, writes the audit log, and forwards the audio to the user via the SFU. Our LiveKit AI Agents playbook covers the SFU side end-to-end.
Reach for WebRTC direct when: the use case is a quick prototype or an internal tool, you do not need to mediate calls server-side, and you can ship ephemeral keys safely from your backend.
Reach for WebSocket via your server when: compliance, audit logging, multi-tool orchestration, or business-policy injection sits on the critical path. The 80–150 ms hop is worth it.
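For the WebRTC-direct path, the one piece that must live on your backend is the ephemeral-key mint. The sketch below assumes the POST /v1/realtime/sessions endpoint and the client_secret field as they existed when we last shipped this pattern; OpenAI has been revising the token-minting surface, so verify the path and response shape against the current reference before relying on them.

```typescript
import express from "express";

const app = express();

// Mint a short-lived client secret so the real API key never reaches the browser.
app.post("/api/realtime-token", async (_req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-realtime", voice: "marin" }),
  });
  const session = await r.json();
  // Hand back only the ephemeral secret; the browser uses it to open the
  // peer connection directly to OpenAI.
  res.json({ clientSecret: session.client_secret?.value });
});

app.listen(3000);
```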
Reference architecture — what production looks like
A production-grade voice agent has six layers: client, edge media, agent runtime, model API, tool plane, and observability. Each does one job, and the boundaries between them are where 90 % of bugs live.
Figure 1. Production-grade voice agent — client, SFU, agent runtime, OpenAI Realtime, tool plane, observability.
The six layers in detail
1. Client. Browser, mobile app, or telephony endpoint. Audio capture is 16-bit PCM at 24 kHz (the Realtime API’s native rate; sending 48 kHz wastes bandwidth and adds resampling latency). Echo cancellation is non-negotiable on the client — without it, the agent hears its own voice and barges in on itself.
2. Edge / SFU. A LiveKit, mediasoup or similar SFU sits between the client and your agent runtime. It handles fan-out (single user to multiple subscribers, e.g. for live coaching with a human supervisor listening in), TURN/STUN for NAT traversal, and isolation between rooms. SIP-PSTN bridges connect telephony into the same SFU so phone callers join an “agent room” just like browser users.
3. Agent runtime. A long-running process — LiveKit Agents, a custom Node/Python service, or Pipecat — that pulls audio off the SFU, drives the OpenAI WebSocket, dispatches tool calls, redacts PII, and writes the audit trail. This is where you put your business logic. It is also where compliance lives.
4. Model API. The OpenAI Realtime endpoint. The agent runtime opens a WebSocket, sends a session.update with prompt + tools + voice, and starts streaming audio. The model streams audio back as it generates. There is no “wait for the whole response” step.
5. Tool plane. Calendar, CRM, vector store for RAG, payment provider, internal APIs. Each tool is exposed to the model as a JSON schema; the model picks one mid-conversation and the runtime executes it. Tools must be idempotent (the model may retry), strictly typed (the model may hallucinate fields), and audited (you need the trail for compliance and debugging). A dispatch-and-audit sketch follows this list.
6. Observability plane. Helicone, LangSmith or your own stack capturing every session: full audio, full transcript, every tool call, latency p50/p95 per turn, token usage, cost per session. We have lost two days of debugging time on a production agent because nobody enabled audio capture; do not skip this.
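To show where layers 3–5 meet, here is a sketch of the agent runtime handling one tool call end to end. The event and item names (response.function_call_arguments.done, conversation.item.create, function_call_output) are taken from the Realtime event reference as we used it and should be verified; ws is the socket from the earlier configuration sketch, and the tool registry and auditLog sink are illustrative stand-ins.

```typescript
// Illustrative tool registry; in production each entry wraps a validated, idempotent call.
const tools: Record<string, (args: any) => Promise<unknown>> = {
  find_appointment_slots: async ({ date }) => ({ date, slots: ["09:00", "11:30"] }),
};

const auditLog = (entry: object) => console.log(JSON.stringify(entry)); // stand-in for your audit sink

ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type !== "response.function_call_arguments.done") return;

  const started = Date.now();
  const args = JSON.parse(event.arguments);
  const result = await tools[event.name](args); // execute the tool the model picked

  auditLog({ tool: event.name, args, result, ms: Date.now() - started });

  // Return the result as a conversation item, then ask the model to continue speaking.
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: event.call_id, output: JSON.stringify(result) },
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
});
```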
Latency budget — how to actually hit 800 ms
800 ms voice-to-voice is the threshold above which conversation feels noticeably laggy. Humans converse at roughly 200 ms response time; 800 ms is “digital but acceptable.” Above 1.2 s users start to talk over the agent. Below is the budget we use to engineer for 800 ms total.
| Stage | Typical | How to compress it |
|---|---|---|
| Mic capture & encode | 20–40 ms | Native echo cancellation, 24 kHz PCM, no resample |
| Network: client → agent | 30–80 ms | SFU close to user (regional), QUIC/WebRTC |
| Agent → OpenAI WS | 10–60 ms | Agent in same region as OpenAI endpoint (US-East) |
| VAD & turn detection | 200–400 ms | Server VAD threshold tuning OR client push-to-talk |
| Model TTFB (audio) | ~500 ms | Use cached prompt, keep system instructions tight |
| Audio → client & render | 50–100 ms | Stream audio chunks, do not wait for response.done |
| Total target | ~800 ms | Achievable with discipline; long sessions drift higher |
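To know whether a deployment is inside this budget, we instrument the turn gap directly rather than trusting averages. A minimal sketch, assuming the input_audio_buffer.speech_stopped and response.audio.delta event names from the Realtime reference hold on your API version, with ws the socket from the earlier sketches:

```typescript
// Measure per-turn latency: end of user speech to first reply audio chunk.
let speechStoppedAt: number | null = null;
const turnLatenciesMs: number[] = [];

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === "input_audio_buffer.speech_stopped") {
    speechStoppedAt = Date.now();
  }
  if (event.type === "response.audio.delta" && speechStoppedAt !== null) {
    turnLatenciesMs.push(Date.now() - speechStoppedAt); // one sample per turn
    speechStoppedAt = null;
  }
});

// p50/p95 over the session, the two numbers we chart per deployment.
const percentile = (xs: number[], p: number) =>
  [...xs].sort((a, b) => a - b)[Math.floor((xs.length - 1) * p)];
```

This captures only the API-side portion of the budget; add client capture and render timestamps to get the full voice-to-voice figure.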
Two latency traps deserve to be called out separately.
Long-session drift. Reports from the OpenAI Developer Forum and our own production data show median turn latency climbing from ~800 ms early in a session to over 2 s after 20+ turns. The fix is conversation pruning — rotate sessions every 8–12 turns by reseeding context into a fresh session, or use server-side conversation-item deletion to keep the active context bounded.
Tool-call latency. A tool call adds the round-trip latency of your tool. If your CRM lookup takes 800 ms, the user hears a pause. Front-load common lookups (greet the user with their name already known), parallelise multiple tool calls when the model issues them in batch, and consider speculative tool calls (“the next likely lookup”) for the second turn while the first response is still being spoken.
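A minimal pruning sketch for the first trap, assuming the conversation.item.created and conversation.item.delete events behave as in the Realtime reference we built against; the window size and the decision to summarise older turns are yours to tune:

```typescript
const MAX_ITEMS = 16; // roughly 8 user/assistant turns; tune per use case
const itemIds: string[] = [];

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type !== "conversation.item.created") return;

  itemIds.push(event.item.id);
  // Keep the active context bounded by deleting the oldest items server-side.
  // In production we first fold the deleted turns into a short summary item.
  while (itemIds.length > MAX_ITEMS) {
    const oldest = itemIds.shift()!;
    ws.send(JSON.stringify({ type: "conversation.item.delete", item_id: oldest }));
  }
});
```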
Latency drifting in your voice agent prototype?
We will run a 1-week instrumented latency audit, return a ranked fix list, and (if it makes sense) ship a prototype patch in week 2.
Cost model — the real $/minute, not the headline
Headline pricing is misleading; the actual cost depends on conversation pattern, prompt-cache hit rate, and tool latency. Here is the real math, with a worked example from a deployment we shipped this year.
The published rates (gpt-realtime, May 2026): audio input $32 per million tokens, cached audio input $0.40 per million, audio output $64 per million. For planning purposes, that works out to roughly $0.06 per minute of input audio and $0.24 per minute of output audio at a typical speaking rate, so a typical 1-minute audio exchange costs about $0.30 before any caching.
Worked example: AI receptionist for a dental practice. 800 calls/month, average 4 minutes per call, 60 % talk time on the agent side, 40 % on the user side. That gives roughly 800 × 4 = 3 200 minutes of conversation. Audio input cost: 3 200 × 0.4 × $0.06 = $77. Audio output cost: 3 200 × 0.6 × $0.24 = $461. Without caching: $538/month. With prompt caching applied to the system instructions and the tool schemas (cached at 80× less): the input cost drops by roughly 90 % to $8, total $469. Add ~$30 for telephony (Twilio or OpenAI SIP), $50 for the LiveKit/SFU layer if you self-host, and you land around $550–$600 per month all-in.
Per-call economics: at $550 / 800 calls = $0.69 per call. For a dental practice paying a human receptionist roughly $20–25/hour to cover the same after-hours slot, the agent pays for itself if it handles even 20 % of out-of-hours volume.
Cross-over math vs Vapi/Retell. Vapi’s headline is $0.05–$0.11 per minute platform fee, but the all-in cost (with their preferred LLM, TTS, STT bundle) lands at $0.20–$0.35 per minute, similar to ours. Retell is more transparent at $0.07/min platform plus $0.006–$0.06 LLM. Below ~10 000 minutes/month the platform fees + simpler operations make Vapi/Retell cheaper. Above ~10 000 minutes/month, an OpenAI Realtime + LiveKit Agents stack beats both because you eliminate the platform layer entirely — the underlying token cost is the same, but you keep the platform margin.
Interruption handling, barge-in and turn detection
A voice agent that does not handle interruptions feels robotic in 30 seconds and unbearable in three minutes. Three concepts you have to engineer correctly: voice activity detection (VAD), turn detection, and barge-in cancellation.
1. Voice activity detection. The base layer — is the user speaking right now? The Realtime API offers server-side VAD with a configurable threshold. Lower the threshold and false positives spike (the agent thinks the user is talking when they cleared their throat). Raise it and the agent waits too long before responding. We tune around 0.5 in our deployments and instrument the false-positive rate via the audit log.
2. Turn detection. Has the user finished their turn, or did they just pause? Naive VAD with a 500 ms silence window cuts off slow speakers; a 1.5 s silence window feels laggy to fast speakers. Semantic turn detection — using the model’s own listening to predict when the user is done — is more robust but adds 100–200 ms latency. AssemblyAI, LiveKit and OpenAI all now offer semantic turn detection; we use it on customer-facing deployments and tolerate the latency hit.
3. Barge-in cancellation. When the user starts speaking while the agent is mid-sentence, the agent must stop talking immediately, discard the in-flight audio buffer, and listen. The Realtime API supports a response.cancel event that does this server-side; the client must also stop playing the buffered audio it received over the wire. Failing to clear the client buffer is the most common interruption bug we fix on prototypes.
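A minimal barge-in sketch on the agent-runtime side, assuming the input_audio_buffer.speech_started and response.cancel event names from the Realtime reference; flushClientPlayback is a hypothetical hook into whatever channel you use to tell the client to drop its buffered audio:

```typescript
ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type !== "input_audio_buffer.speech_started") return;

  // 1. Stop generation server-side; the model abandons the in-flight response.
  ws.send(JSON.stringify({ type: "response.cancel" }));

  // 2. The client has typically buffered several hundred ms of audio it has not
  //    played yet. If you do not flush it, the agent keeps "talking" after the cancel.
  flushClientPlayback();
});

function flushClientPlayback() {
  // e.g. a control message over your SFU data channel or client WebSocket
}
```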
Tool calling and side-effect orchestration
A voice agent that cannot do anything but talk is a demo. A production agent calls tools — book appointments, look up customers, run charges, query knowledge bases. Three rules from production:
1. Tools are deterministic, idempotent, and timed. The model may retry a tool. The model may call the same tool twice in a row by accident. The model may call your tool with a slightly malformed argument. All three happen in production every week. Your tool implementation must validate every argument, return well-defined errors, and complete in under 800 ms or stream a “working on it” status the agent can speak. A validation sketch follows this list.
2. Pre-load common context. If the user is logged in, the agent already knows their name, account ID and recent activity. Inject that into the system instructions at session start; do not make the model call a get_user_info tool on every greeting. This single optimisation cut latency by 600 ms on a sales coach we shipped.
3. Audit every tool call. Compliance, debugging, and product analytics all need the trail. Log argument JSON, response, latency, and the model’s decision context. Helicone or LangSmith make this nearly free; rolling your own takes a sprint and is worth it for HIPAA workloads.
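A sketch of what rule 1 looks like in practice, using zod for the schema layer and a separate business-policy check before execution. The argument names, thresholds and tenant convention are illustrative, not a library API:

```typescript
import { z } from "zod";

const BookAppointmentArgs = z.object({
  patientId: z.string().min(1),
  slot: z.string().datetime(), // ISO 8601
});

// Schema-valid JSON can still be semantically wrong; check business rules too.
function policyCheck(args: z.infer<typeof BookAppointmentArgs>, tenantId: string) {
  if (new Date(args.slot) < new Date()) throw new Error("slot is in the past");
  if (!args.patientId.startsWith(tenantId)) throw new Error("patient belongs to another tenant");
}

async function bookAppointment(rawArgs: unknown, tenantId: string) {
  const args = BookAppointmentArgs.parse(rawArgs); // reject malformed arguments from the model
  policyCheck(args, tenantId);                     // reject semantically wrong ones
  // Derive an idempotency key from the arguments so a model retry cannot double-book:
  // await crm.book({ idempotencyKey: hash(JSON.stringify(args)), ...args });
}
```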
The HIPAA catch nobody mentions
As of May 2026, the OpenAI Realtime API audio modality is not covered under OpenAI’s or Microsoft Azure’s standard Business Associate Agreement (BAA). The text-based Azure OpenAI service is HIPAA-eligible. The audio in / audio out path is not. Per Microsoft’s own guidance: “The Realtime API audio modality is currently not on the HIPAA-eligible service list. Do not transmit or process PHI through it until it is formally added.”
If you are building a healthcare voice agent, you have three options:
1. Hybrid pipeline. Run audio through a HIPAA-eligible STT (Azure Speech with BAA, Google Cloud Speech-to-Text with BAA, or AWS Transcribe Medical), pass redacted text to GPT-4 (Azure with BAA), then back to a HIPAA-eligible TTS (Azure Speech, Google TTS, AWS Polly). You lose the speech-to-speech latency benefit but you keep BAA coverage. This is the architecture we used for our 2025 telemedicine agent.
2. Self-hosted everything. Run a self-hosted STT/TTS pair inside your VPC (Whisper.cpp for transcription, Coqui or XTTS for voice) plus an Azure-OpenAI-on-private-endpoint LLM, so raw audio never leaves infrastructure you control. End-to-end latency is higher (1.2–1.8 s typical) but you control every byte and the BAA chain is intact.
3. Wait, with a non-PHI agent in the meantime. If your use case is appointment scheduling for a clinic but the conversation never references diagnoses or symptoms, you can launch with the Realtime API and update your privacy notice to make clear PHI is not captured. We have done this for two dental practices — the agent books and reschedules appointments without ever discussing medical content.
For our telehealth clients we now default to the hybrid pipeline. Our HIPAA-compliant video platform guide covers the BAA architecture in depth, and the same patterns apply to voice agents.
OpenAI Realtime vs Vapi vs Retell vs LiveKit Agents
In 2026 there are four serious paths to a voice agent. Here is how they actually compare on cost, control, and time to ship.
| Option | All-in cost | Latency | Time to ship | When to pick it |
|---|---|---|---|---|
| OpenAI Realtime + LiveKit | ~$0.30/min | 800 ms | 6–10 weeks | >10k mins/mo, want full control, video+voice |
| Vapi | $0.20–$0.35/min | 900–1100 ms | 2–4 weeks | Validating quickly, <10k mins/mo |
| Retell | $0.13–$0.18/min | 900–1100 ms | 2–3 weeks | Transparent pricing, telephony-first |
| Pipecat (chained, self-host) | ~$0.15/min | 1.0–1.5 s | 8–12 weeks | Custom voice cloning, HIPAA, multi-vendor LLM |
A practical heuristic: if you are below 10 k minutes/month and validating product-market fit, ship on Vapi or Retell. Move to OpenAI Realtime + LiveKit when (a) you cross volume, (b) you need control over the SFU layer (video + voice in one room), or (c) you want to embed the voice agent inside a larger platform you already own.
Mini case — an AI receptionist in 8 weeks
A US-based dental group with seven practices came to us in late 2025 wanting to replace their after-hours answering service. The service was costing $4 800/month, missing roughly 22 % of calls during peak holiday windows, and producing zero structured data — appointments were captured on paper and rekeyed in the morning.
The 8-week build. Weeks 1–2: discovery, conversation flow design, integration mapping (NexHealth booking system, Twilio for the phone number). Weeks 3–4: agent runtime on LiveKit Agents calling the OpenAI Realtime WebSocket, three tools (find_appointment_slots, book_appointment, escalate_to_human). Weeks 5–6: prompt iteration, accent and edge-case testing with synthetic call data. Weeks 7–8: shadow deployment behind the existing service, then full cutover with a 4-hour business-hours fallback.
Outcome at 90 days. 1 100 calls/month handled, 86 % containment rate (calls completed without human handoff), 3.4-minute average call. Per-call cost dropped from ~$4 (answering service amortised) to $0.71. Patient satisfaction score (post-call SMS survey) rose from 3.7/5 (legacy service) to 4.4/5 (agent). The dental group is now rolling out to two additional practices and has asked us to add a follow-up call agent for outstanding insurance balances. Want a similar build? Book a 30-minute scoping call and we will sketch the equivalent for your industry.
A decision framework — pick gpt-realtime in five questions
Q1. Does the conversation involve PHI or other regulated data? If yes, the Realtime audio modality is currently off the table for the audio path. Use the chained pipeline with HIPAA-eligible STT/TTS, or scope the agent so it never touches PHI. Do not pretend the BAA gap does not exist.
Q2. What volume are you targeting in 12 months? Below 10 k minutes/month, Vapi or Retell ship faster and the cost difference does not matter. Above 10 k, OpenAI Realtime + LiveKit pays back the additional integration time within a quarter.
Q3. Where is the user? Browser, mobile app, or phone? Browser/app → WebRTC transport. Phone → SIP (or Twilio + WebRTC). The transport choice cascades through the rest of the architecture.
Q4. How tightly does voice need to integrate with video? If the agent is part of a video meeting (sales coach, AI moderator, lecture assistant), you almost certainly want LiveKit or mediasoup as the SFU and the agent as a participant in the room. If the agent is voice-only on a phone line, a thinner stack is fine.
Q5. Custom voice cloning? If yes, the chained pipeline with ElevenLabs or Cartesia wins because the Realtime API does not yet expose voice cloning. If you can live with one of OpenAI’s catalog voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer, Marin, Cedar), Realtime stays the best choice.
Pitfalls to avoid
1. Skipping echo cancellation on the client. Without native AEC, the agent hears its own audio coming back through the user’s mic, treats it as user speech, and barges in on itself. Always use the platform AEC (WebRTC’s default in browsers, the platform AudioSession on iOS/Android). We have spent two days debugging “the agent talks over itself” only to find AEC was disabled in the audio constraints. A capture-constraints sketch follows this list.
2. Letting the conversation context grow unbounded. Long sessions slow down (latency drift, §6) and cost more (each turn re-includes prior context). Rotate sessions, prune conversation items, or summarise old turns into a compressed system message every N turns.
3. Validating tool arguments only at the schema layer. The model can produce technically valid JSON arguments that are semantically wrong. Validate at the business-logic layer too — appointment dates in the past, payment amounts above a sanity threshold, customer IDs from the wrong tenant. We add a pre-execution policy check on every tool that touches money or PII.
4. Forgetting to redact in the audit log. Capturing every transcript and every audio file is great for debugging. It is also a GDPR/HIPAA breach if those logs include PII or PHI and your retention policy is “forever.” Set a 30-day retention by default; redact emails, phone numbers, card numbers, and SSNs at write time.
5. Believing the demo is the production. A 5-minute demo on a corporate fibre line at 9am Monday looks great. The same agent at 7pm Friday on a customer’s 4G connection in a noisy car is a different product. Test with synthetic calls under packet loss, jitter, and background noise before launch.
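For pitfall 1, the fix starts in the capture constraints. A browser-side sketch; the constraint names are standard WebRTC, and the 24 kHz sample-rate hint is honoured on a best-effort basis by most browsers:

```typescript
// Ask for echo cancellation explicitly instead of trusting defaults.
async function captureMic(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true, // without this the agent hears itself and barges in
      noiseSuppression: true,
      autoGainControl: true,
      sampleRate: 24000,      // Realtime API native rate; avoids a resample hop
      channelCount: 1,
    },
  });
}
```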
KPIs to measure
Quality KPIs. Voice-to-voice latency p50 (target: < 800 ms) and p95 (target: < 1.4 s). Word error rate on the agent’s STT (target: < 8 % on clean audio, < 15 % on noisy). Tool-call success rate (target: > 96 %). Hallucination flags — the rate at which the agent invents facts not in its tool responses (target: < 1 %, instrumented via a periodic sample).
Business KPIs. Containment rate — percentage of calls completed without human handoff (target: 75 %+ for routine support, 85 %+ for appointment booking). Average handle time. Cost per containment (cost/min divided by containment rate). Customer satisfaction score from the post-call survey.
Reliability KPIs. Session establishment success rate (target: 99.5 %). Mean time to recovery on transient OpenAI errors (target: < 2 s, measured as the gap between error and the agent picking up the conversation). Audit log delivery success rate (target: 100 %; missing logs are a compliance failure regardless of customer impact).
When NOT to use OpenAI Realtime
Strict cost ceiling under $0.10/min. Even with caching, all-in production cost lands around $0.25–$0.35/min on Realtime. If your unit economics demand < $0.10/min, the chained pipeline with Whisper (self-hosted), GPT-4o-mini text, and a cheaper TTS (Cartesia, OpenVoice) is the only path. You give up the speech-to-speech latency and prosody, but the model cost halves.
Multi-vendor strategy is a hard requirement. If procurement requires you to keep options open between OpenAI, Anthropic, Google, and others, do not lock the audio path into Realtime — the chained pipeline lets you swap LLMs without rewriting the audio surface. Use Pipecat or LiveKit Agents with a model-agnostic LLM call.
HIPAA-grade healthcare audio. Already covered above — the audio modality is not on the BAA list as of May 2026. If you are sure your conversation never touches PHI you can ship; otherwise, defer to the hybrid pipeline.
Fully on-prem deployment. The Realtime API is a hosted service. If you must run the LLM inside your own data centre (defence, banking, certain government tenants), you cannot use Realtime; consider a self-hosted Llama/Qwen with a chained pipeline.
Voice agent vendor in the wrong tier for your scale?
We will run the cross-over math for your call profile and tell you straight whether to stay on Vapi/Retell or migrate to OpenAI Realtime + LiveKit. No follow-up sales call unless you want one.
FAQ
Is the OpenAI Realtime API HIPAA-compliant?
As of May 2026, no. Microsoft and OpenAI’s Business Associate Agreements cover Azure OpenAI text endpoints, but the Realtime API audio modality is explicitly not on the HIPAA-eligible list. If you are building a healthcare voice agent, use a chained pipeline with HIPAA-eligible STT and TTS (Azure Speech, Google Cloud STT/TTS, AWS Transcribe Medical) plus the text LLM under BAA, or scope your agent to never touch PHI.
What is the actual cost per minute in production?
Headline pricing is $32/M input + $64/M output audio tokens, which works out to roughly $0.30/min for a typical 60/40 agent/user split. Apply prompt caching to your system instructions and tool schemas and the input portion drops by ~90 %. Add infrastructure (LiveKit/SFU, telephony if applicable) and you typically land between $0.25 and $0.35 per minute all-in, depending on how chatty the agent is.
Can I use OpenAI Realtime with my own SIP/PSTN setup?
Yes. Two paths. Path A: OpenAI’s native SIP integration (in beta) connects a phone number directly. Good for low-volume, simple voice agents; lacks routing, recording, DID-pool features. Path B: Twilio or Telnyx terminates the PSTN call into your LiveKit room, and the agent participates as a normal WebRTC client. This is the production pattern for any contact-centre-class workload.
How do I keep latency low in long sessions?
The Realtime API exhibits progressive latency drift in long-running sessions — documented in OpenAI’s developer forum and confirmed in our production data. Two mitigations: (1) rotate sessions every 8–12 turns by reseeding context into a fresh session; (2) prune conversation history server-side via the conversation.item.delete event, keeping only the last few turns plus a summary of older ones.
When should I pick Vapi or Retell instead?
When you are below 10 000 minutes per month, when your team is under three engineers, when you need to ship in 2–3 weeks, or when your voice agent is incidental to your core product. The platform fee buys you orchestration, telephony, monitoring, prompt-flow tooling and 24/7 ops — none of which is free to build yourself. Above 10 k mins/month or with custom integration needs, the OpenAI Realtime + LiveKit path catches up and pulls ahead.
How do I handle interruptions properly?
Three things. First, enable native echo cancellation on the client — without it, the agent hears its own voice and barges in on itself. Second, send response.cancel the moment you detect user speech, so the model stops generating. Third, clear the client’s buffered audio — the model has often streamed several hundred ms ahead of what the user has heard, and that buffer needs to be discarded immediately.
Does it support custom voice cloning?
Not as of May 2026. The Realtime API exposes a fixed catalogue of voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer, Marin, Cedar). If voice cloning is a hard requirement, use the chained pipeline with ElevenLabs (best quality), Cartesia (best latency) or self-hosted XTTS (no per-minute fees but quality trade-offs).
How long does a production build typically take?
For a focused use case (AI receptionist, sales coach, appointment booking), 6–10 weeks with one senior engineer + one mid-level engineer + part-time PM. Discovery + design is weeks 1–2; agent runtime + first end-to-end call is weeks 3–4; iteration on tone, edge cases, tool reliability is weeks 5–7; shadow deployment + cutover is weeks 8–10. Healthcare with the chained pipeline adds 2–3 weeks for the BAA-eligible audio plumbing.
What to Read Next
SDK
LiveKit AI Agents Playbook
The companion guide for the SFU layer that pairs with the OpenAI Realtime model.
Voice AI
Voice AI Agents on LiveKit
Architecture patterns for embedding voice agents inside a video meeting platform.
Architecture
How AI Agents Work With WebRTC
The transport layer pattern that underlies every modern voice-AI stack.
Compliance
HIPAA-Compliant Video Platforms
The BAA architecture that wraps your voice agent for healthcare deployments.
Voice AI
AI Call Assistants API Guide
Sister article on call-assistant patterns and the alternative API surfaces.
Ready to ship a voice agent in 8 weeks?
gpt-realtime is the right pick when latency and prosody matter, your volume justifies the operations layer, and your data does not include PHI. Below 10 k minutes/month or with HIPAA in scope, the answer changes — sometimes Vapi or Retell, sometimes a chained pipeline with HIPAA-eligible STT/TTS. The five-question framework above will tell you where you sit.
The architecture is well understood at this point: client, SFU, agent runtime, model API, tool plane, observability. Most of the engineering risk is in the boundaries between them — barge-in cancellation, tool latency, session drift — and we have walked through each above. What remains is execution.
Want a voice agent that lands in 8 weeks instead of 8 months?
Send us your call profile, compliance constraints, and target metrics. We will return a one-page architecture, fixed-fee scope, and an 8-week plan within 48 hours, free.


