AI Call Assistants: A Practical Guide to Third-Party APIs for Business Software

An AI call assistant is a voice agent that answers or places phone calls, listens over a telephony line, runs a short language-model turn, and speaks back — all in under a second. In 2026 the market has consolidated around a few serious platforms, sub-$0.25/minute economics, and a compliance regime that actually bites (FCC TCPA rules on AI voices, EU AI Act Article 50 disclosure obligations live August 2, 2026). This guide is a buyer’s playbook: what to pick, what to wire, what breaks in production, what it costs, and where to put a human in the loop.

Fora Soft has shipped voice assistants and chatbots for LMS, fintech, healthcare, and telecom since the first GPT-3.5 wave. We’ve integrated Deepgram, ElevenLabs, OpenAI, Dialogflow, LiveKit, Twilio, and Azure Communication Services into production software, and we built our delivery playbook around the pitfalls the blog posts never mention: echo on PSTN, hallucinated bookings, STIR/SHAKEN attestation, and the 800 ms voice-to-voice latency budget you blow as soon as you add one misplaced tool call. Read this top to bottom and you’ll know which API to shortlist, how to architect it, and where a specialist engineering team still pays for itself.

Key takeaways

The stack is modular, not monolithic. Telephony, STT, LLM, TTS, and orchestration are separate layers — the “platform” you pick mostly decides which of them are locked in.

Latency, not quality, is the buying criterion in 2026. Voice-to-voice must land under 800 ms for a natural feel. Everything else is tuning.

All-in cost is $0.13–$0.33 per minute. Vendor “$0.05/min” orchestration prices exclude STT, LLM, and TTS. Plan your math on $0.20/min.

Compliance is the hidden rewrite. FCC Feb 2024 TCPA ruling makes AI-generated voices illegal in unconsented robocalls; EU AI Act Article 50 disclosure is binding Aug 2, 2026; HIPAA and two-party-consent states add regional layers.

Pick the platform last. Start from use case, volume, languages, and compliance envelope — then Vapi, Retell, Deepgram Voice Agent, LiveKit + bring-your-own-stack, or an enterprise path (Twilio, Azure, Google CCaaS) falls out naturally.

Why Fora Soft wrote this playbook

We’ve shipped voice AI in three environments where the margin for error is small: tele-health consultations (HIPAA-audited), financial-service call centers (two-party-consent states), and multilingual support desks (RU/EN/DE code-switching, often on the same call). Across those engagements we kept a running log of what actually breaks in production — barge-in failures on G.711, DTMF lost across transcoders, LLM tool calls blocking TTS, and European regulators asking for the disclosure transcript six months after a deployment.

This guide distills that log. It assumes you already know your business case (inbound support, outbound qualification, IVR replacement, appointment scheduling, or agent-assist copilots) and want a technical buyer’s map to pick a stack that will still compile in 18 months. If you’d rather skip the comparison and talk architecture, book a 30-minute review with our AI voice lead and bring a call recording.

Need a neutral second opinion on Vapi, Retell, or a build-your-own stack?

We’ll review your call flow, latency budget, and compliance envelope in 30 minutes and tell you which stack fits — no sales pitch.

Book a 30-min call →

What an AI call assistant actually is in 2026

An AI call assistant is a real-time conversational agent that terminates or originates PSTN/SIP audio, streams it through speech-to-text, feeds the rolling transcript into an LLM with tools, renders the LLM reply via text-to-speech, and returns it to the caller, all under roughly 800 ms end-to-end. The newer generation (OpenAI Realtime, Gemini Live, Azure Voice Live) collapses STT, reasoning, and TTS into a single speech-to-speech model with around 500 ms time-to-first-byte from US endpoints, shifting latency from orchestration code into the model provider.

In practice buyers are looking at three product shapes. Inbound answering agents deflect support volume and do warm transfer to humans for escalation. Outbound campaign agents qualify leads, book meetings, or run collections — a category the FCC reshaped with the February 2024 TCPA ruling that classified AI-generated voices as “artificial or prerecorded” and made them illegal for unconsented robocalls. Agent-assist copilots listen on human agent calls and whisper suggestions — the lowest-risk use case and a common first deployment.

If you’re only familiar with the 2023 IVR-bot category, the delta is dramatic: turn latency is down 5–10×, ASR word error rate on noisy phone audio is under 6% with Deepgram Nova-3, and emotion-aware TTS (ElevenLabs v3, Hume EVI) cleared the uncanny valley for short interactions. Buyers who built on 2023 stacks and are lifting into 2026 typically rip out half of the orchestration code as a side effect of the upgrade.

The market snapshot: size, growth, and real deployments

Precedence Research puts the call-center AI market at $3.23B in 2024, $3.98B in 2025, heading for $25.84B by 2034 at a 23.11% CAGR. Gartner projects conversational AI in contact centers will save $80B in labor expense by 2026. These are big-number forecasts, so they’re most useful as a sanity check that you’re not betting on a dying category — which you’re not.

The number we point buyers at instead is Klarna’s early 2024 disclosure: their AI assistant handled 2.3 million customer service chats in its first month, about two-thirds of volume, with resolution time dropping from 11 minutes to under 2 and a 25% reduction in repeat inquiries. Klarna said the assistant was doing the equivalent work of roughly 700 full-time agents and estimated a $40 M profit improvement for 2024. That’s the deployment shape we see working for mid-market and enterprise buyers, and that’s the reason every shortlist we build in 2026 starts with “can this stack handle 70% of call volume at equal-or-better CSAT.”

On the voice side specifically, the “platform” layer (Vapi, Retell, Bland, Synthflow, ElevenLabs Conversational AI) has taken the template-and-prototype segment. Model-provider-native voice (OpenAI Realtime, Gemini Live, Azure Voice Live) is gaining share for teams that already run on those clouds. And the enterprise CCaaS stack (Twilio + Autopilot, Amazon Connect + Lex, Google Dialogflow CX + Gemini, Microsoft Azure Communication Services) still owns regulated industries where an SLA and a procurement path matter more than the last 150 ms of latency.

The API shortlist for 2026

Twelve APIs and platforms matter in 2026. Below is the shortlist we use when scoping a build.

Voice-agent platforms (orchestration + telephony bundled)

1. Vapi. Orchestration-focused platform with a visual flow builder, bring-your-own STT/LLM/TTS, and strong barge-in. Headline $0.05/min orchestration; all-in $0.13–$0.31/min once you add components. Good default if you want flexibility and are comfortable wiring your own components.

2. Retell AI. Tighter turnkey bundle at around $0.07/min all-in ($0.05 enterprise), with a first-party LLM routing layer and strong templates for outbound. If Vapi feels too DIY, Retell is the next step up.

3. Bland AI. Bundled platform at around $0.09/min plus monthly minimums, built for US outbound at volume. Strong call-queue management; weaker European voices.

4. Synthflow. No-code builder that appeals to non-engineers. We use it for internal tools and one-off pilots; not our pick for production call volume.

5. ElevenLabs Conversational AI. End-to-end product wrapped around ElevenLabs’ voices. Pricing $0.08–$0.24/min depending on voice tier. Pick it when voice quality is the single most important feature (brand voice, IVR replacement for consumer luxury).

Model-provider speech-to-speech APIs

6. OpenAI Realtime API. GPT-4o / GPT-5 voice with ~500 ms time-to-first-byte from US; median turn latency around 2.2 s across 30-turn calls in independent benchmarks. Billed by audio-token; no bundled telephony — you wire LiveKit or Twilio Media Streams.

7. Google Gemini Live. Multimodal speech-to-speech on Google Cloud. Native in Dialogflow CX for CCaaS scenarios; strong multilingual.

8. Deepgram Voice Agent API. Bundled ASR + LLM + TTS on Deepgram’s own Nova-3 (5.26% batch WER) and Aura TTS — no pass-through surprises. The enterprise favorite when you need predictable billing.

9. Azure Voice Live. Microsoft’s speech-to-speech offering, tightly wired into Azure Communication Services and Azure OpenAI. Pick it if you’re on Microsoft procurement or need Teams integration.

Enterprise contact-center platforms

10. Twilio (Voice + Autopilot / Conversational Intelligence). Full telephony stack with mature SIP, STIR/SHAKEN, and carrier SLAs. Effective all-in cost around $0.141/min for AI-agent flows on 10k-minute workloads.

11. Amazon Connect + Lex + Bedrock. AWS-native CCaaS; pay-per-minute PSTN plus service fee. Strong fit for teams already on AWS with regulated data.

12. PolyAI and Cognigy. Enterprise-only conversational platforms with in-house design teams. When procurement wants a single vendor who carries the SLA end-to-end.

Reach for a platform (Vapi / Retell) when: you want to ship in 4–8 weeks, you’re comfortable on a single integration seam, and call volume is under 1 M minutes/month.

Reach for model-native speech-to-speech (OpenAI / Gemini / Azure Voice Live) when: latency is the single most important metric, you want fewer network hops, and you already run on the provider’s cloud.

Reach for an enterprise CCaaS (Twilio / Amazon Connect / Azure ACS) when: you’re in a regulated industry, you need carrier SLAs and STIR/SHAKEN attestation, and procurement won’t sign a startup contract.

Reach for build-your-own (LiveKit + Deepgram + OpenAI + ElevenLabs) when: none of the platforms fit your latency budget or custom tooling surface — typically healthcare, defense, financial, or anything with an on-prem requirement.

Comparison matrix — what you actually pay and ship

All figures below are the all-in effective rate we see in deployments (telephony + STT + LLM + TTS + orchestration), not the vendor’s headline rate. Voice-to-voice latency is the median on a warm US connection.

Platform                 | All-in $/min             | Latency      | Best for                               | Watchouts
Vapi                     | $0.13–$0.31              | ~900 ms      | BYO stacks, rapid prototyping          | Price varies by components
Retell AI                | $0.07 (enterprise $0.05) | ~800 ms      | Outbound, tight bundled flow           | Less flexible than Vapi
ElevenLabs Conv. AI      | $0.08–$0.24              | ~850 ms      | Brand voice, consumer IVR              | Voice cost at premium tier
OpenAI Realtime          | $0.15–$0.40              | ~500 ms TTFB | Low latency, tool-call heavy           | Need BYO telephony
Deepgram Voice Agent     | $0.12–$0.22              | ~700 ms      | Predictable billing, high-accuracy STT | TTS voice library smaller
Twilio Voice + Autopilot | ~$0.141                  | ~1,100 ms    | Carrier SLA, STIR/SHAKEN               | Slower iteration loop
BYO LiveKit stack        | $0.15–$0.25              | ~600 ms      | On-prem, custom tooling, HIPAA         | Highest engineering cost

The reference architecture (six layers, one latency budget)

Every production AI call assistant we’ve shipped follows the same pipeline, regardless of vendor:

Caller PSTN/SIP → Tier-1 carrier (Twilio / Telnyx / Vonage / SignalWire)
                → Media server (LiveKit Cloud / FreeSWITCH / Asterisk)
                → STT stream (Deepgram Nova-3 / Whisper-v3 / AssemblyAI Universal-2)
                → LLM turn (GPT-5 / Gemini 2.5 / Claude 4.5 + tool-calling)
                → TTS stream (ElevenLabs v3 / Cartesia Sonic / OpenAI TTS / Aura-2)
                → Back to PSTN/SIP

Latency budget (voice-to-voice target = 800 ms):
  STT first-partial          120 ms
  LLM turn (with tools)      450 ms
  TTS first chunk            130 ms
  Network / media server     100 ms
                             =====
                             ~800 ms
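
To sanity-check a call flow against this budget, the arithmetic can be scripted. The sketch below is illustrative only: `voice_to_voice_ms` is a hypothetical helper, the STT, TTS, and network figures come from the budget above, and splitting the 450 ms LLM stage into a 300 ms base turn plus 150 ms per sequential tool call is our assumption.

```python
TARGET_MS = 800  # voice-to-voice target from the budget above

# Fixed stages, taken from the budget above.
STAGES_MS = {"stt_first_partial": 120, "tts_first_chunk": 130, "network_media": 100}

# Assumption: the 450 ms LLM stage is a 300 ms base turn plus one 150 ms tool call.
LLM_BASE_MS = 300
TOOL_CALL_MS = 150


def voice_to_voice_ms(tool_calls: int) -> int:
    """Estimate voice-to-voice latency for a turn with sequential tool calls."""
    llm_ms = LLM_BASE_MS + tool_calls * TOOL_CALL_MS
    return sum(STAGES_MS.values()) + llm_ms


for n in range(3):
    total = voice_to_voice_ms(n)
    status = "within" if total <= TARGET_MS else "over"
    print(f"{n} tool call(s): {total} ms ({status} the {TARGET_MS} ms target)")
```

Under these assumptions a single tool call lands exactly on the target and a second one overshoots it by 150 ms.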

Two design choices dominate the architecture. First, streaming everywhere: don’t buffer to a complete utterance before sending; ship STT partials into the LLM and ship LLM tokens into TTS as they arrive. Second, one media server: don’t cross two WebRTC peers or transcode twice, because every extra hop adds 40–80 ms and a chance for audio artifacts.
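
As a rough illustration of shipping LLM tokens into TTS as they arrive, the sketch below groups a token stream into clause-sized chunks that can be handed to a streaming TTS as soon as a punctuation boundary shows up. `chunk_for_tts`, the boundary set, and the `min_chars` threshold are all hypothetical; hosted platforms and SDKs do their own segmentation.

```python
from typing import Iterable, Iterator

BOUNDARIES = (".", "!", "?", ",", ";", ":")  # assumed clause boundaries


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable chunks as soon as a clause boundary is seen.

    Chunks shorter than `min_chars` are held back so the TTS is not fed
    tiny fragments such as "Sure," on their own.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if text.endswith(BOUNDARIES) and len(text) >= min_chars:
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush the remainder when the LLM stream ends


llm_tokens = ["Sure", ",", " I", " can", " move", " that", ".",
              " Does", " Tuesday", " at", " 3", " pm", " work", "?"]
for chunk in chunk_for_tts(llm_tokens):
    print(chunk)
```

The first chunk is available to the TTS before the LLM has produced the second sentence, which is where the latency saving comes from.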

If you want a reference implementation that handles SIP trunking, jitter buffers, and barge-in out of the box, our team has been shipping on LiveKit since 2024 — and you can read the architecture detail in our Building Multimodal AI Agents with LiveKit guide. It covers the same pipeline for agents that also handle video and screen share.

The latency budget — where 800 milliseconds goes

1. Speech-to-text (~120 ms). With Deepgram Nova-3 streaming, first partials arrive in about 100–140 ms. Whisper-v3 is slower (200–300 ms) but more accurate on clean speech. Nova-3 multilingual handles code-switching across ten languages on the same call, a real requirement for European and APAC deployments.

2. LLM turn (~450 ms). Most of your budget. Single-turn prompts with no tool call come back in 250–400 ms from GPT-4o or Gemini 2.5. Each tool call adds 150–300 ms. Two tool calls blow the budget, so design for one, or pre-fetch context before the user finishes speaking.

3. Text-to-speech (~130 ms). ElevenLabs v3 streaming first chunk lands in 120–160 ms; Cartesia Sonic in 90–120. OpenAI TTS is around 200. You can squeeze another 50 ms by preloading the first word before the LLM finishes.

4. Network, media server, jitter buffer (~100 ms). The tax you pay for real-time audio. LiveKit Cloud runs under 100 ms in most regions; a self-hosted FreeSWITCH on the same VPC as your application under 60.

5. Barge-in detection. Not part of the response-time budget, but the difference between a natural call and a robot. You need VAD (voice-activity detection) running on the inbound audio that can cut the TTS mid-sentence when the caller starts talking. LiveKit ships one; Vapi, Retell, and Deepgram Voice Agent all handle it at the platform level.
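
To make the barge-in idea concrete, here is a minimal energy-threshold sketch. It is not what LiveKit or any platform ships (production VADs are usually model-based); the frame size, threshold, and the `BargeInDetector` class are assumptions chosen for illustration.

```python
import math
import struct

THRESHOLD_RMS = 500.0  # assumed energy threshold for 16-bit PCM; tune per trunk
MIN_SPEECH_FRAMES = 3  # with 20 ms frames, 60 ms of sustained speech triggers


def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a little-endian 16-bit mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class BargeInDetector:
    """Flags barge-in when the caller speaks while TTS is playing."""

    def __init__(self) -> None:
        self.speech_frames = 0

    def should_interrupt(self, frame: bytes, tts_playing: bool) -> bool:
        # Count consecutive loud frames; any quiet frame resets the counter.
        if frame_rms(frame) >= THRESHOLD_RMS:
            self.speech_frames += 1
        else:
            self.speech_frames = 0
        return tts_playing and self.speech_frames >= MIN_SPEECH_FRAMES


# 20 ms of audio at 8 kHz is 160 samples (320 bytes).
loud = struct.pack("<160h", *([2000] * 160))
quiet = bytes(320)

detector = BargeInDetector()
for frame in (quiet, loud, loud, loud):
    if detector.should_interrupt(frame, tts_playing=True):
        print("caller is talking: stop TTS playback")
```

Requiring a few consecutive frames trades a little detection lag for fewer false triggers on line noise.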

Latency over 1.2 seconds? We’ll find the 400 ms you’re leaving on the table.

Send us a recording and a trace; our voice engineering team ships a written diagnosis in 48 hours.

Book a 30-min call →

STT leaders and word-error-rate math

Deepgram published a 2,703-file benchmark across nine domains (including podcast, meeting, phone, finance, medical, drive-thru, air-traffic, and voicemail audio) where Nova-3 lands at 5.26% batch WER and, by Deepgram’s own comparison, a streaming WER 54% lower than competing models. For phone audio, which is what actually matters here, Nova-3 multilingual reduces batch WER by 34% and streaming WER by 21% relative to Nova-2, and supports 10-language code-switching inside a single call.

OpenAI Whisper-v3 is competitive on clean US-English speech but degrades noticeably on accented or noisy phone audio. AssemblyAI Universal-2 is in the same ballpark as Nova-3 on English and behind on multilingual. Soniox is a niche pick for heavily accented or noisy phone conditions.

Our rule of thumb: if your calls include non-native English speakers or regional accents, switch to Nova-3 multilingual. The WER delta on accented phone audio is large enough to change CSAT meaningfully, and the price difference is marginal.
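
Vendor WER figures are measured on the vendor’s corpus, so it is worth computing the number on your own calls. A minimal sketch of the usual definition (word-level edit distance divided by reference length) follows; `word_error_rate` is a hypothetical helper, and it only lowercases the text, so punctuation and number formatting should be normalized before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


human = "please move my appointment to tuesday"
machine = "please move my appointment to thursday"
print(f"WER: {word_error_rate(human, machine):.1%}")
```

One wrong word in a six-word utterance is already far above a single-digit WER target, and here it is the word that determines the booking.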

TTS leaders and the “uncanny valley” test

ElevenLabs v3 leads published pronunciation-accuracy benchmarks (around 82%) and prosody (around 65%), with 70+ languages. For short customer-service turns it clears the uncanny valley for most listeners. The watchout is cost: premium voices run $0.15–$0.24/min on their own.

Cartesia Sonic is faster (90–120 ms first chunk) and slightly behind on voice quality for emotional delivery. It’s the best pick when your latency budget is very tight.

OpenAI TTS (gpt-4o-mini-tts / gpt-realtime voice) is the default when you’re already on OpenAI Realtime — good enough for most support flows; noticeably flatter than ElevenLabs on emotional turns.

Deepgram Aura-2 and PlayHT round out the shortlist. Aura-2 wins on predictable bundled billing. PlayHT has strong emotional and neutral voice variants and is a good pick when brand voice matters and you want per-voice control.

If you’re building in a regulated vertical where disclosure requirements apply (which by August 2, 2026 is the whole EU), keep the voice recognizably synthetic. The EU AI Act Article 50 requires the user to be informed that they’re interacting with AI; a recognizable-but-pleasant voice makes disclosure easier and lawsuits less likely.

Choosing the LLM — latency, tool-calling, and grounding

Voice agents live or die on three LLM capabilities: turn latency, tool-calling reliability, and grounding on short retrieval chunks. As of April 2026 the cluster we default to is GPT-5 (OpenAI), Gemini 2.5 Flash/Pro, and Claude 4.5. Each has a different cost shape and a different tool-calling style.

GPT-5 is the fastest on tool-heavy flows (function calls in parallel) and has the best developer tooling. It’s also the most expensive per minute; for high-volume outbound we drop to GPT-4o-mini or Gemini 2.5 Flash.

Gemini 2.5 Flash is the price/performance leader for simple inbound flows and multilingual support. If you already run on Google Cloud, CCaaS or otherwise, pick it by default.

Claude 4.5 hallucinates less on long-tail business-logic questions — we use it when the agent is making or confirming non-trivial bookings, billing adjustments, or medical-schedule changes. Slightly slower; safer.

Telephony — SIP trunks, STIR/SHAKEN, and why it matters

The telephony layer is where most build-your-own projects fall over. The issues are almost always the same: the wrong codec lives on the trunk (leading to echo), the wrong caller-ID is on outbound (low answer rate), or there’s no STIR/SHAKEN attestation at all (calls marked “likely spam” before they connect).

STIR/SHAKEN is the US caller-authentication standard that the FCC now enforces for any PSTN outbound. If your trunk provider can’t attest your caller-ID, your answer rate drops by 30–60%. Tier-1 carriers (Twilio, Telnyx, Vonage, SignalWire) attest by default; generic VoIP resellers often don’t. This is a technical detail that derails more outbound projects than any model choice.

Codec-wise, the default on US PSTN is G.711 μ-law at 8 kHz. Wideband (Opus/G.722, 16 kHz) improves STT accuracy meaningfully but only works if both legs of the call are wideband — in practice, only on SIP-to-SIP.
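
If a component in your pipeline expects 16-bit linear PCM rather than G.711, the μ-law bytes coming off the trunk have to be expanded first. Below is a minimal sketch of the standard μ-law expansion; the helper names are hypothetical, and in production this step is normally handled by the media server or an audio library.

```python
import struct

ULAW_BIAS = 0x84  # bias added by the G.711 mu-law encoder


def ulaw_byte_to_linear(u: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample."""
    u = ~u & 0xFF                 # mu-law codes are transmitted bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Convert a mu-law payload to little-endian 16-bit PCM."""
    samples = [ulaw_byte_to_linear(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


# 20 ms of mu-law silence at 8 kHz: 160 bytes in, 320 bytes of PCM out.
pcm = ulaw_to_pcm16(bytes([0xFF] * 160))
print(len(pcm))
```

Expansion changes the sample format only; the audio is still 8 kHz narrowband, so it does not recover the accuracy gain of a true wideband leg.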

Compliance — TCPA, EU AI Act, HIPAA, two-party consent

US — TCPA and STIR/SHAKEN. The FCC’s February 2024 declaratory ruling classified AI-generated voices as “artificial or prerecorded” under the TCPA. Translation: using an AI-generated voice (cloned or not) for outbound robocalls without prior express written consent is illegal, with statutory damages per call. Your campaign playbook must include consent proofs and cadence limits. STIR/SHAKEN attestation is mandatory on outbound calls.

EU — AI Act Article 50. Binding on August 2, 2026. If the call reaches a person in the EU, that person must be informed at the start that they’re interacting with an AI system. Penalties for breaching the transparency obligations reach €15 M or 3% of global annual turnover, whichever is higher. Build the disclosure into the greeting prompt and log the recording.

Healthcare — HIPAA. You need a signed BAA with the platform vendor, encryption in transit and at rest on recordings and transcripts, audited access controls, and documented data flow into EHR systems (Epic / Cerner). Retell, Deepgram Voice Agent, Twilio, and Amazon Connect all carry BAAs; Vapi and the Tier-1 LLM providers require individual negotiation.

US states — two-party-consent recording. Eleven states (CA, CT, FL, IL, MD, MA, MT, NV, NH, PA, WA) require all parties on a call to consent to recording. Practical default: disclose and obtain consent at the start of every call, regardless of state, and log the timestamp of the consent audio segment.

GDPR. Voice is personal data. Minimize retention of raw audio (we usually delete within 30 days unless contracted otherwise), keep transcripts and PII separately classified, and surface a DSR (data-subject-request) endpoint in your admin tooling from day one.
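
The “disclose and obtain consent on every call” default can be wired into the greeting and the call log. The sketch below is one hypothetical shape for it: the greeting wording, the `ConsentRecord` fields, and the function names are ours, and the state list is the one quoted above, so have counsel confirm both the list and the wording before using anything like this.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# All-party-consent states as listed in this guide (verify before relying on it).
ALL_PARTY_CONSENT_STATES = frozenset(
    {"CA", "CT", "FL", "IL", "MD", "MA", "MT", "NV", "NH", "PA", "WA"}
)


@dataclass(frozen=True)
class ConsentRecord:
    call_id: str
    caller_state: str
    all_party_state: bool   # kept for audit reports only
    ai_disclosed: bool
    recording_consent: bool
    logged_at: str          # ISO-8601 UTC time the consent answer was logged


def build_greeting(company: str) -> str:
    """Put the AI disclosure and the recording question in the first utterance."""
    return (
        f"Thanks for calling {company}. You are speaking with an AI assistant. "
        "This call is recorded. Is that okay?"
    )


def log_consent(call_id: str, caller_state: str, caller_agreed: bool) -> ConsentRecord:
    state = caller_state.upper()
    return ConsentRecord(
        call_id=call_id,
        caller_state=state,
        all_party_state=state in ALL_PARTY_CONSENT_STATES,
        ai_disclosed=True,
        recording_consent=caller_agreed,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )


def may_record(record: ConsentRecord) -> bool:
    """Record only after disclosure and consent, regardless of state."""
    return record.ai_disclosed and record.recording_consent


record = log_consent("call-001", "ca", caller_agreed=True)
print(build_greeting("Example Clinic"))
print(may_record(record), record.all_party_state)
```

Treating every call as if it came from an all-party-consent state keeps the recording logic to a single branch; the state flag is retained only for reporting.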

Cost model — what 30,000 minutes a month actually costs

A typical mid-market deployment is 10,000 calls × 3-minute average = 30,000 minutes per month. On a standard bring-your-own stack (LiveKit + Deepgram Nova-3 + GPT-5 + ElevenLabs v3) the component cost per minute is:

Component           | $/min  | Monthly @ 30k min
PSTN / SIP (Twilio) | $0.010 | $300
LiveKit media       | $0.005 | $150
Deepgram STT        | $0.013 | $390
GPT-5 turn          | $0.060 | $1,800
ElevenLabs TTS      | $0.090 | $2,700
Total all-in        | $0.178 | $5,340

At Retell’s $0.07 all-in rate the same volume is $2,100/month, a roughly 60% savings at the cost of stack flexibility. Vapi at $0.144 all-in lands at $4,320. Twilio Autopilot-style flows at $0.141 land near $4,230. The delta between the cheapest and most expensive path is roughly $3k/month per 30k minutes, which almost always disappears the moment the expensive path saves you two engineering weeks of orchestration work.
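
The same arithmetic is easy to rerun with your own rates and volume. This sketch reproduces the bring-your-own total from the component table and compares it with a flat $0.07/min bundled rate; the variable names are hypothetical, and rates are held in mills (thousandths of a dollar) so the sums stay exact.

```python
MINUTES_PER_MONTH = 10_000 * 3  # 10,000 calls x 3-minute average

# Component rates from the table above, in mills per minute.
BYO_STACK_MILLS = {
    "PSTN / SIP (Twilio)": 10,
    "LiveKit media": 5,
    "Deepgram STT": 13,
    "GPT-5 turn": 60,
    "ElevenLabs TTS": 90,
}
BUNDLED_PLATFORM_MILLS = 70  # the $0.07/min bundled rate quoted above


def monthly_usd(mills_per_min: int, minutes: int = MINUTES_PER_MONTH) -> float:
    """Monthly cost in dollars for a per-minute rate given in mills."""
    return mills_per_min * minutes / 1000


byo_mills = sum(BYO_STACK_MILLS.values())
byo_monthly = monthly_usd(byo_mills)
bundled_monthly = monthly_usd(BUNDLED_PLATFORM_MILLS)
savings = 1 - bundled_monthly / byo_monthly

print(f"BYO stack: ${byo_mills / 1000:.3f}/min, ${byo_monthly:,.0f}/month")
print(f"Bundled:   ${bundled_monthly:,.0f}/month ({savings:.0%} lower)")
```

Swapping in a cheaper TTS tier or a smaller LLM is a one-line change, which makes it easy to see which component dominates the bill.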

Mini case — HIPAA tele-health, 12-week plan, before/after

Situation. A US tele-health platform with 120 clinicians was handling appointment reminders and pre-call intake via a human call center. Cost per outbound reminder was $1.20, and 38% of calls were never completed because patients hung up on the IVR. They wanted an AI assistant that could handle reminders, reschedule when patients asked, and escalate to a human for anything clinical.

12-week plan. Weeks 1–2: BAA signed with Deepgram Voice Agent and Twilio; HIPAA audit scope defined. Weeks 3–6: integration with Epic via FHIR, call flows for reminder, confirm, and reschedule, human-escalation to the existing agent desk. Weeks 7–9: pilot on 5% of volume, iterating on ASR on accented speakers and barge-in timing. Weeks 10–12: rollout to 100%, two-party-consent disclosure on every call, PHI redaction in stored transcripts.

Outcome. Cost per reminder dropped from $1.20 to $0.34 (72% lower). Completion rate rose from 62% to 87% because the AI agent could answer the “what is this call about?” question that the legacy IVR couldn’t. Rebook rate improved 19 points. Zero HIPAA findings on the first audit. Want a similar assessment for your stack? Book a 30-minute review.

A decision framework — pick the stack in five questions

Q1. What’s your call volume? Under 100k minutes/month lets a platform (Vapi, Retell, Synthflow) carry you. Over that, the engineering cost of a bring-your-own stack amortizes fast.

Q2. Inbound, outbound, or agent-assist? Outbound at volume forces STIR/SHAKEN, consent proofs, and TCPA exposure. Agent-assist skips most of it. Inbound is the middle ground.

Q3. Languages and accents? If you serve non-English speakers or code-switching, Deepgram Nova-3 Multilingual is the default; Whisper-v3 is viable for clean high-resource languages. Monolingual English is much wider open.

Q4. Regulation? HIPAA, PCI, GDPR, EU AI Act, US state two-party consent — each forces vendor choices. BAA-capable vendors are a small subset. Document this first, pick platform second.

Q5. Do you want speech-to-speech or pipeline? Speech-to-speech (OpenAI Realtime, Gemini Live, Azure Voice Live) is faster but less observable. Pipeline (STT → LLM → TTS) is easier to debug, swap, and log — still our default for regulated production.

Five pitfalls that kill AI call deployments

1. Echo on PSTN. On G.711 PSTN legs, line hybrids and speakerphones can leak the agent’s own TTS audio back into the inbound stream. Without active echo cancellation the STT transcribes that audio, and the LLM hears itself and loops. Fix: enable AEC on the media server and validate on every SIP trunk you add; don’t trust the vendor default.

2. Hallucinated bookings. The agent confirms a Tuesday 3pm that doesn’t exist in the calendar. Fix: never let the LLM commit a transaction from its output alone. Always make the tool call, wait for the 2xx, and read the real confirmation back to the caller.

3. Tool-call timeouts. External APIs (CRM, calendar, billing) go slow; the LLM blocks and TTS stalls. Fix: set a hard 700 ms timeout on every tool, and prepare a filler utterance (“one moment while I pull that up”) the agent can speak while waiting.

4. Bad escalation handoff. The human agent picks up cold with no context. Fix: generate a 1-sentence call-context summary on escalation, pass it with the transfer, and play a recording of the last customer utterance to the human in their softphone.

5. Missing consent and disclosure. The first lawsuit on EU AI Act Article 50 will be a greeting that didn’t mention AI. Fix: bake disclosure into the first TTS utterance; log the transcript with timestamp; store for the retention window your regulator mandates.
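
Pitfalls 2 and 3 share one guard: run every tool under a hard timeout, speak a filler while it is pending, and confirm nothing unless a real result came back. The asyncio sketch below is a minimal version under stated assumptions: `call_tool` is a hypothetical helper, `speak` is assumed to queue text for TTS without blocking, and the 300 ms filler delay is our choice (only the 700 ms limit comes from the list above).

```python
import asyncio
import contextlib
from typing import Any, Awaitable, Callable, Optional

TOOL_TIMEOUT_S = 0.7  # hard limit from pitfall 3
FILLER_AFTER_S = 0.3  # assumed: speak the filler if the tool is still running
FILLER = "One moment while I pull that up."


async def call_tool(
    tool: Callable[[], Awaitable[Any]],
    speak: Callable[[str], None],
    timeout_s: float = TOOL_TIMEOUT_S,
    filler_after_s: float = FILLER_AFTER_S,
) -> Optional[Any]:
    """Run a tool with a hard timeout; return None if it did not finish."""
    task = asyncio.ensure_future(tool())
    done, _ = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        speak(FILLER)  # keep the line alive while the backend responds
        done, _ = await asyncio.wait({task}, timeout=timeout_s - filler_after_s)
    if not done:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        return None
    return task.result()


async def demo() -> None:
    spoken: list = []

    async def slow_calendar_lookup() -> dict:
        await asyncio.sleep(2)  # simulates a slow CRM or calendar API
        return {"status": 200, "slot": "Tuesday 15:00"}

    result = await call_tool(slow_calendar_lookup, spoken.append)
    # Pitfall 2: without a confirmed result the agent must not announce a booking.
    confirmed = result is not None and result["status"] == 200
    print("confirmed" if confirmed else "not confirmed", spoken)


if __name__ == "__main__":
    asyncio.run(demo())
```

Returning None on timeout forces the caller of `call_tool` to handle the failure path explicitly instead of letting the LLM improvise a confirmation.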

KPIs — what to measure on day one

Quality KPIs. Word error rate on your call corpus (target < 8%), containment rate (% of calls resolved without human), CSAT on post-call SMS (target ≥ 4.3 / 5), hallucination rate on booked actions (target 0%). Review weekly against 200 randomly sampled, human-labeled calls.

Business KPIs. Cost per resolved call (compare to human baseline), abandonment rate vs legacy IVR (target –30%), conversion rate on outbound (benchmark against human agents), upsell rate on inbound (AI often beats human here for mid-market).

Reliability KPIs. Voice-to-voice p50 and p95 latency (targets ≤ 800 ms and ≤ 1.4 s), tool-call p95 latency, barge-in detection lag (target ≤ 150 ms), error rate on PSTN (SIP 5xx). These are the ones that wake you up at 3am.
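
For the latency targets, a nearest-rank percentile over logged voice-to-voice timings is enough to start with. The sketch below is one way to compute it; the sample data is invented for illustration and `percentile` and `latency_report` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

TARGETS_MS = {50: 800, 95: 1400}  # p50 and p95 targets from the KPI list above


def percentile(samples: List[int], p: int) -> int:
    """Nearest-rank percentile (p from 1 to 100) of latency samples in ms."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers
    return ordered[rank - 1]


def latency_report(samples: List[int]) -> Dict[int, Tuple[int, bool]]:
    """Map each percentile to (observed value, whether it meets the target)."""
    return {p: (percentile(samples, p), percentile(samples, p) <= limit)
            for p, limit in TARGETS_MS.items()}


# Invented voice-to-voice timings (ms) for 20 turns.
turns_ms = [620, 640, 700, 710, 730, 750, 760, 780, 790, 800,
            810, 830, 850, 900, 950, 1000, 1100, 1200, 1350, 1900]

for p, (value, ok) in latency_report(turns_ms).items():
    print(f"p{p}: {value} ms ({'ok' if ok else 'over target'})")
```

The single 1,900 ms outlier does not move p95 in this sample, which is why tracking the worst turn per call alongside the percentiles is a useful extra signal.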

Build vs buy — the only useful checklist

Buy a platform (Vapi, Retell, Synthflow, ElevenLabs Conversational AI, Bland) when your call flow is broadly industry-standard, your call volume is under ~500k minutes/month, you have under four weeks to first production traffic, you’re comfortable with vendor lock-in on the orchestration layer, and your compliance envelope is narrow (no HIPAA, no PCI, no on-prem).

Build (LiveKit + Deepgram + OpenAI/Claude/Gemini + ElevenLabs/Cartesia) when you’re over that volume threshold, you’re in a regulated industry, you need custom tool-calling against internal systems (EHR, core banking, dispatch), you want your own observability and swap-able model layer, or your latency budget is under 700 ms voice-to-voice.

Buy an enterprise CCaaS (Twilio + Autopilot, Amazon Connect + Lex, Dialogflow CX + Gemini, Azure Communication Services + OpenAI) when your procurement requires a single Tier-1 vendor, you’re already on their cloud, or your enterprise service desk already runs on their platform and you want the AI layer bolted on rather than rebuilt.

When not to deploy an AI call assistant

Don’t deploy when the call is the most valuable customer interaction. High-touch B2B sales, clinical diagnosis, grief counseling, and high-value account management still outperform AI on retention and NPS — and the reputational risk of a failure is much larger than the savings. For those workloads, an agent-assist copilot is the right level.

Don’t deploy when you can’t instrument. If you can’t measure containment, CSAT, and hallucination rate weekly, the agent drifts silently and you find out via a customer-complaint tweet six weeks later. Observability comes first, deployment second.

Don’t deploy where the regulator hasn’t ruled. Some jurisdictions are still writing AI voice regulation — if the compliance team can’t point at a specific statute or guidance, pilot agent-assist first and wait on customer-facing deployment.

Planning a voice agent rollout in a regulated industry?

Fora Soft has shipped HIPAA, GDPR, and TCPA-compliant voice deployments since 2023. We’ll map your compliance envelope and stack in one call.

Book a 30-min call →

Industries shipping value with AI call assistants in 2026

Healthcare. Appointment reminders, pre-call intake, post-visit follow-up, benefits verification. HIPAA-audited vendors only; BAAs non-negotiable. Tele-health platforms see 60–70% cost-per-call reduction.

Financial services. Inbound account status, collections (heavily regulated), mortgage intake, insurance claims first-notice-of-loss. Recording rules, PCI for card data, state-level consumer-protection layers.

EdTech and LMS. Enrollment, course advising, exam scheduling, attendance. Low regulatory overhead; high multilingual requirement. Good fit for Vapi or Retell if volume is moderate.

Logistics and field service. Dispatch confirmation, appointment windows, reschedule, cancellations. High tool-calling volume against dispatch systems; real-time calendar is the critical integration.

Real estate. Lead qualification, showing-scheduling, tenant screening. Outbound-heavy, so TCPA exposure matters more than in most verticals.

Retail and e-commerce. Order status, returns, post-purchase upsell. Often delivered as omnichannel (voice + SMS + chat) — one reason platforms with multichannel surfaces (Twilio, Intercom Fin, Zendesk’s AI agents) win this segment.

A 12-week deployment playbook

Weeks 1–2. Compliance scoping (HIPAA, GDPR, TCPA, state consent), vendor BAA / DPA signed, one call flow selected, KPIs agreed with the business.

Weeks 3–5. Integration with one backend system (CRM or calendar), disclosure greeting written in every language, tool-calling schema finalized, barge-in tuned.

Weeks 6–8. Pilot on 5–10% of volume, daily calibration against 50 human-labeled calls, latency p95 measurement, escalation handoff dialed in.

Weeks 9–11. Scale to 50% of volume, add a second call flow, implement PII redaction on stored transcripts, run first compliance dry run.

Week 12. 100% rollout, KPI dashboard live, weekly calibration cadence set, post-mortem on the pilot, roadmap for the next two flows.

FAQ

What is an AI call assistant in 2026?

A real-time voice agent that terminates a PSTN/SIP call, transcribes audio with a speech-to-text model, reasons with an LLM (often with tool-calling), and speaks a reply via text-to-speech — all in under 800 ms end-to-end. Modern systems also handle barge-in, escalation to human agents, and multilingual code-switching on the same call.

Which API should I pick — Vapi, Retell, or OpenAI Realtime?

Vapi for bring-your-own-stack flexibility. Retell for a tighter bundled outbound flow at lower all-in cost. OpenAI Realtime for the lowest latency, when you’re comfortable wiring your own telephony (LiveKit or Twilio Media Streams). For regulated industries or high volume we default to a LiveKit + Deepgram + Claude 4.5 + ElevenLabs build.

What does an AI call assistant actually cost per minute?

Vendor headline rates are $0.05–$0.10/minute for orchestration only. All-in (including STT, LLM, TTS, and telephony) the real range is $0.13–$0.33/min depending on model choices. Our standard BYO build (LiveKit + Nova-3 + GPT-5 + ElevenLabs v3) lands near $0.18/min.

Is it legal to use AI voices on outbound calls in the US?

Since the FCC’s February 2024 declaratory ruling, AI-generated voices are “artificial or prerecorded” under the TCPA. That means you need prior express written consent for outbound AI calls to consumers, plus STIR/SHAKEN attestation on your trunk. Without those, you’re exposed to statutory damages per call.

Does the EU AI Act apply to my voice bot?

If the call reaches a person in the EU, yes. Article 50 takes effect August 2, 2026, and requires clear disclosure that the user is interacting with an AI system. Penalties for breaching these transparency obligations are up to €15 M or 3% of global annual turnover, whichever is higher. Disclose in the greeting, log the transcript, and keep the recording.

How do AI call assistants handle multiple languages?

Deepgram Nova-3 Multilingual handles 10-language code-switching inside a single call with ~34% lower batch WER than the previous generation. Whisper-v3 supports more languages but is slower and less accurate on phone audio. Most LLMs (GPT-5, Gemini 2.5, Claude 4.5) respond correctly in the detected language without extra configuration.

Can AI call assistants transfer to a human agent?

Yes — with warm transfer as the default. The AI passes a one-sentence context summary plus the last customer utterance to the human agent’s softphone before the handoff completes. This preserves state and stops the “say it all again” problem that kills CSAT on legacy IVR.

How long does an AI call assistant project take?

A basic platform pilot ships in 2–4 weeks. A production-grade bring-your-own-stack build with a single business-system integration and one language takes 8–12 weeks. Multi-language, regulated-industry deployments with multiple escalation paths run 3–5 months.

Ready to ship an AI call assistant that actually converts?

The 2026 stack is mature: sub-800 ms latency is achievable with any serious platform, multilingual ASR with Deepgram Nova-3 handles real-world phone noise, and compliance paths for HIPAA, TCPA, GDPR, and the EU AI Act are well-trodden. The decisions that still matter are use case, volume, regulated envelope, and whether speed-to-market or control-over-stack pays off better in your business.

If you’re shipping a platform pilot this quarter, pick Vapi or Retell, wire Deepgram Nova-3 for ASR and ElevenLabs v3 for voice, set your latency target at 800 ms p50, and run a 5% pilot against a human baseline. If you’re shipping into a regulated industry or heading north of a million minutes a month, budget a 10–12-week build on LiveKit with your own STT/LLM/TTS selections and a proper observability stack from day one. Either path gets you to production — the difference is whether you’re renting infrastructure in 18 months or owning it.

Either way, Fora Soft has shipped the pattern you’re about to build. Bring a call recording, a flow diagram, and your compliance envelope; we’ll return with a stack shortlist, a cost model, and a 12-week delivery plan.

Let’s architect your AI call assistant, end to end.

30 minutes with our voice engineering lead: stack, compliance, cost model, and 12-week delivery plan.

Book a 30-min call →
