
Key takeaways
• A video AI agent is a real-time pipeline, not a single model. It captures audio + video, runs ASR, vision and LLM reasoning, then acts (replies, summarises, escalates, drives a UI). Production systems orchestrate 5–10 components on a roughly one-second loop.
• The 2026 reference stack is LiveKit + Whisper + GPT-4o (or Llama 3.3 70B) + ElevenLabs / Cartesia + a vision model. Daily, Vapi and Pipecat sit in the same architectural slot. The full glass-to-glass loop runs at 600–1,200 ms when tuned.
• Latency is the product. Below 800 ms the agent feels human; above 1,500 ms it feels broken. Every architectural choice (streaming ASR, partial-LLM, sentence-boundary TTS) is a latency optimisation.
• The cost curve is brutal but solvable. A naive build runs $0.30–$1.50 / minute on closed APIs; a tuned hybrid (LiveKit self-host + open-weight Llama + Whisper) lands at $0.04–$0.10. Pick the right deployment pattern early.
• Fora Soft has shipped video AI agents in production for 5+ years. Sales-meeting copilots, telehealth triage, real-time interpretation, live commerce hosts. Book a 30-min call.
Why Fora Soft wrote this video AI agents guide
Fora Soft has shipped real-time video and AI products since 2005. We have built sales-meeting copilots on Meetric, real-time interpretation on TransLinguist and VOLO, and we operate stacks that combine LiveKit, Whisper, GPT-4o, Llama 3.3, ElevenLabs and custom vision pipelines.
This guide is the conversation we have when a product manager asks “what does a video AI agent actually look like inside?”. It is opinionated, vendor-neutral and grounded in client work that has to hit sub-second latency and predictable unit economics. Visit our AI integration practice to see the projects this playbook is grounded in.
We use Agent Engineering internally, which is why our builds typically ship 30–50 % faster than those of agencies still doing this work by hand.
Want to ship a video AI agent in your product?
We will turn the architecture below into a working prototype on your traffic in 4–6 weeks — with eval set, latency budget and unit economics.
A video AI agent, defined
A video AI agent is software that joins a live audio + video session, perceives what is happening, reasons about it with an LLM, and acts — speaking, summarising, triggering UI, escalating to a human, or all four. It is the streaming-media equivalent of an LLM agent: same loop, harder real-time constraints.
The minimum viable agent has four stages: perceive (audio in, optional vision in), understand (ASR + intent / vision parsing), think (LLM reasoning, tool calls, retrieval), act (TTS out, function calls, UI events). Every commercial product (LiveKit Agents, Daily AI, Vapi, Retell, Pipecat) implements the same shape with different defaults.
Two qualifiers worth setting straight. “Real-time” in this domain means the loop closes in <1 s most of the time. “Multimodal” means more than one input stream — usually audio + video, sometimes audio + screen-share, sometimes audio + UI events. Our LiveKit multimodal agents guide goes deep on the stack itself.
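The four stages map naturally onto a small function pipeline. The sketch below is illustrative only: `Turn`, `run_turn` and the `fake_*` stubs are hypothetical names standing in for a real ASR, LLM and TTS / tool layer, and it processes one complete turn rather than a live stream.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """One pass through the perceive -> understand -> think -> act loop."""
    audio: bytes
    transcript: str = ""
    reply: str = ""
    actions: List[str] = field(default_factory=list)


def run_turn(
    audio: bytes,
    understand: Callable[[bytes], str],  # stand-in for streaming ASR
    think: Callable[[str], str],         # stand-in for the LLM call
    act: Callable[[str], List[str]],     # stand-in for TTS and tool calls
) -> Turn:
    turn = Turn(audio=audio)                  # perceive
    turn.transcript = understand(turn.audio)  # understand
    turn.reply = think(turn.transcript)       # think
    turn.actions = act(turn.reply)            # act
    return turn


# Hypothetical stubs so the sketch runs without any vendor SDK.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_act(reply: str) -> List[str]:
    return [f"speak:{reply}"]


turn = run_turn(b"hello", fake_asr, fake_llm, fake_act)
print(turn.actions)
```

Swapping a stub for a real component changes one argument, which is roughly the shape the commercial frameworks expose with their own defaults.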
The reference architecture — nine components on a 1-second loop
The picture every team draws on the whiteboard:
1. Real-time transport. LiveKit, Daily, Twilio, Vonage or a self-hosted SFU. Carries audio and video both directions, sub-300 ms one-way.
2. Streaming ASR. Whisper Large v3 (HF), Deepgram Nova-3, AssemblyAI Streaming. Returns partials within 200 ms.
3. Vision frame ingestion. Optional. Periodic frame capture (every 1–5 s) sent to a vision-capable LLM.
4. Turn detection. Detects end of user utterance to trigger LLM. VAD-based, learned, or vendor-managed.
5. LLM core. GPT-4o, Claude 3.5 Sonnet, Llama 3.3 70B, Qwen 3.5, DeepSeek V3.2. Streaming completion, tool / function calling, optional vision.
6. Retrieval / RAG. Vector store + embeddings; pulls product, customer, knowledge-base context into the prompt.
7. Tool calls. Function-call schema for everything the agent can do (look up order, book appointment, send email, escalate).
8. TTS. ElevenLabs, Cartesia Sonic, OpenAI TTS, Deepgram Aura. Streaming output keyed on sentence boundaries to begin speaking before the LLM finishes.
9. Orchestration / state. A small service holding session state, conversation memory, retry / fallback policies. LiveKit Agents and Pipecat both ship this.
The reasoning loop closes in 600–1,200 ms when tuned. A typical breakdown is ASR partial at about 200 ms, turn detection at about 100 ms, LLM first token at 200–400 ms, TTS first audio at 100–200 ms, and network plus jitter buffer at 200–400 ms. Above 1.5 s the conversation feels broken.
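Sentence-boundary TTS, as described in the component list above, comes down to buffering streamed LLM tokens and releasing text as soon as a sentence closes. A minimal sketch, assuming tokens arrive as plain strings and that `.`, `!` or `?` followed by whitespace ends a sentence; `sentences_from_tokens` is a hypothetical helper, not part of any vendor SDK.

```python
import re
from typing import Iterable, Iterator

# ASSUMPTION: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last is still open.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Hel", "lo there", ". How", " are", " you?", " Fine"]
chunks = list(sentences_from_tokens(llm_tokens))
print(chunks)  # ['Hello there.', 'How are you?', 'Fine']
```

Each yielded chunk could be handed to the TTS engine while the LLM is still generating. A production splitter would also need to handle abbreviations such as "e.g.", which this regex would split on.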
The latency budget — where every millisecond goes
| Stage | Tuned 2026 budget | Lever to pull |
|---|---|---|
| Audio capture & encode | 20–40 ms | Smaller frame size, hardware AAC |
| Transport (one-way) | 100–200 ms | Closer SFU region, WebRTC tuning |
| Streaming ASR (partial) | 150–250 ms | Smaller chunks, real-time models |
| Turn detection | 50–150 ms | Learned VAD over silence threshold |
| LLM time-to-first-token | 200–500 ms | Smaller model, prompt cache, vLLM |
| TTS time-to-first-audio | 100–200 ms | Streaming TTS, sentence-boundary commit |
| Return transport + jitter buffer | 100–200 ms | Adaptive jitter, DTLS-SRTP fast-path |
Aim for <800 ms loop latency. That is the threshold at which a video AI agent feels human. Above 1.5 s the user and the agent start talking over each other and trust collapses.
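It helps to add the table up before arguing about any single row. The sketch below sums the low and high ends of each stage, under the simplifying assumption that the stages run strictly one after another (in a streaming pipeline some of them overlap, so treat the totals as upper-bound arithmetic rather than a measurement).

```python
# Stage budgets in milliseconds, copied from the table above: (low, high).
BUDGET_MS = {
    "audio capture & encode": (20, 40),
    "transport (one-way)": (100, 200),
    "streaming ASR (partial)": (150, 250),
    "turn detection": (50, 150),
    "LLM time-to-first-token": (200, 500),
    "TTS time-to-first-audio": (100, 200),
    "return transport + jitter buffer": (100, 200),
}

TARGET_MS = 800  # the "feels human" threshold from the text

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())
headroom = TARGET_MS - best_case

print(f"best case:  {best_case} ms")   # best case:  720 ms
print(f"worst case: {worst_case} ms")  # worst case: 1540 ms
print(f"headroom:   {headroom} ms")    # headroom:   80 ms
```

On a straight sum, only the low end of every row clears the 800 ms target, with 80 ms to spare, while the high end of every row lands just past 1.5 s.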
The 2026 reference stack — vendor cheat sheet
| Layer | Closed / managed | Open / self-host |
|---|---|---|
| Real-time transport | LiveKit Cloud, Daily, Twilio, Vonage | LiveKit OSS, mediasoup, Janus |
| ASR | Deepgram Nova-3, AssemblyAI, OpenAI | Whisper Large v3 (HF), faster-whisper |
| LLM | GPT-4o, Claude 3.5 Sonnet, Gemini | Llama 3.3 70B, Qwen 3.5, DeepSeek V3.2 |
| Vision | GPT-4o vision, Gemini 2.0 multimodal | Llama 3.2 Vision, Qwen-VL, MiniCPM-V |
| TTS | ElevenLabs, Cartesia Sonic, OpenAI TTS | Coqui XTTS v2, OpenVoice, F5-TTS |
| Vector / RAG | Pinecone, Turbopuffer | Qdrant, Weaviate, pgvector |
| Orchestration | LiveKit Agents, Vapi, Retell, Daily Bots | Pipecat, smolagents, custom |
| Observability | LangSmith | Langfuse, OpenTelemetry, Grafana |
How a video AI agent differs from a voice bot
Voice agents and video AI agents share most of their stack but diverge in three places. Knowing where to invest stops you from over-engineering one and under-investing in the other.
Vision adds an extra cost and latency dimension. Frames are tokens; multimodal LLMs charge per image. A naive “1 frame / second” setup can dominate the bill before you notice. Plan vision frame gating from day one.
Turn detection is harder to build but has more to work with. Video lets you exploit visual cues (gaze direction, mouth movement) for end-of-utterance detection. Done well, this is a 50–150 ms latency win over voice-only VAD.
UX expectations are higher. Users see the agent’s avatar or video output; uncanny-valley issues, lip-sync, and visual fillers (typing indicators, “thinking” animations) become product requirements, not engineering footnotes.
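For reference, the voice-only baseline that visual cues or a learned model would improve on is a plain silence threshold. A minimal sketch, assuming per-frame audio energies are already computed and normalised to 0..1; the function name, the threshold and the frame count are all illustrative values, not tuned settings.

```python
from typing import Optional, Sequence


def end_of_utterance_index(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.1,  # ASSUMPTION: energies normalised to 0..1
    min_silence_frames: int = 5,    # e.g. 5 x 20 ms frames = 100 ms of silence
) -> Optional[int]:
    """Return the frame index at which the utterance is declared over, else None."""
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0           # speech resets the silence counter
        elif heard_speech:
            silent_run += 1          # only count silence after speech started
            if silent_run >= min_silence_frames:
                return i
    return None


energies = [0.0, 0.5, 0.6, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0]
print(end_of_utterance_index(energies, min_silence_frames=3))  # 8
```

The short pause at frames 3 and 4 does not end the turn because speech resumes at frame 5. The trade-off is visible in `min_silence_frames`: a lower value cuts latency but clips speakers who pause mid-sentence.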
Eval and continuous improvement — how to keep the agent getting better
A video AI agent is only as good as the eval set you run it against. The 2026 process most successful teams converge on:
1. Hand-grade 50–200 conversations. Real production transcripts, scored by a domain expert on a 1–5 scale across the dimensions that matter (accuracy, tone, action correctness, safety).
2. Automate the eval with an LLM judge. A second model grades the agent’s responses against the same rubric, calibrated against the human grades. Lets you regression-test on hundreds of conversations per change.
3. Trace everything. Langfuse or LangSmith captures the full conversation, prompt, model output, tool calls and timing. Hallucinations and regressions become specific lines in a trace, not vague tickets.
4. Loop conversations back into the eval set. Every escalation, every “thumbs down”, every customer complaint becomes a new graded example. The eval set grows; quality improves.
5. Gate model swaps on the eval. When GPT-4o becomes GPT-5, when Llama 3.3 becomes Llama 4, run the eval before flipping. The wrong model swap can drop quality 10 % silently.
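Steps 2 and 5 reduce to two small checks: how closely the LLM judge tracks the human grades, and whether a candidate model holds the baseline score. A minimal sketch on 1–5 grades; the function names, the one-point tolerance and the `max_drop` threshold are assumptions to be replaced with your own rubric.

```python
from typing import Sequence


def judge_agreement(human: Sequence[int], judge: Sequence[int], tolerance: int = 1) -> float:
    """Share of conversations where the judge is within `tolerance` points of the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty grade lists")
    close = sum(1 for h, j in zip(human, judge) if abs(h - j) <= tolerance)
    return close / len(human)


def allow_model_swap(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.1,  # ASSUMPTION: tolerated drop in mean score, in rubric points
) -> bool:
    """Allow the swap only if the candidate's mean score has not dropped too far."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return candidate >= baseline - max_drop


human_grades = [5, 4, 3, 2, 1]
judge_grades = [5, 3, 3, 4, 1]
print(judge_agreement(human_grades, judge_grades))       # 0.8
print(allow_model_swap([4, 4, 4, 4], [4, 4, 4, 3]))      # False
```

Running `allow_model_swap` in CI on the judged eval set is one way to catch a quality drop before it reaches production.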
Five use cases that pay back fastest
1. Sales-call copilots and summaries. Real-time call transcription, action-item extraction, CRM auto-update, post-call summary. The pattern Meetric ships in production. Payback under 6 months for any sales team above 20 reps.
2. Telehealth triage. Symptom intake, vital-sign capture (via vision), pre-visit summary for the clinician, automatic note generation. Cuts clinician overhead 30–40 %; HIPAA via self-hosted Llama.
3. Real-time interpretation and translation. Two-way voice translation in video calls, with sub-second latency. TransLinguist and VOLO show the architecture.
4. Live commerce hosts. Always-on streamer that answers product questions in chat or voice during a live shopping show. Real-time inventory lookup as a tool call.
5. Customer support escalation. AI agent triages, resolves common cases, escalates complex ones to a human with full context attached. Average handling time drops 30–50 % in pilots we have measured.
Start with sales summaries or telehealth triage. Both have clear ROI, low regulatory friction, and produce labelled data that improves the agent quickly.
Cost model — per-minute economics on three real stacks
| Stack | Per-minute cost | Notes |
|---|---|---|
| Closed APIs (Twilio + Deepgram + GPT-4o + ElevenLabs) | ~$0.30–$1.50 | Fastest to ship; highest margin loss |
| Hybrid (LiveKit Cloud + Whisper + GPT-4o-mini + Cartesia) | ~$0.10–$0.25 | Production sweet spot for most teams |
| Self-hosted open (LiveKit OSS + Whisper + Llama 3.3 70B vLLM + XTTS) | ~$0.04–$0.10 | Below 100k min / mo, ops cost dominates |
Two non-obvious facts. TTS is often the largest single line item: ElevenLabs at $0.30 / 1,000 chars adds up faster than the LLM bill on chatty agents. Vision frames balloon the LLM cost: sending a frame every second of a 30-minute call to GPT-4o vision can 5× the LLM bill versus audio-only.
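Both effects are easy to check with back-of-envelope arithmetic. The sketch below uses the $0.30 / 1,000 chars figure quoted above; the agent's speaking rate of 400 characters per minute is an assumption, so substitute a number measured on your own transcripts.

```python
TTS_USD_PER_1K_CHARS = 0.30  # figure quoted in the text
AGENT_CHARS_PER_MIN = 400    # ASSUMPTION: how much the agent speaks per call-minute

tts_usd_per_min = TTS_USD_PER_1K_CHARS * AGENT_CHARS_PER_MIN / 1000

CALL_MINUTES = 30
call_seconds = CALL_MINUTES * 60
frames_at_1_per_s = call_seconds // 1  # one frame every second
frames_at_1_per_5s = call_seconds // 5  # one frame every five seconds

print(f"TTS cost per minute: ${tts_usd_per_min:.2f}")  # TTS cost per minute: $0.12
print(frames_at_1_per_s, frames_at_1_per_5s)           # 1800 360
```

Under these assumptions TTS alone comes to about $0.12 per minute, and a 30-minute call at one frame per second sends 1,800 frames to the vision model versus 360 at one frame every 5 s.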
Already running a video AI agent and the per-minute cost feels wrong?
We will model the closed / hybrid / self-hosted variants on your real traffic and tell you which moves margin in 5 working days.
When to send video frames into the agent (and when not to)
Vision is what makes a video AI agent more than a voice agent. It is also the single biggest cost and latency multiplier. Three rules from production work.
1. Send frames sparsely. Most use cases need 1 frame per 2–5 s, not per second. Tie capture to changes (gesture, screen-share switch, document held up) rather than wall-clock time.
2. Pre-process before the LLM. Use a fast vision-only model (CLIP, MoonDream, MiniCPM-V) to cheaply gate which frames make it to the expensive multimodal LLM. Drops 80–90 % of vision-token spend.
3. Use a separate worker queue. Vision should not block the audio reasoning loop. Run it asynchronously and feed the result into the next LLM turn as context, not into the current one.
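Rule 1 can be sketched as a gate that forwards a frame only when enough time has passed and the picture has actually changed. This is a toy stand-in: frames are flat lists of grayscale values, `FrameGate` and both thresholds are hypothetical, and a production gate would more likely use a small vision model as described in rule 2.

```python
from typing import Optional, Sequence

Frame = Sequence[int]  # grayscale pixel values 0..255, a toy stand-in for an image


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class FrameGate:
    """Forward a frame only if it is both late enough and different enough."""

    def __init__(self, min_interval_s: float = 2.0, change_threshold: float = 10.0):
        self.min_interval_s = min_interval_s      # ASSUMPTION: illustrative value
        self.change_threshold = change_threshold  # ASSUMPTION: illustrative value
        self._last_frame: Optional[Frame] = None
        self._last_sent_at: Optional[float] = None

    def should_send(self, frame: Frame, timestamp_s: float) -> bool:
        if self._last_frame is None or self._last_sent_at is None:
            send = True  # always send the first frame
        else:
            waited = timestamp_s - self._last_sent_at >= self.min_interval_s
            changed = mean_abs_diff(frame, self._last_frame) >= self.change_threshold
            send = waited and changed
        if send:
            self._last_frame = frame
            self._last_sent_at = timestamp_s
        return send


gate = FrameGate()
static, moved = [0] * 16, [50] * 16
stream = [(static, 0.0), (static, 1.0), (moved, 1.5), (static, 3.0), (moved, 3.5), (moved, 6.0)]
decisions = [gate.should_send(frame, t) for frame, t in stream]
print(decisions)  # [True, False, False, False, True, False]
```

Only two of the six frames reach the expensive model here: the first one, and the first changed frame after the minimum interval.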
Compliance, recording and consent
Video AI agents touch every sensitive data law that exists. The four boxes you must tick before launch:
Consent. Explicit, recorded, language-appropriate consent before AI listens, records or speaks. EU GDPR plus US two-party-consent states each have different rules; bake the consent UX into the join flow.
Data residency. If your buyers are in regulated industries, the LLM and ASR endpoints must run in a region they accept. That is the strongest argument for self-hosted Llama / Whisper, often more important than cost.
Recording and retention. Decide what is recorded (audio, video, transcripts, agent reasoning), where it is stored, for how long, and who can access it. Default conservative; expand only with use cases that demand it.
HIPAA / SOC 2 / GDPR. Closed APIs offer BAAs and DPAs, but coverage varies. Self-hosted gives you full control but you carry the certification. Plan compliance before architecture.
Five pitfalls that derail video AI agent projects
1. Optimising the wrong latency. LLM time-to-first-token is the obsession of most teams; in practice the largest swing is turn detection plus jitter buffer. Profile end-to-end before you optimise individual stages.
2. No real eval set. “It feels good” is not a metric. Build a 50–200 graded conversation eval before you ship; gate every model swap on it.
3. Forgetting the human handoff. Every agent eventually needs to pass to a person. The handoff UX (who, when, with what context) is more important than the agent quality.
4. Vision frames flooding the LLM. Frames per second × vision tokens per frame is a number that should be on every dashboard. Treat it like CDN egress.
5. Hallucinations on tool calls. The model invents a function, an order ID, a calendar slot. Use strict JSON schemas, parse the call, and refuse anything that does not match.
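The "parse the call and refuse anything that does not match" advice in pitfall 5 can be sketched with the standard library alone. The tool registry, argument names and error strings below are hypothetical; a real build would more likely validate against full JSON Schema definitions.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical registry: tool name -> {argument name: required Python type}.
TOOLS: Dict[str, Dict[str, type]] = {
    "look_up_order": {"order_id": str},
    "book_appointment": {"customer_id": str, "slot_iso": str},
}


def parse_tool_call(raw: str) -> Tuple[Optional[dict], str]:
    """Return (call, "") when the call is valid, else (None, reason)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(call, dict):
        return None, "call must be a JSON object"
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS:
        return None, f"unknown tool: {name!r}"
    if not isinstance(args, dict):
        return None, "arguments must be an object"
    schema = TOOLS[name]
    if set(args) != set(schema):
        return None, "unexpected or missing arguments"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return None, f"argument {key!r} must be {expected.__name__}"
    return call, ""


ok, _ = parse_tool_call('{"name": "look_up_order", "arguments": {"order_id": "A-1"}}')
_, why = parse_tool_call('{"name": "cancel_everything", "arguments": {}}')
print(ok is not None, why)  # True unknown tool: 'cancel_everything'
```

This only checks the shape of the call. Whether an order ID or calendar slot actually exists still has to be verified by the tool itself before the agent speaks the result.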
A decision framework — pick your stack in five questions
Q1. Below 50k minutes / month? Closed APIs across the board. Time-to-market beats unit economics.
Q2. HIPAA / sovereign cloud? Self-hosted Llama on vLLM in your VPC; Whisper Large v3 in the same cluster.
Q3. Above 100k minutes / month and quality-tolerant of an open model? Hybrid — LiveKit Cloud, Whisper, Llama 3.3 70B on vLLM, ElevenLabs or Cartesia.
Q4. Need vision (gestures, document capture, screen-share parsing)? Add a frame-gating step (CLIP / MiniCPM-V) before the multimodal LLM.
Q5. Hard latency budget <800 ms? Cartesia Sonic or local XTTS for TTS; GPT-4o-mini or Llama 3.3 70B on H100 with prompt cache for the LLM; co-locate transport, ASR and LLM in one region.
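The five questions can be written down as a small rule function, which makes the gaps explicit. This is one possible encoding, not the definitive one: the `Requirements` fields are hypothetical, and letting compliance take precedence over call volume is an assumption the questions themselves do not settle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    minutes_per_month: int
    needs_data_sovereignty: bool = False  # HIPAA / sovereign cloud (Q2)
    open_model_ok: bool = False           # quality-tolerant of an open model (Q3)
    needs_vision: bool = False            # Q4
    hard_latency_under_800ms: bool = False  # Q5


def recommend(req: Requirements) -> List[str]:
    notes: List[str] = []
    # ASSUMPTION: compliance wins over volume when both apply.
    if req.needs_data_sovereignty:
        notes.append("self-hosted: Llama on vLLM + Whisper Large v3 in your VPC")
    elif req.minutes_per_month < 50_000:
        notes.append("closed APIs across the board")
    elif req.minutes_per_month > 100_000 and req.open_model_ok:
        notes.append("hybrid: LiveKit Cloud + Whisper + Llama 3.3 70B on vLLM")
    else:
        notes.append("no rule matched: model the variants on real traffic")
    if req.needs_vision:
        notes.append("add frame gating before the multimodal LLM")
    if req.hard_latency_under_800ms:
        notes.append("low-latency TTS and LLM; co-locate transport, ASR and LLM")
    return notes


print(recommend(Requirements(10_000)))
print(recommend(Requirements(200_000, open_model_ok=True, needs_vision=True)))
```

Writing it this way shows that the range between 50k and 100k minutes per month has no stated default, which is where modelling the variants on real traffic earns its keep.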
KPIs to track once you ship
Quality KPIs. Eval-set pass rate, hallucination rate (sampled human review), tool-call success rate, escalation precision, captions / transcript word-error rate.
Business KPIs. Cost per call-minute, cost per resolved ticket / generated summary, conversion lift versus a non-AI baseline, retention of users who interact with the agent.
Reliability KPIs. P50 / P95 / P99 loop latency, agent join success, mid-call reconnect success, vendor-fallback hit rate, vision frame queue depth.
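Percentile latency is worth computing the same way everywhere, because the tail is where conversations break. A minimal sketch using the nearest-rank definition; the sample values are made up for illustration and `percentile` is a hypothetical helper rather than a call from any monitoring library.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative loop latencies in milliseconds, not real measurements.
loop_latency_ms = [620, 640, 700, 710, 750, 780, 820, 900, 1100, 1600]

p50 = percentile(loop_latency_ms, 50)
p95 = percentile(loop_latency_ms, 95)
p99 = percentile(loop_latency_ms, 99)
print(p50, p95, p99)  # 750 1600 1600
```

In this made-up sample the median sits under the 800 ms target while P95 is already past 1.5 s, which is the pattern an average would hide.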
If you remember nothing else: latency is product, eval is spec, vision is a cost trap, and TTS is the silent budget killer. Get those four right and the rest of the stack falls into place.
Mini case — sales call copilot on Meetric
Situation. Meetric needed a real-time sales-call copilot that produces post-call summaries, surfaces action items live on the rep’s screen, and writes back to the CRM, all without leaking customer data to a closed API.
Plan. LiveKit Cloud for transport, Whisper Large v3 (HF) for transcription, Llama 3.3 70B Instruct on vLLM in the customer’s EU AWS account for the LLM, BGE embeddings + Qdrant for retrieval over the customer knowledge base. Eval set of 200 graded summaries built in three days with the customer’s sales lead.
Outcome. 92 % of summaries rated “publish-ready as is”, ~$0.06 / summary versus ~$0.40 on a closed API at the same quality, zero customer call data leaves the buyer’s VPC. Want a similar buildout? Book a scoping call.
When you should not build a video AI agent
Skip the build if (a) your call volume is below 5,000 minutes / month and the marginal value per call is <$1; (b) the regulatory or consent friction is higher than the productivity win (some legal and judicial workflows); (c) you cannot define an eval set the agent has to clear — if you cannot grade it, you cannot ship it.
Conversely, if you have a measurable cost-of-call (sales reps, clinicians, support agents) and consent is straightforward, the payback math is one of the cleanest in 2026 software.
Frequently asked questions
What is a video AI agent?
Software that joins a real-time video session, perceives audio and (optionally) video, reasons with an LLM, and acts — speaking, summarising, calling tools or escalating to a human. Real-time means the loop closes in <1 second.
What latency should a video AI agent achieve?
Loop latency below 800 ms is where the agent starts to feel human. Tuned production stacks land at 600–1,200 ms; above 1.5 s users feel the awkwardness.
How much does a video AI agent cost per minute in 2026?
$0.30–$1.50 / minute on naive closed-API stacks, $0.10–$0.25 on hybrid stacks (LiveKit Cloud + GPT-4o-mini + Cartesia), $0.04–$0.10 on tuned self-hosted stacks (Llama on vLLM, Whisper, XTTS).
Which orchestration framework should I pick?
LiveKit Agents is the strongest open framework in 2026 and the foundation of products like ChatGPT Voice. Daily Bots and Pipecat are credible alternatives. Vapi and Retell are good for outbound voice-only agents; for video, pick LiveKit or Daily.
Do I need vision in the agent or is voice enough?
Voice is enough for sales summaries, support and most contact-centre use cases. Vision matters when the user shares a document, performs a gesture, demonstrates a problem with their environment (telehealth, remote field support) or screen-shares software. Add it deliberately, with frame gating.
Can a video AI agent be HIPAA compliant?
Yes — via self-hosted Llama or Qwen on vLLM in your VPC, Whisper Large v3 on the same cluster, and a HIPAA-eligible transport (LiveKit self-hosted, Daily, Vonage with BAA). Closed APIs may also work where BAAs are explicit.
How long does a production build take?
A useful prototype takes 2–4 weeks. A production build with eval set, observability, fallback paths and compliance review is 8–14 weeks. Fora Soft typically ships 30–50 % faster using Agent Engineering on the boilerplate-heavy parts.
Does Fora Soft build video AI agents?
Yes. We have shipped video AI features on Meetric, TransLinguist, VOLO and other live products. We typically scope a video AI agent in 30 minutes and deliver a fixed-scope prototype in 4–6 weeks. Book a call.
Ready to scope a video AI agent for your product?
A 30-minute call, a written architecture and unit-economics plan within 5 working days, and a fixed-scope prototype quote.
Treat the eval set as the product spec. Anything you cannot grade, you cannot ship. Anything you can grade, you can iterate against in days.
The 2026 tooling ecosystem at a glance
A lot of names move quickly in this space. The shortlist worth tracking:
Agent frameworks. LiveKit Agents (Python, Node, Go), Pipecat (Python), Daily Bots, Vapi, Retell, smolagents, OpenAI Realtime API.
Inference servers. vLLM (production default), SGLang (RAG-heavy), TensorRT-LLM (peak NVIDIA throughput), llama.cpp (CPU / edge).
ASR. Whisper Large v3, Deepgram Nova-3, AssemblyAI Streaming, Speechmatics, NVIDIA Parakeet.
TTS. ElevenLabs, Cartesia Sonic, OpenAI TTS, Deepgram Aura, Azure Neural, Coqui XTTS v2.
Observability. LangSmith, Langfuse, Helicone, OpenTelemetry traces, Grafana Loki for logs.
What to read next
Voice AI
LiveKit voice AI agents in 2026: the engineer’s playbook
The voice-only sibling of this guide; same architecture, simpler stack.
Multimodal
2026 LiveKit multimodal agents guide: voice, vision & production
A deeper architectural reference for production multimodal agents.
AI APIs
AI call assistants — a practical guide to third-party APIs
When the stack above is overkill and a turnkey API moves faster.
Open-source AI
Hugging Face for business in 2026
The Hub, libraries and managed compute behind every self-hosted agent.
Ready to ship a video AI agent?
A video AI agent in 2026 is no longer a research project. The architecture is settled (transport + ASR + LLM + TTS + orchestration + observability), the latency budget is achievable in production (600–1,200 ms), and the unit economics are workable on a hybrid open / closed stack ($0.10–$0.25 / minute).
The right move depends on your call volume, compliance and quality bar. Pick the closed stack to validate, the hybrid to scale, and the self-hosted path when volume or compliance demands it. Our AI integration practice ships exactly this loop end to end.

