Published 2026-06-01 · 16 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you are a product manager, founder, or engineering lead deciding whether to add live captions, a virtual background, a voice agent, or an in-call copilot to your video product, the single question that decides feasibility is latency. A feature that feels magical at 90 milliseconds feels broken at 600. The trouble is that latency is additive and unforgiving: every stage spends part of one shared budget, and once it is gone there is no recovering it downstream. This article hands you the budget, the per-stage costs, and the decision rules so you can have a precise conversation with your engineers instead of discovering the problem in user testing.
What "real-time" actually means to a human
Engineers throw the phrase "real-time AI" around as if it were one thing. It is not. The term describes any system that produces a result while the input is still arriving, fast enough that the result is useful in the moment. The hard part is the phrase "in the moment," because the moment is defined by human perception, not by your server logs.
Three numbers from the research on human perception set the boundaries, and they are worth memorizing.
The first is the conversational turn-taking gap. When two people talk, the typical silence between one person finishing and the next person starting is about 200 milliseconds, and a gap shorter than roughly 120 milliseconds is not perceived as a gap at all. A pause of 200 milliseconds is the heartbeat of natural conversation. If your AI voice agent answers faster than that, it feels eager; if it answers a full second late, the human starts talking again, and you get the awkward collision everyone has experienced on a bad call.
The second is the response-time ladder from interface research. A system response under 0.1 second — 100 milliseconds — feels instantaneous, as if the user caused it directly. A response around one second keeps the user in flow but is noticeable. Ten seconds is the limit of held attention. There is also the Doherty threshold: when a system answers in under 400 milliseconds, people work measurably faster and engage more, because they stop waiting on the machine.
The third is audio-to-video synchronization, often called lip-sync. The eye and ear are remarkably picky about whether a voice matches a moving mouth. The standard from the broadcast world, Rec. ITU-R BT.1359-1, found that the error becomes detectable when audio leads the picture by about 45 milliseconds or lags it by about 125 milliseconds, and it sets an end-to-end tolerance of roughly +90 milliseconds (audio early) to −185 milliseconds (audio late). The European Broadcasting Union's Recommendation R37 is stricter, asking for +40 to −60 milliseconds across the whole chain. The practical lesson: any AI stage that touches audio or video must preserve sync, not just speed.
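The sync tolerances above amount to a range check on the audio/video offset. The sketch below is illustrative only: the function name `av_sync_ok`, the dictionary keys, and the sign convention (positive means audio arrives early) are assumptions made for this example, while the limits are the figures quoted in the text.

```python
# Tolerance windows quoted above, in milliseconds, as (late_limit, early_limit).
# Sign convention (an assumption of this sketch): positive = audio early, negative = audio late.
SYNC_WINDOWS_MS = {
    "bt1359_detectability": (-125.0, 45.0),
    "bt1359_end_to_end": (-185.0, 90.0),
    "ebu_r37": (-60.0, 40.0),
}


def av_sync_ok(audio_offset_ms: float, window: str = "ebu_r37") -> bool:
    """Return True if the audio/video offset falls inside the named window."""
    late_limit, early_limit = SYNC_WINDOWS_MS[window]
    return late_limit <= audio_offset_ms <= early_limit


# Hypothetical case: an AI video filter delays the picture by 80 ms while the
# audio passes through untouched, so the audio now leads the picture by 80 ms.
offset_ms = 80.0
print(av_sync_ok(offset_ms, "ebu_r37"))            # False
print(av_sync_ok(offset_ms, "bt1359_end_to_end"))  # True
```

In a case like this, one common remedy is to delay the audio by a matching amount, which is the sense in which an AI stage has to preserve sync and not only speed.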
Put these together and a working rule emerges. For anything conversational or interactive, aim to finish the entire loop in under 200 milliseconds, and treat 100 milliseconds as the budget you design to so you have headroom when the network has a bad minute. That is where the "sub-100ms" framing comes from. It is not marketing; it is the gap below which a feature stops feeling like software and starts feeling like a reflex.
Figure 1. The human perception thresholds that define the real-time budget. Everything to the left of 200 ms feels live; everything past 400 ms feels like waiting.
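For planning conversations it can help to turn these thresholds into a lookup. The following is a minimal sketch, not a perceptual model: the function name `perceived_feel` and the band labels are assumptions chosen for this example, and the cut-offs are the 100, 200, 400, 1,000, and 10,000 millisecond figures discussed above.

```python
def perceived_feel(latency_ms: float) -> str:
    """Map an end-to-end latency to the perception bands described above."""
    if latency_ms < 100:
        return "instantaneous"
    if latency_ms < 200:
        return "live"
    if latency_ms < 400:
        return "responsive (under the Doherty threshold)"
    if latency_ms <= 1000:
        return "noticeable, flow preserved"
    if latency_ms <= 10000:
        return "waiting, attention held"
    return "attention lost"


# The 90 ms and 600 ms cases are the "magical" and "broken" examples from the introduction.
for ms in (90, 150, 350, 600, 12000):
    print(ms, "ms ->", perceived_feel(ms))
```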
The voice standard that has governed this for 25 years
Before AI ever entered the picture, the telephone industry had already measured exactly how much delay a conversation tolerates. The relevant standard is Rec. ITU-T G.114, "One-way transmission time," and its conclusion is the most useful single fact in this entire article.
G.114 measures the one-way path from one person's mouth to the other person's ear — engineers call it mouth-to-ear delay — and it splits the result into three zones. Below 150 milliseconds, essentially all applications experience transparent interactivity; the delay is there but nobody notices it. Between 150 and 400 milliseconds, the conversation still works but degrades, with people increasingly stepping on each other's sentences. Above 400 milliseconds, the standard calls the delay unacceptable for general planning.
Here is why this matters for AI. When you insert an AI stage (a noise suppressor, a translator, a voice agent) into a live call, you are spending part of that same mouth-to-ear budget. The AI does not get its own separate clock. If the network already costs you 120 milliseconds round-trip, which is roughly 60 milliseconds one way, and your model adds 200, the one-way delay is already near 260 milliseconds before codecs and buffers are counted. You have blown past the 150-millisecond transparency line and are deep into the zone where the call feels laggy, no matter how good the model is. The budget is shared, fixed, and set by a standard older than most of the engineers reading this.
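The shared-budget point can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the function names are invented for this example, the one-way network delay is approximated as half the round trip (which assumes a symmetric path), and codec, jitter-buffer, and device delays are ignored.

```python
def g114_zone(one_way_ms: float) -> str:
    """Classify a one-way mouth-to-ear delay using the G.114 zones quoted above."""
    if one_way_ms <= 150:
        return "transparent"
    if one_way_ms <= 400:
        return "degraded"
    return "unacceptable"


def mouth_to_ear_ms(network_round_trip_ms: float, ai_stage_ms: float) -> float:
    """Approximate one-way delay as half the round trip plus the AI stage.

    Simplification for illustration: codec, jitter buffer, and device delays are ignored.
    """
    return network_round_trip_ms / 2 + ai_stage_ms


baseline = mouth_to_ear_ms(120, 0)
with_model = mouth_to_ear_ms(120, 200)
print(baseline, g114_zone(baseline))      # 60.0 transparent
print(with_model, g114_zone(with_model))  # 260.0 degraded
```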
The budget is a sum, and you do not control all of it
The reason latency planning feels hard is that the total is the sum of many stages, several of which you cannot change. The cleanest way to think about a real-time AI feature is a waterfall: the input enters at the top, and each stage subtracts time from the budget until the result reaches the human at the bottom.
For a video call with an AI stage inserted, the stages are roughly these.
Capture is the time for the camera and microphone to turn light and sound into digital frames. It is small, often a few milliseconds, but it is not zero.
Encoding compresses the raw frames so they fit through the network. A modern hardware H.264 or VP9 encoder adds on the order of 10 milliseconds, sometimes less. The audio codec adds its own delay: Opus, the standard WebRTC audio codec defined in IETF RFC 6716, has a default algorithmic delay of 26.5 milliseconds at its usual 20-millisecond frame size, and can be pushed down to about 5 milliseconds or up to 65 milliseconds depending on configuration.
Network is usually the largest and least controllable stage, and it has a floor set by physics that we will treat separately below. On top of propagation, real systems add relay and routing: a TURN relay typically adds 10 to 30 milliseconds, and a Selective Forwarding Unit — the server that fans your video out to other participants — adds another 5 to 20 milliseconds.
The jitter buffer is the receiver's shock absorber. Packets arrive at uneven intervals, so the player deliberately holds a small reserve of audio to smooth playback. Real-time configurations keep this tight, 10 to 50 milliseconds, but on a poor connection an adaptive buffer can grow to 120 milliseconds or more, trading latency for stability.
Decoding reverses the compression, costing roughly another 10 milliseconds on hardware, and rendering to the screen adds 8 to 16 milliseconds because displays refresh on their own schedule, usually 60 times a second.
Then comes the new stage: AI inference. This is the one you are adding, and it is where your design choices land. Everything else on this list is the cost of simply having a video call. Your AI feature has to fit in whatever the other stages leave behind.
Figure 2. The glass-to-glass waterfall. The fixed stages (gray and blue) consume most of the budget; the AI stage (purple) is the part you design, and it has to fit the remainder.
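The waterfall can be sketched as a sum over stage ranges. The example below is a planning aid under stated assumptions, not a measurement: the `Stage` type and `remaining_for_ai` function are hypothetical names, the capture range of 3 to 10 milliseconds stands in for "a few milliseconds", TURN is left out on the assumption of a direct path through one SFU, and audio and video costs are summed serially for simplicity, which is conservative because the two paths run in parallel in a real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    low_ms: float
    high_ms: float


# Planning ranges taken from the stage descriptions above.
FIXED_STAGES = [
    Stage("capture", 3, 10),
    Stage("video encode", 10, 10),
    Stage("audio encode (Opus default)", 26.5, 26.5),
    Stage("SFU forwarding", 5, 20),
    Stage("jitter buffer", 10, 50),
    Stage("video decode", 10, 10),
    Stage("render", 8, 16),
]


def remaining_for_ai(budget_ms, network_ms, stages=FIXED_STAGES):
    """Return (best_case, worst_case) milliseconds left for AI inference."""
    low_total = network_ms + sum(s.low_ms for s in stages)
    high_total = network_ms + sum(s.high_ms for s in stages)
    return budget_ms - low_total, budget_ms - high_total


best, worst = remaining_for_ai(budget_ms=200, network_ms=60)
print(f"best case: {best} ms left, worst case: {worst} ms left")
```

A negative worst case means the fixed stages alone can exhaust a 200-millisecond budget on a bad day, before the AI stage has run at all.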
A worked example: can we add a live translator?
Numbers make this concrete. Suppose two people are on a WebRTC call within the same continent, and you want to add live speech-to-speech translation. Let us add up the fixed stages first, then see what is left for the AI.
Start with the round-trip network on a good regional connection: call it 60 milliseconds. The voice path through the codec and jitter buffer on each side: Opus at 26.5 milliseconds plus a 40-millisecond jitter buffer, roughly 66 milliseconds. Capture, encode, decode, and render together: about 35 milliseconds. So before any AI, the fixed pipeline costs:
60 (network round-trip)
+ 66 (Opus codec + jitter buffer)
+ 35 (capture + encode + decode + render)
= 161 ms of fixed, unavoidable latency
Against a 200-millisecond conversational target, that leaves 39 milliseconds for the AI. A speech-to-speech model cannot translate in 39 milliseconds; the fastest production voice-to-voice systems, such as the OpenAI Realtime API, return their first audio in roughly 300 to 500 milliseconds, and that figure does not yet include whatever extra time high-quality transcription and translation require. The arithmetic tells you the answer immediately: full live translation cannot ride inside the conversational budget today. The honest engineering move is to run it as a near-real-time overlay (translated captions a beat behind, or a consecutive interpretation that accepts a deliberate pause) rather than promising instant dubbing the physics will not allow.
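The same arithmetic, written out so the numbers can be swapped for your own. This is a small sketch using the figures from the worked example; the variable names and the `fits` helper are assumptions made for illustration.

```python
CONVERSATIONAL_TARGET_MS = 200

# Fixed pipeline costs from the worked example above.
fixed_pipeline_ms = {
    "network round-trip": 60,
    "Opus codec + jitter buffer": 66,
    "capture + encode + decode + render": 35,
}

fixed_total = sum(fixed_pipeline_ms.values())
ai_budget = CONVERSATIONAL_TARGET_MS - fixed_total


def fits(ai_latency_ms: float) -> bool:
    """True if an AI stage of this latency fits in what the fixed pipeline leaves."""
    return ai_latency_ms <= ai_budget


print(f"fixed pipeline: {fixed_total} ms, left for AI: {ai_budget} ms")
for name, latency in [("voice-to-voice, low end", 300), ("voice-to-voice, high end", 500)]:
    overshoot = fixed_total + latency - CONVERSATIONAL_TARGET_MS
    print(f"{name}: fits={fits(latency)}, overshoot={overshoot} ms")
```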
Now flip the example to a feature that does fit. A virtual background using on-device segmentation runs locally, before encoding, so it spends no network time at all. MediaPipe's Selfie Segmentation model, running on the device's GPU in a browser, finishes a frame in under 3 milliseconds — on a recent phone, under 1 millisecond. At 30 frames per second you have 33 milliseconds per frame to work with, so a 3-millisecond model leaves the frame pipeline comfortably intact. This is why background blur shipped years before live translation did: one fits the budget by an order of magnitude, and the other does not.
Physics sets a floor you cannot negotiate
The most common and most expensive mistake in real-time AI is forgetting that the network has a hard floor set by the speed of light. Light in an optical fiber travels at about two-thirds of its speed in a vacuum — roughly 200,000 kilometers per second — because glass slows it down. Engineers use a rule of thumb: about 4.9 microseconds of one-way delay per kilometer of fiber, which works out to roughly 10 milliseconds of round-trip delay for every 1,000 kilometers.
Plug in real distances and the consequence is stark. A user in London talking to a GPU in Northern Virginia is about 5,900 kilometers away, so the round trip through fiber alone is around 60 milliseconds before a single router, queue, or model touches the packet. Real networks never hit the theoretical floor; routing detours, switching, and congestion typically double it. So if your AI inference lives in one cloud region and your users are scattered across continents, you are paying 100 to 200 milliseconds in network tax before the model even starts — and that tax alone can exceed your entire conversational budget.
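The rule of thumb is easy to turn into a calculator. The sketch below assumes only the 4.9 microseconds per kilometer figure and the "typically double it" overhead quoted above; the function names and the `overhead_factor` parameter are hypothetical, and a real route can be better or worse than the factor suggests.

```python
FIBER_DELAY_US_PER_KM = 4.9  # one-way rule of thumb quoted above


def fiber_round_trip_ms(distance_km: float) -> float:
    """Theoretical round-trip floor through fiber, in milliseconds."""
    return 2 * distance_km * FIBER_DELAY_US_PER_KM / 1000


def realistic_round_trip_ms(distance_km: float, overhead_factor: float = 2.0) -> float:
    """Floor scaled by an overhead factor; 2.0 mirrors the 'typically double it' estimate."""
    return fiber_round_trip_ms(distance_km) * overhead_factor


london_to_virginia_km = 5900
print(round(fiber_round_trip_ms(london_to_virginia_km), 1))      # 57.8
print(round(realistic_round_trip_ms(london_to_virginia_km), 1))  # 115.6
```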
This is the single fact that reframes the whole problem. Latency is not mainly a model-speed problem; it is a geography problem. You cannot make light faster. You can only move the computation closer to the human, which is what the next section is about.
Common pitfall: teams benchmark an AI feature on a laptop sitting next to the server, see 40 milliseconds, and ship it. In production the users are 4,000 kilometers from that server, the network alone adds roughly 80 milliseconds of round trip (a 40-millisecond fiber floor, doubled by real-world routing), and the same feature now takes about 120 milliseconds and feels broken. Always benchmark with the network distance your real users will have, not the distance in your office.
Where the AI runs decides whether you have a budget at all
Because geography dominates, the most important architectural decision for a real-time AI feature is where the model executes. There are three options, and they trade latency against cost, privacy, and model size.
On-device means the model runs in the browser or app on the user's own machine. There is zero network latency for the AI itself, which is why segmentation, beauty filters, and lightweight noise suppression run here. The cost is that the device's compute and battery limit how large a model you can run.
Edge means the model runs in a data center physically near the user — the same city or region. You pay one short network hop, often under 20 milliseconds round-trip, and in exchange you can run a much larger model than a phone can hold. This is the sweet spot for medium models that are too heavy for the device but must stay live.
Cloud means the model runs in a central region, possibly an ocean away. You get the biggest models and the simplest operations, but you inherit the full speed-of-light tax. Cloud is the right home for anything that does not have to be instant: summaries after the call, analytics over a recording, a copilot that can take a second to think.
The decision rule is short. If the feature has to feel instant and the model is small, run it on-device. If it must be live but the model is too big for the device, run it at the edge. If a one-to-three-second delay is acceptable, run it in the cloud and stop fighting physics. The mistake is putting an interactive feature in a distant cloud region and then trying to optimize the model to recover latency that the network already spent.
Figure 3. The same model has three possible homes. The network cost rises as you move away from the user; the model size you can afford rises too. Real-time features live on the left.
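The decision rule is short enough to write as a function. This is a sketch of the rule as stated, not a sizing tool: `choose_placement` and its parameters are hypothetical names, and the 1,000-millisecond cutoff is an assumption standing in for "a one-to-three-second delay is acceptable".

```python
def choose_placement(max_acceptable_delay_ms: float, fits_on_device: bool) -> str:
    """Apply the decision rule above.

    The 1000 ms cutoff is an assumption for this sketch, standing in for
    "a one-to-three-second delay is acceptable".
    """
    if max_acceptable_delay_ms >= 1000:
        return "cloud"      # not latency-critical: stop fighting physics
    if fits_on_device:
        return "on-device"  # must feel instant and the model is small
    return "edge"           # must be live but the model is too big for the device


# Hypothetical features and tolerances, for illustration only.
print(choose_placement(33, fits_on_device=True))     # on-device (background blur)
print(choose_placement(300, fits_on_device=False))   # edge (live captions)
print(choose_placement(3000, fits_on_device=False))  # cloud (post-call summary)
```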
How AI gets inserted into a live video pipeline
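A reasonable question at this point is: mechanically, where does the AI stage even go? In a browser-based WebRTC product, the modern answer is the family of APIs loosely known as Insertable Streams, which offers two distinct insertion points. MediaStreamTrack Insertable Media Processing (the MediaStreamTrackProcessor interface and its track-generator counterpart) exposes raw frames, between the camera and the encoder or between the decoder and the renderer. This is where a model that needs pixels or audio samples, such as segmentation or noise suppression, lives. The WebRTC Encoded Transform API exposes encoded frames instead, between the encoder and the packetizer on the sending side and between the depacketizer and the decoder on the receiving side, which suits work such as end-to-end encryption or attaching metadata rather than pixel-level inference. Either way, your code runs a function on each frame, and that function is where your AI model lives.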
The core WebRTC platform itself is a finished standard: WebRTC 1.0 first became a full W3C Recommendation in January 2021, the current edition of that Recommendation is dated 13 March 2025, and the underlying transport is specified by a family of IETF documents including RFC 8825, RFC 8826, and RFC 8834. The frame-processing extensions (Encoded Transform and the related WebCodecs API, which was still a W3C Working Draft as of April 2026) are newer and evolve faster, so check current browser support before committing. The deep protocol mechanics of how packets are negotiated and transported (SDP, ICE, STUN, and TURN) belong to the streaming and delivery layer and are worth a separate read; here the point is only that the insertion point exists and is standardized.
The practical implication for budgeting is that the AI function runs synchronously inside the frame loop. If it takes longer than the gap between frames — 33 milliseconds at 30 frames per second — it does not just add latency, it drops frames. So an on-device real-time model has two budgets to meet at once: the overall 200-millisecond loop, and the per-frame deadline of 33 milliseconds. A model that averages 30 milliseconds but occasionally spikes to 50 will stutter visibly, which is why consistency matters as much as the average.
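The difference between average and worst-case inference time shows up clearly when you count missed deadlines. The sketch below uses made-up timings chosen to illustrate the point; `missed_frames` is a hypothetical helper, and in practice you would feed it timings measured on your target devices.

```python
FPS = 30
FRAME_DEADLINE_MS = 1000 / FPS  # about 33.3 ms per frame


def missed_frames(inference_times_ms):
    """Count inferences that overran the per-frame deadline."""
    return sum(1 for t in inference_times_ms if t > FRAME_DEADLINE_MS)


# Hypothetical timings: mostly 28 ms, with two 50 ms spikes.
timings = [28, 28, 50, 28, 28, 28, 50, 28, 28, 28]
average = sum(timings) / len(timings)

print(f"average {average:.1f} ms, deadline {FRAME_DEADLINE_MS:.1f} ms, "
      f"missed {missed_frames(timings)} of {len(timings)} frames")
```

Here the average sits under the deadline, yet two frames in ten are dropped, so a high percentile such as p99 is usually a better number to track than the mean.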
The components, with their real numbers
For quick reference, here is what each stage of a real-time AI video loop typically costs in 2026, drawn from the standards and current vendor figures. Treat these as planning estimates; measure your own pipeline before you commit.
| Stage | Typical latency | Who controls it |
|---|---|---|
| Capture (camera + mic) | 3–10 ms | Hardware |
| Video encode (H.264/VP9, hardware) | ~10 ms | Codec choice |
| Audio encode (Opus, RFC 6716) | 26.5 ms default (5–65 ms range) | Codec config |
| Network propagation | ~10 ms per 1,000 km round-trip | Physics + geography |
| TURN relay | 10–30 ms | Topology |
| SFU forwarding | 5–20 ms | Topology |
| Jitter buffer | 10–50 ms (up to 120 ms on poor links) | Adaptive, network-driven |
| Video decode | ~10 ms | Codec choice |
| Render to display | 8–16 ms (at 60 Hz) | Hardware |
| On-device segmentation (MediaPipe) | <3 ms GPU / 90–120 ms CPU | Your design |
| Streaming ASR first words (Deepgram) | ~150 ms interim results; <300 ms streaming latency (Nova-3) | Your design |
| Streaming TTS first audio (ElevenLabs Flash, Cartesia Sonic) | ~75–90 ms | Your design |
| Voice-to-voice first audio (OpenAI Realtime) | ~300–500 ms | Your design |
The pattern in the bottom block is the whole story. On-device vision is cheap enough to run every frame. Streaming transcription and speech synthesis are fast enough to feel live if you keep them at the edge. Full conversational voice-to-voice is still too slow to hide inside a call and has to be designed as a near-real-time experience, not an instant one. Note also the order-of-magnitude penalty for running MediaPipe on CPU instead of GPU — the same model is 30 to 40 times slower without hardware acceleration, which is the difference between fitting the frame budget and shattering it.
Where Fora Soft fits in
We have built real-time video software since 2005 — video conferencing, WebRTC products, e-learning, telemedicine, and live surveillance — and latency budgeting is the conversation we have at the start of every one of those projects. The questions in this article are the questions we ask before estimating: where are the users, where will the model run, what is the fixed pipeline cost, and how much is left for the feature. In telemedicine we have learned that a captioning delay that is fine for a webinar is unacceptable for a doctor reading a patient's words in real time; in conferencing, that an on-device blur and a cloud summary belong in completely different parts of the architecture. The budget is not a detail we optimize at the end. It is the constraint we design around from the first diagram.
References
- Rec. ITU-T G.114 (05/2003), "One-way transmission time" — defines mouth-to-ear delay zones: 0–150 ms preferred/transparent, 150–400 ms acceptable with degradation, above 400 ms unacceptable. The controlling standard for conversational latency. https://www.itu.int/rec/T-REC-G.114
- IETF RFC 6716 (September 2012), "Definition of the Opus Audio Codec" — Opus default algorithmic delay 26.5 ms at 20 ms frame size; configurable 5–65.2 ms; standard WebRTC audio codec. https://www.rfc-editor.org/rfc/rfc6716
- Rec. ITU-R BT.1359-1, "Relative timing of sound and vision for broadcasting" — lip-sync detectability thresholds (audio +45 ms lead / −125 ms lag) and end-to-end tolerance (+90 / −185 ms). https://www.itu.int/rec/R-REC-BT.1359
- W3C, "WebRTC: Real-Time Communication in Browsers" (Recommendation, 13 March 2025) — the finished core WebRTC platform standard. https://www.w3.org/TR/webrtc/
- W3C, "WebRTC Encoded Transform" and "WebCodecs" (Working Draft, April 2026) — the standardized insertion points for running AI on media frames in the browser; status still evolving, verify browser support. https://www.w3.org/TR/webrtc-encoded-transform/
- IETF RFC 8825/8826/8834 (January 2021) — WebRTC overview, security considerations, and media transport / use of RTP; the transport layer beneath the W3C API. https://www.rfc-editor.org/rfc/rfc8834
- Stivers et al., "Universals and cultural variation in turn-taking in conversation," PNAS (2009); Levinson & Torreira, Frontiers in Psychology (2015) — conversational turn gap ~200 ms; gaps under ~120 ms not perceived. https://www.pnas.org/doi/10.1073/pnas.0903616106
- Nielsen, "Response Times: The 3 Important Limits"; Doherty & Thadhani, "The Economic Value of Rapid Response Time" (IBM, 1982) — 0.1 s instantaneous / 1 s flow / 10 s attention; sub-400 ms Doherty productivity threshold. https://www.nngroup.com/articles/response-times-3-important-limits/
- TeleGeography / "the 5-microsecond rule" — fiber propagation ~4.9 µs/km one-way, ~10 ms round-trip per 1,000 km; light in fiber ≈ 200,000 km/s. https://blog.telegeography.com/its-time-to-learn-about-latency
- Deepgram streaming documentation and 2026 benchmarks — Nova-3 sub-300 ms streaming, interim results ~150 ms; Flux end-of-speech detection (May 2026). https://developers.deepgram.com/docs/measuring-streaming-latency
- Gradium TTS Latency Benchmark 2026; Cartesia and ElevenLabs documentation — ElevenLabs Flash v2.5 ~75 ms model inference; Cartesia Sonic 3.5 ~75–90 ms first audio over WebSocket; Sonic Turbo sub-40 ms. https://gradium.ai/content/tts-latency-benchmark-2026
- Google MediaPipe Selfie Segmentation model card and benchmarks — <3 ms GPU web inference, ~0.7 ms on a recent flagship phone, 90–120 ms on CPU; MobileNetV3, 256×256, ~106K parameters. https://ai.google.dev/edge/mediapipe/solutions/vision/image_segmenter
- GetStream and rtcleague WebRTC latency analyses (2026) — glass-to-glass typically 150–300 ms; jitter buffer 10–50 ms real-time; TURN 10–30 ms; SFU 5–20 ms; H.264 encode/decode ~10 ms each; OpenAI Realtime first audio ~300–500 ms. https://getstream.io/blog/low-latency-video-streaming/


