Published 2026-06-02 · 31 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

This article is for the product manager scoping an "AI video calling" feature who needs to know what is genuinely one project versus ten, the founder deciding whether to build on an SDK or assemble open-source parts, and the engineer who has read the individual Phase 6 lessons and now wants to see them connected into one deployable product. It assumes you have read the earlier lessons it links to, because this capstone wires them together rather than re-deriving each one. By the end you will be able to draw the architecture of a modern AI-enhanced call on a whiteboard, place any new feature in the right layer on the first try, put defensible latency and cost numbers on the result, and sequence the build so the first useful version ships in weeks, not quarters.

What "AI-Enhanced Video Call" Actually Means

Before any architecture, fix what we are building. A plain video call moves sound and pictures between people in real time. An AI-enhanced video call does the same thing, then adds a set of features that each lean on a machine-learning model: it blurs or replaces your background, strips the dog barking out of your microphone, prints live captions under each speaker, translates those captions into another language, flags a screenshare that leaks a credit-card number, and afterwards hands you a summary with action items. Each of those is a Phase 6 lesson on its own. The capstone is the act of running all of them at once, on the same call, without any one of them spoiling the experience of the others.

"Real-time" is the constraint that makes this hard, and it has a precise meaning here. A conversation feels natural only when the round trip — you speak, the other person hears you, they react — stays under roughly the time it takes to blink and think. Research on human conversation finds that people start to perceive a reply as unnaturally delayed somewhere past 300 to 500 milliseconds, and a practical ceiling for a voice-to-voice AI loop is about 800 milliseconds before the exchange feels broken. A millisecond, written ms, is one-thousandth of a second; 800 ms is just under one second. Every AI feature you bolt onto the call spends part of that budget. The whole engineering challenge of this capstone is adding intelligence without spending more time than the conversation can afford.

A useful image: a live call is like a relay race where the baton is the conversation. Every runner you add to the track — the denoiser, the captioner, the translator — has to hand the baton on before the next exchange, or the whole team falls behind. You cannot ask the conversation to slow down. Everything below is about fitting more runners onto the same track without dropping the baton.

The Three Planes: The One Mental Model That Organizes Everything

Almost every confused architecture we are asked to rescue comes from mixing up three jobs that should stay separate. Keep them separate and the system becomes easy to reason about. We call them the three planes.

The first is the media plane. This is the part that carries the actual audio and video — the bytes of sound and picture — from each person to everyone else. The media plane is the road the conversation travels on. In a modern call it is built on WebRTC, the browser standard for real-time audio and video that became an official W3C Recommendation in January 2021, paired with a server called an SFU that we explain in a moment. The media plane has the tightest time budget, because anything that sits in the path of live audio and video adds delay you can hear and see.

The second is the AI plane. This is every machine-learning model that adds intelligence: the background-blur model, the noise suppressor, the speech-to-text model behind captions, the translation model, the content-moderation classifier, the meeting-notes assistant. The AI plane is the set of specialists standing along the road. Crucially, these specialists do not all stand in the same place — some ride in the car with the speaker, some stand at the central junction, and some work from an office nearby. Deciding where each one stands is the core skill this article teaches.

The third is the control plane. This is the coordination layer: who is allowed in the room, who is speaking, who turned their camera off, when the AI assistant joins, which captions go to whom. The control plane is the race official with the clipboard. It carries almost no heavy data, so it has the loosest time budget, but it is what keeps the other two planes in sync. Mixing control logic into the media path — for example, asking the video server to also decide who may join — is a classic mistake that makes systems impossible to scale.

[Image: Layered architecture diagram of an AI-enhanced video call showing three horizontal planes. The media plane carries audio and video from sender devices through an SFU to receiver devices. The AI plane sits across three locations: on-device pre-processing on the sender, SFU-side processing on the server, and a cloud AI agent that joins as a participant. The control plane coordinates rooms, participants, and signaling underneath.]

Figure 1. The reference architecture. Every AI feature lives in one of three locations across the media and AI planes: on the sender's device, on the SFU, or inside a cloud AI agent that joins the call like any other participant.

The Media Plane: WebRTC And The SFU

Start with the road, because every AI feature attaches to it. In a call with more than two or three people, the participants do not send their video directly to each other — that would force every phone to upload its camera to everyone else at once, which drains batteries and saturates home internet connections. Instead, everyone sends one copy of their audio and video to a server in the middle, and that server forwards each stream on to the others. That server is a Selective Forwarding Unit, or SFU. The name describes the job: it selectively forwards media streams without re-processing them.

The phrase "without re-processing" is the important part. An older design, the MCU (Multipoint Control Unit), decoded everyone's video, mixed it into one combined picture, and re-encoded it, which is expensive work that needs a powerful server per call. An SFU does almost none of that. It receives each person's stream over the Real-time Transport Protocol, RTP (secured as SRTP so it is encrypted in transit), and relays the packets on, at most choosing which quality layer to forward when bandwidth is tight. Because it never transcodes, one SFU can handle far more participants for the same cost. This is why essentially every production video platform in 2026 (LiveKit, mediasoup, Janus, and the commercial SDKs built on them) uses the SFU model.
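The "choose a quality layer, never transcode" idea can be sketched in a few lines. This is a minimal illustration rather than how any particular SFU is implemented: the layer names, the bitrates, and the `pick_layer` helper are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Layer:
    name: str          # hypothetical quality-layer label
    bitrate_kbps: int  # illustrative bitrate, not a measured value


# Hypothetical layers a sender might publish, ordered best first.
LAYERS: List[Layer] = [
    Layer("high", 1500),
    Layer("medium", 500),
    Layer("low", 150),
]


def pick_layer(available_kbps: int, layers: List[Layer] = LAYERS) -> Optional[Layer]:
    """Return the best already-encoded layer that fits the receiver's bandwidth.

    The forwarder only selects among streams the sender encoded;
    it never decodes or re-encodes video itself.
    """
    for layer in layers:
        if layer.bitrate_kbps <= available_kbps:
            return layer
    return None  # nothing fits; a real system might forward audio only
```

The point of the sketch is that the per-receiver decision is a cheap comparison, which is why forwarding scales so much better than mixing.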

The audio itself is carried by Opus, the open, royalty-free voice and music codec standardized as IETF RFC 6716, which is what every WebRTC endpoint speaks. A codec is the method for compressing sound into a small stream of bytes and decompressing it on the other side. Opus matters to the AI plane for one reason we return to repeatedly: most audio AI models need the raw, uncompressed sound, not the Opus-compressed bytes. Where a feature sits relative to the Opus encoder decides whether it can work at all.

The practical takeaway is that the media plane is largely a solved, standardized problem. You do not invent it; you adopt WebRTC and an SFU. Lesson 6.2 on WebRTC AI APIs and SDK selection covers how to choose between building on raw WebRTC and buying a video SDK. The interesting decisions all live in where you attach the AI plane, which is the rest of this article.

The Three Homes For An AI Feature

Every AI feature in the call runs in one of exactly three places. Naming them precisely is the most useful thing you can do at the start of a project, because the choice drives cost, privacy, latency, and which devices you can support.

The first home is on the sender's device — in the browser tab or the phone app, before the audio and video are compressed and sent. This is where features that change what you transmit belong: background blur and replacement, noise suppression, beauty and AR filters. The reason is mechanical. These features must alter the raw camera frames and raw microphone sound, and the only place the raw signal exists is on the device that captured it. Run background blur on the sender's device and everyone downstream — the SFU, the other participants — receives the already-blurred video and never sees the real room. That is also the strongest privacy position: the unmodified signal never leaves the building.

The browser makes this possible through a small family of standards. WebCodecs gives low-level access to encode and decode frames, supported in Chrome, Edge, Firefox 130+, and Safari 26. MediaStreamTrackProcessor breaks a live camera or microphone track into individual frames you can edit. WebGPU, now supported across all major browsers, lets the editing run on the device's graphics chip so it is fast enough for live video. For audio, an AudioWorklet runs a small denoiser on the raw sound off the main thread. Lesson 6.3 on background blur and lesson 6.5 on noise suppression integration are the deep dives.

The second home is on the SFU, the forwarding server. A narrow set of features belongs here, for two reasons: they must inspect a stream even when the participant's device cannot be trusted or upgraded, and the SFU already has every stream passing through it. The clearest example is content moderation: you want to inspect frames and audio centrally so a malicious client cannot simply disable the check, which is why lesson 6.12 puts real-time moderation in the SFU. The trap to avoid is doing heavy per-stream AI on the SFU itself; the SFU's superpower is that it does not decode video, and a model that forces it to decode throws that efficiency away. The pattern that works is to have the SFU fan a copy of the stream out to a separate worker, not to run the model inline.

The third home is a cloud AI agent that joins the call as a participant. This is the most important architectural idea in the whole phase, and it is easy to miss. Instead of bolting AI onto the server's internals, you let a program join the room exactly like a human would — it subscribes to the audio and video streams, runs models on them, and publishes results (captions, a translated voice, a summary) back into the room. The captioner, the translator, and the meeting assistant all live here. This is the model LiveKit's Agents framework formalizes: an agent is "any Python or Node.js program added to a room as a full real-time participant." Because the agent is just another participant, it scales independently of the SFU, it can be written by your application team rather than your infrastructure team, and you can run one per room without touching the media server.

[Image: Decision matrix mapping seven AI call features to their correct home. Rows are background blur, noise suppression, AR/beauty filters, live captions, live translation, content moderation, and meeting notetaker. Columns are on-device, SFU-side, and cloud AI agent. The recommended cell in each row is highlighted, with short reasons noted.]

Figure 2. Where each feature belongs. Match the feature to the home that has the signal it needs and the trust boundary it requires, not to whatever layer is easiest to edit.
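The placement rule described above can be written down as a small decision function. This is a sketch under simplifying assumptions: the two boolean questions and the `place_feature` helper are illustrative reductions of the reasoning in this section, not a complete decision procedure.

```python
from enum import Enum


class Home(Enum):
    ON_DEVICE = "on-device"
    SFU_SIDE = "SFU-side"
    CLOUD_AGENT = "cloud AI agent"


def place_feature(alters_raw_signal: bool, must_resist_client_tampering: bool) -> Home:
    """Pick a home from two questions (an illustrative simplification).

    1. Does the feature change the raw camera or microphone signal?
       Then it must run where the raw signal exists: the sender's device.
    2. Must the check survive a malicious or outdated client?
       Then it is inspected centrally, on the SFU side.
    3. Otherwise it is language-level work for an agent in the room.
    """
    if alters_raw_signal:
        return Home.ON_DEVICE
    if must_resist_client_tampering:
        return Home.SFU_SIDE
    return Home.CLOUD_AGENT


# The seven features from the matrix: (alters_raw_signal, must_resist_client_tampering)
FEATURES = {
    "background blur": (True, False),
    "noise suppression": (True, False),
    "AR/beauty filters": (True, False),
    "live captions": (False, False),
    "live translation": (False, False),
    "content moderation": (False, True),
    "meeting notetaker": (False, False),
}

PLACEMENT = {name: place_feature(*answers) for name, answers in FEATURES.items()}
```

Writing the rule as code is mostly useful as a review aid: a new feature that does not fit either question cleanly is a signal to discuss the placement before building it.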

Walking The Pipeline, Feature By Feature

With the three homes defined, place each Phase 6 feature and the reasoning becomes mechanical. Follow the conversation from one person's camera to another's screen.

The signal begins at the sender's camera and microphone. The very first thing it meets is on-device pre-processing. The background-blur model replaces the pixels behind the person, using a segmentation model — one that decides, pixel by pixel, "this is the person, that is the wall" — running on WebGPU. In parallel, the noise suppressor cleans the microphone audio inside an AudioWorklet, removing keyboard clatter and background voices before the sound is compressed. If the product offers beauty filters or gaze correction, they run here too, covered in lesson 6.8 on AR effects. The golden rule for this stage: there must be exactly one noise suppressor in the chain. Browsers ship a free built-in one; if you add your own, you must turn the built-in one off, or the two fight and produce a robotic, underwater voice. Running two suppressors is the single most common audio bug we see in AI call products.
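The "exactly one noise suppressor" rule is easy to enforce with a configuration check. The sketch below assumes the client's capture settings are assembled as a dictionary shaped like the object a browser would pass to `getUserMedia`; `noiseSuppression` is the standard constraint name, while the helper functions and the assumption that the built-in suppressor is on unless explicitly disabled are illustrative.

```python
def audio_constraints(custom_suppressor_enabled: bool) -> dict:
    """Build getUserMedia-style audio constraints.

    If the product ships its own denoiser, the browser's built-in
    noise suppression is switched off so only one runs.
    """
    return {"audio": {"noiseSuppression": not custom_suppressor_enabled}}


def suppressor_count(constraints: dict, custom_suppressor_enabled: bool) -> int:
    """Count the suppressors that would be active in the capture chain.

    Assumption: the built-in suppressor counts as on unless the
    constraints explicitly disable it.
    """
    builtin_on = bool(constraints["audio"].get("noiseSuppression", True))
    return int(builtin_on) + int(custom_suppressor_enabled)


def is_valid_chain(constraints: dict, custom_suppressor_enabled: bool) -> bool:
    """The golden rule from the text: exactly one suppressor."""
    return suppressor_count(constraints, custom_suppressor_enabled) == 1
```

A check like this belongs in a unit test or a startup assertion, so that the double-suppressor bug is caught before anyone hears it.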

The cleaned, compressed streams travel to the SFU, which forwards them to the other participants and, critically, also makes them available to the AI agent. The SFU itself does almost no AI work. Its one AI-adjacent job in our reference design is to make a copy of each stream available for inspection so that the moderation worker can watch for unsafe content without slowing the live path.

The AI agent subscribes to the streams and does the linguistic heavy lifting. It runs streaming speech-to-text — automatic speech recognition, or ASR — to produce live captions, the subject of lesson 6.9 on SFU-side caption fan-out. "Streaming" here means the model emits words as they are spoken rather than waiting for the sentence to finish, which is what makes captions feel live. From the same transcript, a translation model produces captions or a synthesized voice in another language — lesson 6.11 on real-time speech translation — and the meeting assistant accumulates the transcript to produce a running summary and action items, the territory of lesson 6.14 on the LiveKit meeting assistant. If your product offers an AI avatar that lip-syncs a synthetic face to generated speech, it is published back into the room by an agent as well, per lesson 6.13 on in-call avatars.
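The way one transcript feeds captions, translation, and notes can be sketched as a single handler. This is not the LiveKit Agents API: the `Segment` and `AgentOutputs` types, the `handle_segment` function, and the `fake_translate` stand-in are all hypothetical names used to show the data flow only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Segment:
    speaker: str
    text: str  # a finalized piece of streaming ASR output


@dataclass
class AgentOutputs:
    captions: List[str] = field(default_factory=list)
    translated_captions: List[str] = field(default_factory=list)
    notes_transcript: List[str] = field(default_factory=list)


def handle_segment(seg: Segment, out: AgentOutputs,
                   translate: Callable[[str], str]) -> None:
    """Fan one transcript segment out to all three language features."""
    line = f"{seg.speaker}: {seg.text}"
    out.captions.append(line)                       # live caption overlay
    out.translated_captions.append(
        f"{seg.speaker}: {translate(seg.text)}")    # translated captions
    out.notes_transcript.append(line)               # accumulates for the summary


def fake_translate(text: str) -> str:
    """Stand-in for a real translation model call (assumption)."""
    return f"[es] {text}"
```

The design choice worth noticing is that speech recognition runs once; every later language feature is a consumer of the same transcript rather than a second pass over the audio.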

Finally the streams and the agent's published results arrive at each receiver's device, which decodes the audio and video and overlays the captions. Notice the symmetry: heavy pixel and audio work happens at the two ends where the raw signal lives, language work happens in the agent where a model can see the whole conversation, and the SFU in the middle stays lean. That shape is not an accident; it is what keeps the system inside its time and cost budgets.

Two Budgets You Must Keep: Interactive Media And The AI Voice Loop

There is no single latency number for an AI call, because two different loops run at once and they have different ceilings. Confusing them is how teams end up optimizing the wrong thing.

The first is the interactive media budget — how long your voice takes to reach the other person's ear. This is the budget that governs whether people talk over each other. The target for comfortable conversation is to keep one-way audio under about 200 ms and the full round trip under roughly 400 ms. On-device features spend into this budget directly: a denoiser that adds 20 ms and a blur that adds 15 ms are both in the live path. The arithmetic is simple and unforgiving. If network transport costs you 120 ms one way, you have about 80 ms left for capture, on-device AI, encoding, and playout combined. That is why on-device models must be small and fast, and why you never stack four of them carelessly.
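The arithmetic in this paragraph is worth keeping as an executable check. The sketch below uses the figures from the text (a 200 ms one-way target, 120 ms of network transport, a 20 ms denoiser, a 15 ms blur); the function names are illustrative.

```python
ONE_WAY_BUDGET_MS = 200  # comfortable one-way audio target from the text


def remaining_after_network(network_ms: int,
                            budget_ms: int = ONE_WAY_BUDGET_MS) -> int:
    """Milliseconds left for capture, on-device AI, encoding, and playout."""
    return budget_ms - network_ms


def remaining_after_features(network_ms: int, feature_costs_ms: dict,
                             budget_ms: int = ONE_WAY_BUDGET_MS) -> int:
    """Milliseconds left once the on-device AI features are also paid for."""
    spent_on_features = sum(feature_costs_ms.values())
    return remaining_after_network(network_ms, budget_ms) - spent_on_features


left = remaining_after_network(120)                       # 80
left_for_rest = remaining_after_features(
    120, {"denoise": 20, "blur": 15})                     # 45
```

With both features on, 45 ms remain for capture, encoding, and playout combined, which is why a third or fourth on-device model needs a measured justification.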

The second is the AI voice-loop budget — how long an AI assistant takes to hear a question and start answering. This loop is speech-to-text, then a language model, then text-to-speech, and its practical ceiling is about 800 ms before the assistant feels sluggish. A representative 2026 breakdown, drawn from production voice-agent platforms, looks like this: voice-activity detection and turn-taking decision, 150–300 ms; final speech-to-text transcript, 50–150 ms after the person stops speaking; language-model time-to-first-token, 150–400 ms; text-to-speech first audio chunk, 100–200 ms; network overhead, 30–80 ms. Add the low ends and you are near 480 ms; add the high ends and you blow past 1,000 ms. Let us show one path explicitly: 200 ms turn-taking + 100 ms transcript + 300 ms model first token + 150 ms speech first chunk + 50 ms network = 800 ms. That is the budget, fully spent, with nothing to spare.
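The sums quoted above can be reproduced directly. The ranges and the single explicit path are taken from the paragraph; the dictionary keys and variable names are only labels chosen for this sketch.

```python
# (low_ms, high_ms) per stage, as quoted in the breakdown above
STAGE_RANGES_MS = {
    "turn_taking": (150, 300),
    "stt_final_transcript": (50, 150),
    "llm_first_token": (150, 400),
    "tts_first_chunk": (100, 200),
    "network": (30, 80),
}

VOICE_LOOP_CEILING_MS = 800

best_case_ms = sum(low for low, _ in STAGE_RANGES_MS.values())     # 480
worst_case_ms = sum(high for _, high in STAGE_RANGES_MS.values())  # 1130

# The one explicit path worked through in the text
EXAMPLE_PATH_MS = {
    "turn_taking": 200,
    "stt_final_transcript": 100,
    "llm_first_token": 300,
    "tts_first_chunk": 150,
    "network": 50,
}
example_total_ms = sum(EXAMPLE_PATH_MS.values())                   # 800
headroom_ms = VOICE_LOOP_CEILING_MS - example_total_ms             # 0
```

Keeping the budget in a table like this makes it obvious which stage to attack first: any single stage drifting to its high end consumes headroom the example path does not have.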

The lever that hides in plain sight is turn-taking — deciding the person has actually finished speaking. A clumsy turn detector can add 500 ms of perceived delay that never appears in any model benchmark, because the time is lost waiting, not computing. Modern frameworks use a small transformer model trained to predict end-of-turn rather than a fixed silence timer, which is why the LiveKit Agents stack and similar systems ship turn detection as a first-class component. Streaming every stage — partial transcripts into the model, model tokens into the speech synthesizer before the sentence is done — is what lets a well-built loop come in under one second. Lesson 6.1 on the sub-100 ms latency budget breaks these numbers down stage by stage.

[Image: Two horizontal budget bars stacked. The top bar shows the interactive media one-way budget of about 200 milliseconds split into capture, on-device AI, encode, network, and playout segments. The bottom bar shows the AI voice loop budget of about 800 milliseconds split into turn-taking, speech-to-text, language model first token, text-to-speech first chunk, and network segments, with the cumulative total marked.]

Figure 3. The two budgets that run in parallel. The media loop (top) decides whether people talk over each other; the AI voice loop (bottom) decides whether the assistant feels responsive. They are tuned separately.

A Worked Example: One Call With Five Features

Numbers make the architecture concrete. Picture a 25-person company all-hands on a video platform, with five AI features switched on: background blur for everyone, noise suppression for everyone, live English captions, live Spanish translation for the six remote staff who prefer it, and an AI notetaker producing the summary. Where does the work actually land?

Background blur and noise suppression run on each of the 25 devices — 25 small models, but each one runs on its owner's hardware and costs you, the platform operator, nothing in server compute. This is the quiet superpower of on-device placement: it scales for free with the number of participants, because every new participant brings their own processor. The SFU forwards 25 inbound audio streams and a selection of video streams, transcoding none of them, which is why a single modest server can host the room.

The captions, translation, and notes run in one AI agent that joins the room. The agent runs streaming ASR on the active speaker — usually one person at a time in an all-hands, so you transcribe one stream, not 25. The English transcript feeds both the caption overlay and the Spanish translation, and accumulates into the notetaker's running summary. The cost you pay scales with minutes of speech processed and tokens generated, not with the number of silent listeners. A back-of-envelope figure: one hour of single-speaker ASR plus translation plus periodic summarization is a few cents to a few tens of cents of model cost in 2026, depending on the providers — a rounding error next to the value of a searchable, translated, summarized meeting. Lesson 6.18 on meeting-bot engineering breaks down how the commercial notetakers structure exactly this.

The lesson of the worked example is the cost shape. On-device features cost the operator nothing per participant; agent features cost per minute of speech and per token, not per attendee. A product that understands this prices and scales correctly; a product that runs everything in the cloud pays for 25 GPU slots when it needed one agent.
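The cost shape can be made concrete without inventing prices by counting server-side stream-minutes, that is, how many minutes of media some server has to run a model on. The 60-minute meeting length is an assumption for illustration, and `server_stream_minutes` is a hypothetical helper; real cost would multiply these counts by whatever your providers charge.

```python
def server_stream_minutes(meeting_minutes: int, streams_processed: int) -> int:
    """Minutes of media that server-side models must process."""
    return meeting_minutes * streams_processed


MEETING_MINUTES = 60  # assumed meeting length for illustration
ATTENDEES = 25        # the all-hands from the worked example

# On-device blur and denoise: every participant brings their own processor.
on_device = server_stream_minutes(MEETING_MINUTES, 0)                 # 0

# One agent transcribing the single active speaker.
agent_pattern = server_stream_minutes(MEETING_MINUTES, 1)             # 60

# Everything in the cloud: one processing slot per attendee.
everything_in_cloud = server_stream_minutes(MEETING_MINUTES, ATTENDEES)  # 1500

ratio = everything_in_cloud // agent_pattern                          # 25
```

The ratio is the number of attendees, which is the whole argument: the agent's load follows speech, while a cloud-only design follows headcount.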

Common Mistake: Putting AI In The Wrong Plane

The failure we are called to fix most often is not a bad model — it is a good model in the wrong home. Three versions recur.

The first is running an audio AI feature after the Opus encoder instead of before it. A team wires their noise suppressor into the WebRTC Encoded Transform path — the API that lets you edit encoded frames — and is baffled that it does nothing useful. The reason is that encoded frames are Opus-compressed bytes, not sound; a denoiser needs the actual waveform. Noise suppression must run on raw audio, in an AudioWorklet, before encoding. Encoded Transform is the right tool for a different job — adding an extra layer of end-to-end encryption that the SFU forwards without being able to read — but it is the wrong tool for anything that needs to understand the media.

The second is making the SFU decode video to run a model inline. The SFU's entire efficiency comes from forwarding without transcoding. The moment you force it to decode every participant's video to run detection, you have rebuilt the expensive MCU you were trying to avoid, and your per-room server cost multiplies. The fix is the agent pattern: fan a copy of the stream to a separate worker that decodes and runs the model, leaving the SFU lean.
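The fan-out pattern can be sketched as a tap that forwards every packet on the live path first and only then offers a copy to a bounded queue for a separate worker. The `StreamTap` class and its parameters are hypothetical and do not correspond to any SFU's real API; the important property is that a slow worker causes dropped copies, never a blocked call.

```python
import queue
from typing import Callable, List


class StreamTap:
    """Forward packets untouched and offer copies to an inspection worker."""

    def __init__(self, max_pending: int = 8) -> None:
        self._pending: "queue.Queue[bytes]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # copies skipped because the worker fell behind

    def forward(self, packet: bytes, send: Callable[[bytes], None]) -> None:
        send(packet)  # live path first: no decoding, no waiting
        try:
            self._pending.put_nowait(packet)  # non-blocking copy for the worker
        except queue.Full:
            self.dropped += 1  # degrade inspection, never the call

    def drain(self) -> List[bytes]:
        """Called by the separate worker, which decodes and runs the model."""
        items: List[bytes] = []
        while True:
            try:
                items.append(self._pending.get_nowait())
            except queue.Empty:
                return items
```

In a real deployment the worker would usually be a separate process or machine; the in-process queue here only stands in for that boundary.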

The third is stacking on-device models without a budget. Each on-device feature — blur, denoise, beauty filter — spends milliseconds in the live path. Add them without measuring and you quietly push one-way latency past the point where conversation breaks, and you melt the battery on mid-range phones. The discipline is to keep a written per-device budget and measure each feature's real cost on real hardware, exactly as lesson 6.1 prescribes. If the budget is full, a feature moves to the agent or gets cut; it does not get crammed into a path that cannot afford it.
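A written per-device budget can be enforced mechanically. In the sketch below, features are listed in priority order with their measured cost; anything that does not fit is returned as overflow to be moved to the agent or cut. The 20 ms and 15 ms figures echo the earlier example, while the 25 ms beauty-filter cost, the 45 ms budget, and the `admit_features` helper are assumptions for illustration.

```python
from typing import List, Tuple


def admit_features(candidates: List[Tuple[str, int]],
                   budget_ms: int) -> Tuple[List[str], List[str]]:
    """Admit features in priority order until the budget is spent.

    candidates: (name, measured_cost_ms) pairs, highest priority first.
    Returns (kept, overflow); overflow features move to the agent or get cut.
    """
    kept: List[str] = []
    overflow: List[str] = []
    spent_ms = 0
    for name, cost_ms in candidates:
        if spent_ms + cost_ms <= budget_ms:
            kept.append(name)
            spent_ms += cost_ms
        else:
            overflow.append(name)
    return kept, overflow


kept, overflow = admit_features(
    [("denoise", 20), ("blur", 15), ("beauty filter", 25)],
    budget_ms=45,
)
# kept == ["denoise", "blur"], overflow == ["beauty filter"]
```

The costs fed into a check like this should come from measurements on the lowest-end hardware you support, since that is where the budget breaks first.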

Build Order: Ship Value Early, Not All At Once

You do not build this system all at once, and you do not build it in the order the diagram is drawn. You build it in the order that ships usable value at each step, so that a working product exists from week one and every later addition is independently shippable.

The foundation is a plain, reliable call on WebRTC and an SFU, with no AI at all. Get audio and video solid, get reconnection working, get it stable on the devices your users actually have. Everything else attaches to this; if it is shaky, no amount of AI will save the product. This is also the right moment to choose build-versus-buy for the media layer, the decision lesson 6.2 frames.

The first AI layer is on-device cleanup: background blur and noise suppression. These are the features users notice immediately, they cost the operator nothing per participant, and they do not require any new backend service. They are the highest value-to-effort ratio in the whole system, which is why they come first.

The second layer is the AI agent for captions. Standing up one agent that joins the room and produces live captions is the moment the product becomes genuinely "AI-enhanced," and it establishes the agent pattern that every later language feature reuses. Once captions work, translation is a small extension of the same agent — the transcript already exists; you add a translation step. Meeting notes and summaries are another extension, reusing the same accumulated transcript.

The last layer is the specialized and the regulated: content moderation, in-call avatars, sales-coaching or domain-specific assistants. These are higher effort, narrower in appeal, or carry compliance weight, so they come once the core is proven. Moderation in particular is best added deliberately rather than rushed, because it touches both the trust model and, in regulated markets, the law. Note one hard requirement that spans the whole build: where your product generates or substantially alters a person's image or voice — an AI avatar, a cloned voice, a synthetic translation spoken in the user's own voice — the EU AI Act's Article 50 transparency rules require that this be disclosed to participants. Build the disclosure in from the start; it is far cheaper than retrofitting it.

[Image: Build-order diagram as a left-to-right staircase of four steps. Step one is a plain WebRTC plus SFU call. Step two adds on-device blur and noise suppression. Step three adds an AI agent for captions, then translation and notes as extensions. Step four adds moderation, avatars, and specialized assistants. Each step is labelled as independently shippable.]

Figure 4. The build staircase. Each step ships a working product on its own, so value compounds instead of waiting for a big-bang launch.

Build Versus Buy: Where The Line Falls In 2026

A reasonable team does not write all of this from scratch, and does not buy all of it either. The line in 2026 falls in a fairly stable place.

The media plane — WebRTC plus an SFU — is something you almost always adopt rather than build, either as open source you host (LiveKit, mediasoup) or as a managed SDK. Real-time media at scale involves a long tail of network edge cases, mobile quirks, and reconnection logic that takes years to get right; there is no business reason to re-learn it. The agent framework is increasingly something you adopt too: the LiveKit Agents SDK and its peers already solve turn detection, interruption handling, and the streaming STT-LLM-TTS plumbing that is tedious and error-prone to build yourself.

What you build is the part that is your product: which features you expose, how they are tuned for your vertical, the business rules in your agent, and the user experience around them. What you buy as a service is usually the individual models — the ASR provider, the translation model, the text-to-speech voice — because these improve monthly and self-hosting them only pays off at large, steady volume. The decision for each model is the build-versus-buy framing of lesson 6.6 on Krisp, Maxine, and Dolby: buy for speed and breadth of device support, self-host when volume is high and predictable and privacy rules demand the data never leave your servers.

Where Fora Soft Fits In

Fora Soft has built real-time video software since 2005, and the assembly described here — WebRTC media, an SFU, and AI features placed across device, server, and agent — is the backbone of the conferencing, e-learning, and telemedicine products we ship. We have integrated on-device background processing and noise suppression into live calls, stood up AI agents that produce live captions and summaries, and added content moderation to user-generated video without collapsing the call's latency budget. The three-plane discipline in this article is not theory for us; it is the checklist we apply when scoping a new build, because it is the difference between a feature that ships and one that quietly makes every call worse. Our verticals here are video conferencing, e-learning, telemedicine, and surveillance, where real-time AI on live video is the core of the product rather than a decoration.

What To Read Next

  • Talk to a video engineer about scoping an AI-enhanced calling product: /contact
  • See our case studies in conferencing, e-learning, and telemedicine: /cases
  • Download the AI Video Call Architecture Blueprint (one page): the blueprint PDF

References

  1. W3C — WebRTC 1.0: Real-Time Communication Between Browsers (W3C Recommendation, 26 January 2021; SFU/peer model, RTCRtpSender/Receiver, media transport). https://www.w3.org/TR/webrtc/
  2. W3C — WebRTC Encoded Transform (W3C Working Draft; RTCRtpScriptTransform manipulates encoded frames between encoder/packetizer and depacketizer/decoder; replaces the deprecated Insertable Streams API). https://www.w3.org/TR/webrtc-encoded-transform/
  3. W3C — MediaStreamTrack Insertable Media Processing using Streams (MediaStreamTrackProcessor / MediaStreamTrackGenerator; bridge between WebRTC tracks and WebCodecs). https://www.w3.org/TR/mediacapture-transform/
  4. W3C — WebCodecs (low-level per-frame encode/decode; processing via JS, WASM, WebGPU, or WebNN). https://www.w3.org/TR/webcodecs/
  5. IETF — RFC 6716: Definition of the Opus Audio Codec (September 2012; the royalty-free voice/music codec used by every WebRTC endpoint). https://www.rfc-editor.org/rfc/rfc6716
  6. IETF — RFC 8825: Overview: Real-Time Protocols for Browser-Based Applications (the WebRTC architecture overview RFC). https://www.rfc-editor.org/rfc/rfc8825
  7. IETF — RFC 3550: RTP — A Transport Protocol for Real-Time Applications (RTP/SRTP, the media transport an SFU forwards). https://www.rfc-editor.org/rfc/rfc3550
  8. LiveKit — Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained (streaming pipeline, turn detection transformer, interruption handling, sub-1s end-to-end). https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained
  9. LiveKit — Agents documentation (an agent is any Python/Node program joining a room as a full real-time participant; built-in turn detection and noise cancellation). https://docs.livekit.io/agents/
  10. Smallest.ai — Designing Voice Assistants: STT, LLM, TTS, Tools, and Latency Budget (2026 800 ms budget breakdown: VAD 50 ms, STT 150 ms, LLM TTFT 400 ms, TTS first chunk 150 ms, network 50 ms). https://smallest.ai/blog/designing-voice-assistants-stt-llm-tts-tools-and-latency-budget
  11. Retell AI — How Real-Time Voice AI Actually Works (STT → LLM → TTS) (network 30–80 ms, turn-taking 150–300 ms, STT 50–100 ms, LLM TTFT 150–400 ms). https://www.retellai.com/blog/how-real-time-voice-ai-works-stt-llm-tts
  12. web.dev — WebGPU is now supported in major browsers (Chrome/Edge 113, Android Chrome 121, Safari and Firefox support). https://web.dev/blog/webgpu-supported-major-browsers
  13. getstream.io — Selective Forwarding Unit (SFU) (SFU forwards per-stream RTP/SRTP without transcoding; contrast with MCU mixing). https://getstream.io/resources/projects/webrtc/architectures/sfu/
  14. European Union — Regulation (EU) 2024/1689 (AI Act), Article 50 — Transparency obligations (disclosure of AI-generated or manipulated audio/video to affected persons). https://eur-lex.europa.eu/eli/reg/2024/1689/oj