AI in Video Conferencing — Engineering Playbook

Why this matters

Every video product now competes on AI features, and conferencing is where that competition is fiercest: the AI meeting-assistant market sat around $3.5 billion in 2025 and is growing roughly 26% a year, with surveys reporting that 62% of users save about four hours a week once an assistant is in their meetings. If you build or own a conferencing, telehealth, e-learning, or sales-call product, two questions are now on your roadmap — "which AI features do we add, and in what order?" and "do we buy a notetaker or build the capability into our own app?" This playbook answers both for the conferencing vertical specifically. It is written so a product manager can plan the feature set and the privacy posture without an engineering degree, and so an engineer can see exactly where each feature executes and what it costs. It is the vertical map; the deeper Phase 6 lessons in this section are the per-feature engineering manuals it links out to.

What "AI in video conferencing" actually means

Strip away the marketing and a video call is a stream of audio and video moving between people in real time. "AI in video conferencing" means inserting a model somewhere in that stream — to change the media, to read it, or to act on it — fast enough that the call still feels live.

That is the whole idea, and naming the three verbs keeps the rest of this playbook clear. A model can change the media: remove background noise, blur what is behind you, replace your background, smooth your lighting. A model can read the media: turn speech into captions, translate it, label who is speaking, write a summary. Or a model can act on what it read: surface an answer mid-call, draft the follow-up email, update the CRM. Almost every conferencing AI feature on the market is one of these three verbs applied to the live stream.

The reason this matters for planning is that the three verbs have very different engineering costs. Changing the media has to happen in well under a tenth of a second or the call feels broken. Reading the media can lag a little — a caption that arrives a quarter-second late is fine. Acting on the media can take whole seconds, because the human is asking for it and expects a beat. Sort your feature wishlist by verb first, and the deadlines sort themselves.

The feature menu, grouped by the job it does

Buyers think in features; engineers should think in jobs. Four jobs cover essentially everything a conferencing product ships, and grouping by job, not by vendor, keeps the roadmap honest.

Figure 1. The conferencing AI feature menu. Four jobs, four very different latency deadlines — plan in this order, not by product name.

The first job is cleaning the call: noise suppression that strips keyboard clatter and barking dogs, echo cancellation that stops the feedback loop when someone uses speakers, background blur and replacement for privacy, and auto-framing that keeps a face centered. These run on every frame and every audio packet, so they live under the tightest deadline in the whole product. Modern noise suppression cuts 10–20 decibels of background noise — a decibel is the standard unit for how loud a sound is, and 10 decibels is roughly "half as loud" to the human ear — while leaving speech intelligible.
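
To make the decibel figures concrete, here is a minimal sketch of the arithmetic. It assumes the standard definition of the decibel as ten times the base-10 logarithm of a power ratio, plus the rule of thumb used above that perceived loudness roughly halves for every 10 decibels removed. The function names are illustrative, not taken from any noise-suppression library.

```python
def power_ratio(db_reduction: float) -> float:
    """Factor by which sound power drops for a given reduction in decibels."""
    return 10 ** (db_reduction / 10)


def perceived_loudness_fraction(db_reduction: float) -> float:
    """Rule-of-thumb loudness remaining: halves for every 10 dB removed."""
    return 0.5 ** (db_reduction / 10)


for db in (10, 20):
    print(
        f"{db} dB cut: noise power divided by {power_ratio(db):.0f}, "
        f"about {perceived_loudness_fraction(db):.0%} as loud"
    )
```

Read this way, the low end of the range (10 decibels) leaves a tenth of the noise power and sounds about half as loud, while the high end (20 decibels) leaves a hundredth of the power and sounds about a quarter as loud.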

The second job is making the call understood: live captions, real-time translation, and speaker labels. Captions are produced by automatic speech recognition, usually shortened to ASR, which is simply software that turns spoken audio into written words. Accuracy is usually reported as word error rate, the share of words the engine gets wrong: on clean English audio the best engines now come in under 5%, and on a messy multi-speaker call they generally stay under 10%. Translation adds a step and a little more delay: Google Meet's Gemini-powered translated captions cover 60-plus languages, and the partial caption typically lands in under half a second.
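
The "percent of words wrong" figures quoted for ASR engines are normally computed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the engine's output into the reference transcript, divided by the number of reference words. The sketch below shows one way to compute it. The normalization (lowercasing and splitting on whitespace) is a simplifying assumption; production evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing (deletion)
                    curr[j - 1] + 1,     # extra hypothesis word (insertion)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "please send the contract by friday",
    "please send a contract friday",
)
print(f"WER: {wer:.1%}")
```

In this made-up example the engine substitutes one word and drops another, so two errors over six reference words give a WER of about 33%. Scoring your own recordings this way, messy calls included, is a more reliable comparison than a vendor's clean-audio number.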

The third job is capturing the knowledge, and this is where the money is. The AI notetaker and the AI meeting assistant both do the same thing in plain terms: they keep a searchable transcript of the call and write a short summary with the action items pulled out. These are the features buyers actually search for — "ai notetaker" alone draws thousands of searches a month — and they are the reason this lesson exists. We give them their own deep section below.

The fourth job is assisting people live: a copilot that answers a question mid-call, drafts the follow-up while the call is still going, or updates a sales record without anyone typing. This is the newest and least mature group, and because the human explicitly asks for it and waits for the reply, it can take a second or two without feeling broken.

The one decision that runs through every feature: where does it run?

Here is the question that decides more than any model choice: for each feature, which machine does the computing? There are only three answers, and every conferencing AI feature lives on one of them.

Figure 2. The three execution tiers. The spine of the whole playbook — every feature is an answer to "device, media server, or cloud?"

The first place is the user's own device — their browser or app, using the laptop's processor or graphics chip. Background blur and noise suppression usually run here. The win is that the audio and video never leave the user's machine to be cleaned, so latency is near zero and privacy is excellent. The catch is that you are at the mercy of whatever hardware the user has, and the work is done once per person rather than once for the room.

The second place is your media server — the piece of infrastructure that already sits in the middle of every group call. In WebRTC, the open standard that browsers use for real-time calls, that middle box is usually a Selective Forwarding Unit, or SFU: a server that receives everyone's audio and video and forwards the right streams to the right people. Because the SFU already has everyone's audio, it is the natural place to transcribe once and send the captions to everyone — the SFU-side captions pattern — and the natural home for recording and the notetaker. The win is one transcription for the whole room, under your control. The cost is that you run and pay for that compute, and the audio passes through your servers.
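
The SFU-side captions pattern can be sketched in a few lines: transcribe each audio chunk once on the server, then deliver the same caption to every subscriber. Everything here is hypothetical scaffolding. `CaptionFanout`, `fake_transcribe`, and the callback shape are invented for illustration and do not correspond to any particular SFU's API; a real system would plug in a streaming ASR client and a data channel.

```python
from typing import Callable, Dict, List


class CaptionFanout:
    """Transcribe once per audio chunk, then fan the caption out to the room."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self._transcribe = transcribe
        self._subscribers: Dict[str, Callable[[str], None]] = {}

    def subscribe(self, participant_id: str, deliver: Callable[[str], None]) -> None:
        self._subscribers[participant_id] = deliver

    def on_audio(self, speaker_id: str, chunk: bytes) -> int:
        text = self._transcribe(chunk)  # one ASR call for the whole room
        caption = f"{speaker_id}: {text}"
        for deliver in self._subscribers.values():
            deliver(caption)
        return len(self._subscribers)


asr_calls: List[bytes] = []


def fake_transcribe(chunk: bytes) -> str:
    """Stand-in for a real ASR engine: pretends the audio bytes are text."""
    asr_calls.append(chunk)
    return chunk.decode("utf-8")


fanout = CaptionFanout(fake_transcribe)
inboxes: Dict[str, List[str]] = {"ana": [], "ben": [], "chen": []}
for participant_id, inbox in inboxes.items():
    fanout.subscribe(participant_id, inbox.append)

delivered = fanout.on_audio("ana", b"hello everyone")
print(f"{len(asr_calls)} ASR call, {delivered} captions delivered")
```

The point of the design is the ratio: recognition cost grows with the number of speakers, not with the number of viewers, which is the opposite of running captions on every device.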

The third place is a cloud AI API — a service such as a hosted speech-recognition or language-model provider that you send audio or text to and get a result back. Heavy transcription, translation, summaries, and the live copilot usually run here because the quality is highest and you maintain no models yourself. The price is a per-minute fee, the extra round-trip over the network, and the fact that the data leaves your perimeter — which, as we will see, is exactly where the legal questions begin.

Almost no real product picks one tier. A typical conferencing app blurs backgrounds on the device, transcribes on the SFU or a cloud ASR, and summarizes in a cloud language model. The skill is matching each feature to the right tier, and the rule of thumb follows the verbs from earlier: change-the-media features want the device, read-the-media features want the SFU or cloud, act-on-the-media features want the cloud.
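
The rule of thumb is small enough to write down as a lookup table. This is a sketch only: the wishlist entries and the enum names are illustrative, and the mapping gives the preferred starting tier for each verb, not a hard constraint.

```python
from enum import Enum
from typing import Tuple


class Verb(Enum):
    CHANGE = "change the media"
    READ = "read the media"
    ACT = "act on the media"


class Tier(Enum):
    DEVICE = "device"
    MEDIA_SERVER = "media server (SFU)"
    CLOUD = "cloud AI API"


# The rule of thumb from the text: which tiers each verb prefers.
PREFERRED_TIERS = {
    Verb.CHANGE: (Tier.DEVICE,),
    Verb.READ: (Tier.MEDIA_SERVER, Tier.CLOUD),
    Verb.ACT: (Tier.CLOUD,),
}

# An illustrative wishlist, tagged by verb first.
WISHLIST = {
    "background blur": Verb.CHANGE,
    "noise suppression": Verb.CHANGE,
    "live captions": Verb.READ,
    "meeting summary": Verb.READ,
    "live copilot": Verb.ACT,
}


def preferred_tiers(feature: str) -> Tuple[Tier, ...]:
    return PREFERRED_TIERS[WISHLIST[feature]]


for feature in WISHLIST:
    tiers = " or ".join(tier.value for tier in preferred_tiers(feature))
    print(f"{feature}: {tiers}")
```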

The money features up close: AI notetaker and AI meeting assistant

The notetaker and the meeting assistant are the features your buyers name, so they deserve precision. In everyday use the two terms blur together, but there is a useful distinction. A notetaker is the narrower job: capture the call, transcribe it, and produce notes and a summary afterward. Its output is a document. A meeting assistant is the broader role: everything the notetaker does, plus live help during the call and actions after it, such as drafting the follow-up or syncing to your CRM. Its output is a document plus activity. Every meeting assistant contains a notetaker; the assistant just does not stop at notes.

For a conferencing product, there are three ways to put this capability in front of users, and they map cleanly onto where the capture happens.

Figure 3. Three routes to a meeting assistant. Only building it natively keeps the transcript, the summary, and the data inside your own product.

The first route is to use the platform's built-in assistant. Zoom has AI Companion, Microsoft Teams has Copilot, and Google Meet has "take notes for me," powered by Gemini. Zoom's assistant transcribes and translates across 30-plus languages and is included with paid plans; Google's writes structured notes straight into a Google Doc. This is the right answer if your team simply meets on one of these platforms and wants notes. It is the wrong answer if you are building a product, because the assistant is locked to that platform and its output never lands inside your app.

The second route is to bolt on an external bot. Tools like Otter, Fireflies, and Fathom (or an infrastructure API like Recall.ai that you build on) send a software participant into the call. Underneath, that "bot" is usually a headless browser, a real web browser running on a server with no screen, that joins the meeting link over WebRTC just like a human's browser would. It is fast to integrate and works across every platform, but it is visible in the participant list, and the audio flows through a third party. That visibility matters for buyers wrestling with what the industry now calls "bot fatigue," a measurable backlash against visible note-taking bots in sensitive calls. The full per-tool engineering of these bots is its own lesson; see the Otter / Fireflies / Fathom deep-dive and the transcription tooling landscape.

The third route, and the only one that makes the capability part of your product, is to build it into your own real-time pipeline. When the meeting already happens inside your app, you do not need a bot to join from outside — your server is already in the call. You add a silent participant that listens, transcribes, and summarizes, and you render the result on your own dashboard. The standard way to do this in 2026 is a LiveKit agent: LiveKit is the open-source WebRTC framework that also runs real-time audio for products like ChatGPT's voice mode, and its agents framework lets a Python or Node program join a call as a full participant. The agent transcribes through a streaming ASR engine, summarizes with a language model, and writes back to your app — the full note-taker build is its own walkthrough. This route costs the most engineering and the least per seat, keeps the data inside your perimeter, and lets you design consent in from the first sprint.
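
The shape of the native pipeline (listen, transcribe, summarize, write back) can be sketched without committing to any framework. The code below deliberately does not use the LiveKit API. `NoteTaker`, `Segment`, and the injected `summarize` callable are hypothetical names, and the fake summarizer stands in for a real language-model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Segment:
    speaker: str
    text: str


@dataclass
class NoteTaker:
    """Silent participant: collects final transcripts, summarizes at the end."""

    summarize: Callable[[str], str]  # hypothetical LLM call, injected by the app
    segments: List[Segment] = field(default_factory=list)

    def on_final_transcript(self, speaker: str, text: str) -> None:
        # Called by the streaming ASR layer whenever an utterance is finalized.
        self.segments.append(Segment(speaker, text))

    def transcript(self) -> str:
        return "\n".join(f"{s.speaker}: {s.text}" for s in self.segments)

    def finish(self) -> Dict[str, str]:
        prompt = (
            "Summarize this meeting and list the action items:\n"
            + self.transcript()
        )
        # The result is what the app would store and render on its own dashboard.
        return {"transcript": self.transcript(), "summary": self.summarize(prompt)}


def fake_summarize(prompt: str) -> str:
    """Stand-in for a language model: reports how many utterances it saw."""
    utterances = len(prompt.splitlines()) - 1
    return f"(summary of {utterances} utterances)"


notetaker = NoteTaker(summarize=fake_summarize)
notetaker.on_final_transcript("Ana", "Let's ship the pilot on Monday.")
notetaker.on_final_transcript("Ben", "I will send the contract today.")
notes = notetaker.finish()
print(notes["summary"])
```

Injecting the ASR callback and the summarizer keeps the capture logic independent of any one vendor, so the cloud provider can be swapped without touching the part that owns the data.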

A worked latency budget — so the deadlines stop being abstract

"Real-time" is a number, not a vibe, and the cheapest way to avoid building something that feels broken is to add up the milliseconds before you write code. Here is the budget for a live caption — the audio leaves a speaker's mouth and the words must appear on everyone's screen before the lag becomes annoying.

Figure 4. A live-caption latency budget. Every stage costs milliseconds; the sum is the only number that decides whether the feature feels live.

Walk it through with the arithmetic shown. Capturing and encoding the audio on the device costs about 30 milliseconds — a millisecond is one-thousandth of a second. The network hop from the device to your media server adds about 40 milliseconds. The speech-recognition engine needs roughly 150 milliseconds to emit a first partial caption. Fanning that caption back out to every participant costs another 40 milliseconds, and drawing the text on each screen costs about 20.

Add them: 30 + 40 + 150 + 40 + 20 = 280 milliseconds. The comfort line for live captions is about 300 milliseconds, so this budget fits — barely. Now notice the lever. The single biggest line item is the 150 milliseconds of recognition. If you had instead sent the audio to a distant cloud ASR with a 120-millisecond round trip on top, the total would be 400 milliseconds and the captions would feel laggy. That is why "where does it run" is the decision that matters: moving the recognition from a far cloud to your own server, or onto the device, is the difference between a feature that feels live and one that does not. The same budgeting discipline applies to every real-time feature; the full method is in the sub-100-millisecond latency budget lesson.
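
The same budget can live in code, which makes it easy to rerun whenever a stage changes. This is a minimal sketch that uses the stage figures from the walkthrough above; the dictionary keys and function names are illustrative.

```python
# Stage costs in milliseconds, taken from the walkthrough above.
CAPTION_STAGES_MS = {
    "capture and encode on device": 30,
    "device to media server": 40,
    "ASR first partial": 150,
    "fan-out to participants": 40,
    "render on screen": 20,
}
COMFORT_LINE_MS = 300  # approximate comfort line for live captions


def total_ms(stages: dict, extra_round_trip_ms: int = 0) -> int:
    """Sum every stage, plus any additional network round trip."""
    return sum(stages.values()) + extra_round_trip_ms


def fits(total: int, limit: int = COMFORT_LINE_MS) -> bool:
    return total <= limit


near = total_ms(CAPTION_STAGES_MS)
far_cloud = total_ms(CAPTION_STAGES_MS, extra_round_trip_ms=120)
largest_stage = max(CAPTION_STAGES_MS, key=CAPTION_STAGES_MS.get)

print(f"ASR near the call: {near} ms, fits: {fits(near)}")
print(f"ASR in a distant cloud: {far_cloud} ms, fits: {fits(far_cloud)}")
print(f"Largest line item: {largest_stage}")
```

Replacing the assumed figures with measurements from your own stack turns this from an estimate into a regression check: any change that pushes the total past the comfort line shows up before users notice it.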

What it costs, and the build-versus-buy arithmetic

Cost in conferencing AI comes in two shapes, and confusing them is how budgets blow up. Bought tools charge per seat: a flat monthly fee per user, predictable and cheap for a normal team, expensive across a large org where many seats barely meet. Built-in capability charges per minute of usage: you pay a cloud provider for the audio it transcribes and the tokens a model generates, which costs nothing when idle and scales with real use.

The table makes the shapes concrete with 2026 figures; verify current numbers before committing, because this category re-prices often.

Approach	What you pay	Good when	The catch
Platform built-in (Zoom / Teams / Meet)	Bundled in the plan you already buy	You meet on one platform and just want notes	Locked to that platform; nothing lands in your product
External bot (Otter / Fireflies / Fathom)	~$8–19 per user per month	Small team, many platforms, fast rollout	Visible bot; data through a third party; per-seat cost grows
Build-it capture API (Recall.ai)	~$0.50 per recording hour + ~$0.15 per transcription hour	A product that transcribes only when users hold calls	You still build the summary, storage, and UI
Native pipeline (your SFU + LiveKit agent)	Your cloud ASR + LLM usage + infra	The meeting is your product; data must stay inside	Most engineering up front; you operate it

Run the build-versus-buy numbers, do not guess them. Suppose a sales platform whose customers collectively hold 10,000 hours of recorded calls a month builds on a capture API at roughly $0.50 capture plus $0.15 transcription per hour. That is 10,000 × ($0.50 + $0.15) = 10,000 × $0.65 = $6,500 a month for capture and transcription, before the summary model and storage. Whether that beats per-seat pricing depends entirely on how many seats those 10,000 hours represent — which is the whole reason to do the multiplication rather than eyeball it. The complete cost model, including the language-model token math, is in the real cost of AI in video products lesson.
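
The multiplication, and the break-even it implies, can be sketched as follows. The rates are the approximate figures from the table above and should be treated as assumptions to re-verify. The usage figure excludes the summary model and storage, which per-seat tools bundle, so the true break-even seat count is higher than what this prints.

```python
def usage_cost_per_month(
    hours: float,
    capture_per_hour: float = 0.50,
    transcription_per_hour: float = 0.15,
) -> float:
    """Monthly capture-API bill for capture plus transcription only."""
    return hours * (capture_per_hour + transcription_per_hour)


def break_even_seats(monthly_usage_cost: float, price_per_seat: float) -> float:
    """Seat count at which per-seat spend equals the usage bill."""
    return monthly_usage_cost / price_per_seat


monthly = usage_cost_per_month(10_000)
print(f"Capture + transcription: ${monthly:,.0f} per month")

# Low and high ends of the per-seat price range in the table.
for seat_price in (8, 19):
    seats = break_even_seats(monthly, seat_price)
    print(f"At ${seat_price} per seat, break-even is {seats:,.1f} seats")
```

With these inputs, usage pricing wins once the 10,000 hours are spread across more than about 812 seats at $8 a seat, or about 342 seats at $19 a seat. Below those counts, the per-seat tool is cheaper on capture and transcription alone.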

Common pitfall: treating the AI assistant as a feature you switch on, not a system you run. Teams scope the notetaker as "add transcription and a summary" and ship it, then discover the real work was everywhere else: handling the call where two people talk over each other, getting speaker labels right, storing transcripts under a retention policy, and — the one that bites hardest — capturing consent. On clean audio every ASR engine is within a few points of every other, so the demo always looks great. The product breaks on the messy call, the multilingual call, and the legal review. Budget for the system around the model, not just the model.

The part that turns a feature into a legal duty: consent

This is where a conferencing roadmap quietly becomes a legal one, and 2026 raised the stakes. Treat what follows as engineering-relevant context, not legal advice — confirm specifics with a qualified lawyer for your jurisdiction.

Recording a conversation is regulated. In the United States, the rule splits by state: 39 states plus the District of Columbia allow one-party consent, meaning one person in the conversation can agree to record, while 11 states, including California, Illinois, and Pennsylvania, require all-party consent, meaning everyone must agree first. Under the European Union's General Data Protection Regulation, the GDPR, capturing someone's voice needs a lawful basis, and where that basis is consent, the consent must be clear and informed. The trap that catches teams: a visible bot in the participant list is not, by itself, legal consent. No major jurisdiction treats "they could see the notetaker" as the informed agreement the law requires.

Two newer risks deserve a flag for anyone building this. First, voiceprints are biometric data: the speaker-labeling step that powers "who said what" builds a voice signature, and in 2026 that signature is increasingly treated as a protected biometric identifier under laws such as Illinois's Biometric Information Privacy Act, which can trigger explicit opt-in duties. Second, litigation is live: class-action suits filed in late 2025 — Brewer v. Otter.ai in California and Cruz v. Fireflies.AI in Illinois — allege exactly these failures, that meeting bots intercepted communications and harvested voice data without all-party consent. Whatever the outcomes, they have made enterprise legal teams cautious and are a large part of why building consent natively into your own pipeline is now a selling point, not just a safeguard.

If you build the assistant into your product, design consent in from the start: disclose the AI before it engages, capture agreement, and give users a real way to decline. In the EU, the AI Act's Article 50 makes that disclosure a transparency obligation, not a nicety — the same discipline covered in the disclosure-engineering lesson and the EU AI Act regulatory lesson. Building natively is the route that lets you satisfy these duties cleanly, because the consent prompt is part of your product flow rather than an outside bot's.
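
One way to make "disclose, capture agreement, allow decline" enforceable is to gate capture on an explicit check. The sketch below takes the conservative engineering default of requiring disclosure and agreement from every participant, which avoids having to infer which jurisdiction's rule applies to each person. That default is an assumption of this example, not a statement of what any law requires, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    name: str
    disclosed: bool = False  # was the AI disclosed before it engaged?
    agreed: bool = False     # did this person explicitly agree?


def consent_blockers(participants: List[Participant]) -> List[str]:
    """Names of participants who have not both seen the disclosure and agreed."""
    return [p.name for p in participants if not (p.disclosed and p.agreed)]


def may_start_capture(participants: List[Participant]) -> bool:
    """Capture stays off until everyone in the call has agreed."""
    return len(participants) > 0 and not consent_blockers(participants)


call = [
    Participant("Ana", disclosed=True, agreed=True),
    Participant("Ben", disclosed=True, agreed=False),  # Ben declined
]
print(may_start_capture(call), consent_blockers(call))

call[1].agreed = True  # Ben changes his mind and opts in
print(may_start_capture(call), consent_blockers(call))
```

Because a decline simply leaves the gate closed, the "real way to decline" costs nothing extra to implement: the transcription path never starts. Storing who agreed, and when, alongside the transcript is a natural extension for audit purposes.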

The playbook: a short path from wishlist to working feature

Put the pieces together and a conferencing AI roadmap reduces to four questions asked in order, per feature.

Figure 5. The playbook in one path. Job sets the deadline, deadline sets the tier, product-fit sets build-versus-buy, and consent gates every launch.

First, which job is it — clean, understand, capture, or assist? The job sets the latency deadline, and the deadline rules out tiers that cannot meet it. Second, where should it run — if the deadline is under about a tenth of a second, the work belongs on the device; if a small lag is fine and the whole room needs the same result, the media server; if it is heavy and a beat of delay is acceptable, a cloud API. Third, build or buy — if the output must live inside your product, build it natively into your pipeline; if you just need the capability beside your meetings, buy a tool or bolt on a bot. Fourth, and without exception, the consent gate — disclose the AI, capture agreement, and respect all-party-consent states, the GDPR, and the EU AI Act before the feature ships. Every feature passes through that gate; none skips it.
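
The four questions can be run as a small function per feature. This is a sketch under stated assumptions: the 100-millisecond threshold is the "about a tenth of a second" from the text, while the field names, the example features, and their deadlines are illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Feature:
    name: str
    job: str                     # clean, understand, capture, or assist
    deadline_ms: int             # how late the result may arrive
    same_result_for_room: bool   # does everyone need the identical output?
    must_live_in_product: bool   # must the output land inside your own app?
    consent_designed: bool       # disclosure, agreement, and decline in place?


def choose_tier(feature: Feature) -> str:
    if feature.deadline_ms < 100:
        return "device"
    if feature.same_result_for_room:
        return "media server"
    return "cloud API"


def plan(feature: Feature) -> Dict[str, object]:
    return {
        "job": feature.job,
        "tier": choose_tier(feature),
        "sourcing": "build natively" if feature.must_live_in_product else "buy or bolt on",
        "ready_to_ship": feature.consent_designed,  # the gate nothing skips
    }


blur = Feature("background blur", "clean", 50, False, True, True)
captions = Feature("live captions", "understand", 300, True, True, False)
copilot = Feature("live copilot", "assist", 2000, False, False, True)

for feature in (blur, captions, copilot):
    print(feature.name, plan(feature))
```

Note that the captions example comes back with `ready_to_ship` set to false even though its tier and sourcing are settled, which mirrors the rule that consent gates every launch.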

That is the entire playbook. The Phase 6 lessons in this section are the detailed manuals for each box — background blur, noise suppression, live captions, real-time translation, and the WebRTC-plus-AI integration patterns — and this playbook is the index that tells you which one to open and in what order.

Where Fora Soft fits in

We build the conferencing products these AI features live inside — video meeting platforms, telemedicine consultation apps, e-learning classrooms, and sales-call tools — so we run this playbook with clients regularly. When a feature only needs to sit beside a client's meetings, we help them pick the right tool and the right tier and integrate it. When it has to live inside the product — a telehealth visit note generated on the platform itself, live captions fanned out to every student in a webinar, a sales summary rendered on a client's own dashboard — we build it on a real-time WebRTC pipeline, usually with a LiveKit agent as a silent participant, so the capture is native to the app rather than bolted on by an outside bot, with consent and disclosure designed in from the first sprint. The four questions in this playbook are the same ones we weigh in scoping calls when a client asks whether to buy a notetaker or own the capability.

References

Grand View Research — "AI Meeting Assistant Market Size, Share & Trends Analysis Report." Market sizing (~$3.5B in 2025) and ~25.8% CAGR for the AI meeting-assistant category; adoption drivers. Tier 7 (analyst). https://www.grandviewresearch.com/industry-analysis/ai-meeting-assistant-market-report
Laxis — "The State of Meeting Note-Taking 2026: AI Adoption, ROI & Market Benchmarks." AI note-taking segment crossing ~$740M in 2026; ~62% of users saving ~4 hours/week; privacy as the top adoption barrier. Tier 7. https://www.laxis.com/blog/state-of-meeting-note-taking-2026/
W3C Recommendation — "WebRTC: Real-Time Communication in Browsers" (13 March 2025). The finished real-time transport every browser-based call and meeting bot uses to send and receive audio and video; the basis of the SFU media-server tier. Official standard (final Recommendation). https://www.w3.org/TR/webrtc/
W3C — "WebRTC Encoded Transform" (Working Draft, 2025). The browser API that lets code touch each audio/video frame in the call — the mechanism behind on-device noise suppression, blur, and client-side captions. Official standard (W3C Working Draft; status noted). https://www.w3.org/TR/webrtc-encoded-transform/
IETF RFC 6716 — "Definition of the Opus Audio Codec" (September 2012). The default WebRTC audio codec that carries each participant's microphone audio into every ASR engine in this playbook. Official standard (final RFC). https://www.rfc-editor.org/rfc/rfc6716
Regulation (EU) 2024/1689 (EU AI Act), Article 50 — Transparency obligations. The duty to inform people when they interact with or are processed by certain AI systems — the legal basis for disclosing an AI assistant before it engages. Official standard (EU regulation). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
Regulation (EU) 2016/679 (GDPR). Lawful basis and informed-consent requirements for capturing voice and other personal data — the European baseline a visible bot does not by itself satisfy. Official standard (EU regulation). https://eur-lex.europa.eu/eli/reg/2016/679/oj
LiveKit — Agents framework documentation and project site. Open-source WebRTC stack used for real-time audio in products including ChatGPT voice; agents framework lets a Python/Node program join a call as a full participant; cascade (STT→LLM→TTS) vs native speech-to-speech, ~700ms–1.2s end-to-end targets. Tier 4 (production deployer). https://docs.livekit.io/agents/
Zoom — "AI Companion: Meeting Summary" product and support documentation. Native-platform assistant: real-time transcription and translation across 30+ languages, summaries, included with paid plans, "My notes" extending to third-party platforms. Tier 4. https://www.zoom.com/en/products/ai-assistant/
Google Workspace — "AI for Meetings & Video Conferencing" and Google Meet "take notes for me." Gemini-powered note-taking into a Google Doc; real-time translated captions across 60+ languages; current transcription-language limits. Tier 4. https://workspace.google.com/resources/ai-for-meetings/
RecordingLaw — "AI Meeting Recording Laws by State (2026)." One-party consent in 39 states + DC; all-party consent in 11 states (incl. California, Illinois, Pennsylvania); voiceprint biometric exposure (Illinois BIPA); Brewer v. Otter.ai (CA, 2025) and Cruz v. Fireflies.AI (IL, 2025). Tier 7 (legal-press orientation; confirm with counsel). https://www.recordinglaw.com/us-laws/ai-meeting-recording-laws/
ZEGOCLOUD — "AI Video Conferencing Explained" and Krisp product documentation. Noise-suppression performance (10–20 dB reduction; scenario-based removal of up to ~80% of background noise) and the menu of real-time conferencing AI features. Tier 4/7. https://www.zegocloud.com/blog/ai-video-conferencing