The Shape of AI Inside a Video Product

Why this matters

If you are a product manager, founder, or operations lead at a company that builds video software, you have already been asked this question and you will be asked it again: "Can we add AI to our video product?" The honest answer depends entirely on which layer of the product the AI is supposed to live in, and the layers do not behave alike. A background blur that ships in 80 lines of WebGPU on the user's laptop is a two-week project; a server-side video understanding model that watches every meeting in real time is a two-quarter project; an offline content moderator that processes archives nightly is a one-month project. Until you can see the shape of the product, you cannot tell the engineering team where the AI goes, and you cannot tell the finance team what it costs. This article is the map.

Why "where" matters more than "what"

When someone says "let's add AI to our video product", their next sentence is usually a model name: Whisper, YOLO, Sora, Gemini. That order is backwards. The model is the easy choice; in 2026 there are good open-weights models or commercial APIs for almost every video AI task. The hard choice is where in your pipeline the model runs.

Three knobs differ at every layer of a video product, and a model that is correct in one layer is wrong in the next:

The latency knob: how many milliseconds you have to return a result. A captioning model that has a few hundred milliseconds in a live call cannot be the same model that has 4 hours to caption a recording.

The cost knob: what each invocation costs per minute of video. A model that runs on the user's device costs you nothing per minute; the same model called as an API at 30 frames per second costs roughly $1.80 per minute per stream at a typical $0.001 per call.

The privacy knob: whether the frames ever leave the user's machine. A model that runs on-device is invisible to your cloud bill and to your legal team; the same model running in the cloud opens a new privacy review, a new data-residency conversation, and a new line in the EU AI Act audit log.

These three knobs are not independent. Lower latency forces the model closer to the user, which usually forces a smaller model, which usually forces a quality drop. Lower cost forces batching, which forces higher latency. Higher privacy forces on-device, which forces a smaller model again. Every video AI feature is a point in this three-dimensional space, and most of the work of an experienced AI-for-video engineer is choosing the right point.

Before you can choose the point, you need the map.

The five hook points

A video product can be drawn as a pipeline with five places where AI can attach. Each place has its own physics. Below is the map; the rest of this article walks through it.

A horizontal pipeline diagram showing five AI hook points along the video flow, from camera capture on the left through real-time transport, server-side processing, near-real-time post-processing, and long-form analytics on the right. Each hook is labeled with its latency budget and a typical AI capability. Figure 1. The five hook points where AI attaches to a video product. Latency budget tightens as you move left toward the camera and loosens as you move right toward archive analytics.

Hook 1 — Capture (the device)

The capture hook lives on the user's phone, laptop, or camera, before a single bit has left the building. The latency budget is one frame (16 ms at 60 frames per second, 33 ms at 30) because the result has to arrive before the next frame is captured. The deployment topology is on-device: WebGPU, Core ML, TensorFlow Lite, NVIDIA Broadcast SDK, MediaPipe, or a custom Metal/Vulkan shader.
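
The one-frame budget follows directly from the frame rate. As a minimal sketch, the check below computes the interval between frames and asks whether a model's per-frame time fits inside it. The function names and the 50% headroom default are illustrative assumptions, not figures from any SDK.

```python
def frame_budget_ms(fps: float) -> float:
    """Interval between consecutive frames, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def fits_capture_hook(model_ms: float, fps: float, headroom: float = 0.5) -> bool:
    """True if the model uses at most `headroom` of the frame interval.

    The headroom fraction is an assumption: it leaves time for capture,
    compositing, and encoding, which share the same frame interval.
    """
    return model_ms <= frame_budget_ms(fps) * headroom


for fps in (60, 30):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")

# A 3 ms segmentation pass against a 30 fps camera.
print(fits_capture_hook(3.0, 30))
```

The unrounded intervals are 16.7 ms and 33.3 ms, which the text quotes as 16 and 33.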

What lives here: background blur, virtual backgrounds, beauty filters, gaze correction, on-device denoising, on-device super-resolution, on-device anomaly pre-filtering. The shared property of every capture-hook feature is that the user sees it instantly — there is no round trip to interrupt the cognitive expectation that a camera shows the world right now.

A typical small model at this hook is MediaPipe Selfie Segmentation v2 (running at ~3 ms per 720p frame on an M-class laptop GPU) doing background blur. The W3C-standardised primitive that lets the model see the frames is MediaStreamTrack Insertable Media Processing using Streams — a browser API that exposes raw VideoFrame objects to JavaScript / WebAssembly / WebGPU before they reach the encoder.

The cost of a capture-hook feature in your cloud bill is zero. The cost in engineering complexity is higher than it looks: you ship the model to every supported platform (iOS, Android, Chrome, Safari, Edge, Electron), you keep it small enough to run on the worst hardware in your install base, and you fall back gracefully when the GPU is busy. Capture-hook features are cheap to operate and expensive to launch.
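
One way to picture "fall back gracefully when the GPU is busy" is a wrapper that measures each effect call and passes frames through untouched for a while after an overrun. This is a sketch under stated assumptions: the class name, the cooldown policy, and the injectable clock are all hypothetical, and a real capture pipeline would do this inside its platform's frame callback.

```python
import time
from typing import Any, Callable


class BudgetedEffect:
    """Runs a per-frame effect, bypassing it for a cooldown after an overrun."""

    def __init__(
        self,
        effect: Callable[[Any], Any],
        budget_ms: float,
        cooldown_frames: int = 30,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._effect = effect
        self._budget_ms = budget_ms
        self._cooldown_frames = cooldown_frames
        self._clock = clock
        self._frames_to_skip = 0

    def process(self, frame: Any) -> Any:
        # During a cooldown the raw frame goes through untouched.
        if self._frames_to_skip > 0:
            self._frames_to_skip -= 1
            return frame
        start = self._clock()
        result = self._effect(frame)
        elapsed_ms = (self._clock() - start) * 1000.0
        # The late result is still returned; only later frames are bypassed.
        if elapsed_ms > self._budget_ms:
            self._frames_to_skip = self._cooldown_frames
        return result
```

A caller would construct it once, for example `BudgetedEffect(blur, budget_ms=33.0)` with `blur` standing in for the platform's effect function, and call `process` on every captured frame.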

Hook 2 — Real-time transport (in the call)

The transport hook lives between the camera and the screen, while the media is in flight. The latency budget is 100 to 200 ms end-to-end for a real-time conversation. The deployment topology is usually server-side at a Selective Forwarding Unit (SFU) or an AI worker that joins the call as a participant.

What lives here: live captions, real-time translation, live noise suppression on the server, live content moderation, live speaker diarization, live "join the call" AI agents (think LiveKit Agents joining a meeting as another participant and answering questions).

Two architectural patterns dominate. The first is server-side fan-out: the SFU sends a copy of the audio stream to an AI worker that runs streaming Whisper or Deepgram and broadcasts the resulting captions to every participant. The second is agent-as-participant: the AI joins the call through the same WebRTC stack as a human user, consumes audio and video frames, runs its own model, and emits results as another media track or as a data-channel message.
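
The fan-out pattern can be sketched in a few lines. The example below assumes a hypothetical `transcribe` callable that turns one audio chunk into caption text; it stands in for a streaming ASR client and is not the API of Whisper, Deepgram, or any SFU. Participants are modelled as plain callbacks.

```python
from typing import Callable, Dict

Transcriber = Callable[[bytes], str]       # hypothetical ASR call: chunk -> text
CaptionSink = Callable[[str, str], None]   # receives (speaker_id, caption_text)


class CaptionFanOut:
    """Transcribes each audio chunk once and broadcasts the caption to everyone."""

    def __init__(self, transcribe: Transcriber) -> None:
        self._transcribe = transcribe
        self._sinks: Dict[str, CaptionSink] = {}

    def join(self, participant_id: str, sink: CaptionSink) -> None:
        self._sinks[participant_id] = sink

    def leave(self, participant_id: str) -> None:
        self._sinks.pop(participant_id, None)

    def on_audio(self, speaker_id: str, chunk: bytes) -> None:
        text = self._transcribe(chunk)
        if not text:
            return  # silence or an empty partial: nothing to broadcast
        # Copy the sinks so a participant can leave during the broadcast.
        for sink in list(self._sinks.values()):
            sink(speaker_id, text)
```

The property worth noticing is that the model runs once per chunk no matter how many participants are subscribed, which is what makes this pattern cheaper than per-viewer transcription.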

The latency budget at this hook is the cruel one. A streaming automatic speech recognition (ASR) model has roughly 300 ms to emit a partial result and roughly 1 second to commit a final one before a viewer notices the lag. Anything slower and the live caption stops feeling live. This is where most AI-for-video integration projects either succeed or burn the budget.

Hook 3 — Server-side processing (during the session)

The server-side hook is similar to the transport hook but with a relaxed latency budget — seconds rather than hundreds of milliseconds. The deployment topology is usually a stateful AI worker sitting next to the SFU or the ingest server.

What lives here: action-item extraction from the running transcript, mid-call summary, real-time analytics dashboards, in-call sentiment detection, in-call slide-content OCR, in-call object detection for surveillance, in-call anomaly detection. Hook 3 features share one property: the call is still happening, but the user is fine with a 3-second delay.

The same model can sit at hook 2 or hook 3 depending on how fast you need the result. A Whisper streaming model at hook 2 emits partial captions; a Whisper batched model at hook 3 runs once per minute on the latest 60 seconds and emits a higher-quality cleaned-up transcript. The product decides which one matters; you can ship both.
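
The "once per minute on the latest 60 seconds" behaviour is a rolling window. A minimal sketch, assuming chunks arrive with a capture timestamp in seconds; the class name and the string chunks are placeholders for whatever audio buffers the worker actually holds.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class RollingWindow:
    """Keeps the latest `window_s` seconds and releases a batch every `every_s`."""

    def __init__(self, window_s: float = 60.0, every_s: float = 60.0) -> None:
        self._window_s = window_s
        self._every_s = every_s
        self._chunks: Deque[Tuple[float, str]] = deque()
        self._last_run: Optional[float] = None

    def add(self, t: float, chunk: str) -> Optional[List[str]]:
        """Store a chunk captured at time `t`; return a batch when one is due."""
        self._chunks.append((t, chunk))
        # Drop everything that has fallen out of the window.
        while self._chunks and self._chunks[0][0] <= t - self._window_s:
            self._chunks.popleft()
        if self._last_run is None:
            self._last_run = t  # the first chunk starts the timer
            return None
        if t - self._last_run >= self._every_s:
            self._last_run = t
            return [c for _, c in self._chunks]
        return None
```

Each returned batch would be handed to the batched model, while the hook 2 streaming path keeps consuming the same chunks independently.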

Hook 4 — Near-real-time post-processing (just after the session)

The post-processing hook fires when the session ends. The latency budget is one to ten minutes. The deployment topology is a queue plus a worker fleet: the recording lands in object storage, a job pops off a queue, a worker downloads the file, runs the model, and writes the results back.
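
The queue-plus-workers topology can be sketched with the standard library alone. In this illustration a dictionary stands in for object storage, `model` is a hypothetical callable that turns recording bytes into a result, and the workers exit once the queue is drained; a production fleet would block on a durable queue instead.

```python
import queue
import threading
from typing import Callable, Dict


def run_post_processing(
    storage: Dict[str, bytes],
    jobs: "queue.Queue[str]",
    model: Callable[[bytes], str],
    workers: int = 4,
) -> Dict[str, str]:
    """Drain `jobs` with a small worker pool and return the written results."""
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                key = jobs.get_nowait()   # pop one job off the queue
            except queue.Empty:
                return                    # nothing left: this worker stops
            try:
                recording = storage[key]  # "download" the recording
                output = model(recording)
                with lock:                # "write the results back"
                    results[key + ".summary"] = output
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because jobs are independent, throughput scales with the worker count, which is what lets this hook run on batched, interruptible capacity.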

What lives here: meeting summaries, chapter markers, action items extracted from the full transcript, automatic highlight reels, lip-sync correction for AI-dubbed audio, archive-grade noise removal, B-roll generation, full-fidelity content moderation. The user expects the result to be ready by the time they go back to the meeting page, but does not need it before they hang up.

This is the cheapest hook to operate per minute of video. The work is batched, the GPU utilisation is high, the requests are not bursty, and you can use spot instances. It is also the hook where the highest-quality models live, because none of the latency constraints from hooks 1–3 apply. The flagship example in 2026 is the meeting-notes pipeline used by every meeting-bot product: Otter, Fireflies, Fathom, Supernormal, and the platform-native equivalents from Zoom and Google Meet.

Hook 5 — Long-form analytics (over the archive)

The analytics hook runs over your entire library — every recorded call, every uploaded video, every archived stream — to build search indexes, train models, generate aggregate insights, or moderate content at the corpus level. The latency budget is hours to days. The deployment topology is a periodic batch job: a Spark or Beam pipeline, a vector database, a multimodal foundation model, and a lot of S3.

What lives here: multimodal video search ("find the moment in last quarter's earnings call when the CFO mentioned AV1"), video retrieval-augmented generation over an archive, content-similarity recommendations, brand-safety scoring across an entire VOD catalogue, training-data curation for a custom model. Netflix's Media Data Lake — the company-wide system built on LanceDB and a unified foundation model — is the canonical large-scale example of this hook in 2026 production.

The cost shape here is the opposite of hook 2's. Hook 2 spends money continuously and only on the active streams. Hook 5 spends money in periodic spikes (a corpus re-index every six months can cost more than three months of streaming AI) and then sits idle.

What lives at each hook — the capability map

The five hooks map cleanly onto the most common AI capabilities a video product asks for. The table below is the version of the map you can pin above a product backlog.

Capability	Primary hook	Why it sits there	Latency budget
Background blur	Hook 1 (capture)	Result must arrive before the next frame; privacy expectation	16–33 ms
Noise suppression	Hook 1 or 2	Often on-device for privacy; sometimes server-side for archive	10–30 ms (capture) or 100 ms (transport)
Live captions	Hook 2 (transport)	Streaming ASR with partial results; can fan out to all viewers	300 ms partial, 1 s final
Real-time translation	Hook 2 (transport)	Same shape as captions, plus a translation model on top	800 ms partial
Speaker diarization	Hook 2 (live) or Hook 4 (recording)	Live: imperfect, useful; offline: clean, definitive	1 s live, minutes offline
In-call agent ("join my meeting")	Hook 2 (transport)	Agent consumes media as a participant	200–500 ms per turn
In-call action items	Hook 3 (server-side)	Tolerates a few seconds of delay	3–10 s
Meeting summary	Hook 4 (post-processing)	High-quality summary needs full transcript	30 s – 5 min
Chapter markers	Hook 4 (post-processing)	Same as summary	30 s – 5 min
AI-generated B-roll for OTT post-production	Hook 4 or Hook 5	Generative video models, several seconds per shot	minutes – hours
Video search across archive	Hook 5 (analytics)	Multimodal embedding pipeline + vector index	hours to refresh
Content moderation across VOD catalogue	Hook 5 (analytics)	Foundation-model pass; corpus-level	hours to days
Anomaly detection in surveillance	Hook 1 (pre-filter) + Hook 3 (confirm) + Hook 5 (audit)	Tiered architecture; cheap edge, smart server, complete archive	tiered

The table is the practical artefact of this article. When a new feature spec lands in your inbox, you find its row, you read the hook, and you immediately know which engineering team picks it up, what the operations cost will be, and what privacy questions need answering.
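
For teams that prefer the map in a form a script can query, the table can be transcribed into a small lookup. This is only a sketch: the hooks and budgets are copied from the table above, while the `Placement` type and the `capabilities_at` helper are illustrative names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Placement(NamedTuple):
    hooks: Tuple[int, ...]   # hook numbers from the table
    latency_budget: str      # budget text from the table


CAPABILITY_MAP: Dict[str, Placement] = {
    "background blur": Placement((1,), "16–33 ms"),
    "noise suppression": Placement((1, 2), "10–30 ms (capture) or 100 ms (transport)"),
    "live captions": Placement((2,), "300 ms partial, 1 s final"),
    "real-time translation": Placement((2,), "800 ms partial"),
    "speaker diarization": Placement((2, 4), "1 s live, minutes offline"),
    "in-call agent": Placement((2,), "200–500 ms per turn"),
    "in-call action items": Placement((3,), "3–10 s"),
    "meeting summary": Placement((4,), "30 s – 5 min"),
    "chapter markers": Placement((4,), "30 s – 5 min"),
    "ai-generated b-roll": Placement((4, 5), "minutes – hours"),
    "video search across archive": Placement((5,), "hours to refresh"),
    "content moderation across vod catalogue": Placement((5,), "hours to days"),
    "anomaly detection in surveillance": Placement((1, 3, 5), "tiered"),
}


def capabilities_at(hook: int) -> List[str]:
    """All capabilities whose row lists the given hook, in alphabetical order."""
    return sorted(name for name, p in CAPABILITY_MAP.items() if hook in p.hooks)


print(capabilities_at(3))
```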

Real-time vs batch AI, in the same product

The single most common mistake in AI-for-video integration is treating real-time and batch AI as if they were the same engineering problem. They are not. They share a model name and very little else.

A real-time AI capability has three properties that a batch one does not. It must commit to a partial answer before the input is complete (a streaming ASR cannot wait for the end of the sentence; it must emit partial words as it goes). It must operate inside a fixed latency budget regardless of model load (a batch job can queue; a live caption cannot). And its failure mode is visible to the user immediately (a 5-second freeze in the caption stream is a customer-support ticket; a 5-minute delay in a meeting summary is invisible).

A batch AI capability gets to relax all three. It can wait for the full input. It can queue. Its failure mode is "the result arrives later than the operator expected" — annoying, not fatal.

Two consequences flow from this difference. First, the same model is configured differently on each side: streaming Whisper at hook 2 versus batched Whisper-Large-v3 at hook 4 use different code paths, different chunking, different beam search, and different quality. Second, the same product almost always wants both: a live caption during the call (hook 2, fast and rough) plus a clean transcript after the call (hook 4, slow and exact). Build both, and route the user from one to the other at the moment the call ends. Trying to use the live result as the archive result is the bug that ships the most often.
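
The hand-off from the live result to the archive result can be made explicit in the data model. A minimal sketch, assuming a hypothetical `CallTranscripts` record that stores the two transcripts separately so the rough live text is never promoted to the archive by accident.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallTranscripts:
    live: str = ""                 # hook 2: fast and rough, grows during the call
    clean: Optional[str] = None    # hook 4: slow and exact, set once after the call
    call_ended: bool = False

    def append_live(self, text: str) -> None:
        if self.call_ended:
            raise RuntimeError("live captions cannot arrive after the call ends")
        self.live += text

    def end_call(self) -> None:
        self.call_ended = True

    def attach_clean(self, text: str) -> None:
        if not self.call_ended:
            raise RuntimeError("the clean transcript needs the full recording")
        self.clean = text

    def transcript_to_show(self) -> str:
        """Prefer the clean transcript once it exists; otherwise show the live one."""
        if self.call_ended and self.clean is not None:
            return self.clean
        return self.live
```

Between the end of the call and the arrival of the hook 4 result, the user still sees the live text, so the page is never empty.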

A two-column comparison illustration showing real-time AI on the left side with latency budgets in milliseconds, partial outputs streaming, and a fixed deadline marker; and batch AI on the right side with latency budgets in minutes, complete outputs, and a queue. Figure 2. The same AI model sits at different hooks depending on whether the product wants the answer right now or later. Real-time and batch versions are different engineering deliverables, not different configurations of one.

A worked numeric example — background blur at three layers

Pick one feature and walk it through three hooks. Background blur is the easiest example because every video product wants it and the math is uncluttered.

Suppose the product runs at 720p, 30 frames per second, in a 1-on-1 call. We compute the cost and latency of background blur at three different hook points.

Option A: On device (hook 1). MediaPipe Selfie Segmentation v2 with WebGPU runs at approximately 3 ms per frame on an M2 laptop GPU. Per second of video, the math is:

latency_per_frame = 3 ms
frames_per_second = 30
GPU_time_per_second_of_video = 3 ms × 30 = 90 ms (≈ 9% GPU utilisation)
cost_to_you_per_minute = $0 (the user's hardware does the work)
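
The same arithmetic as a small script, in case you want to plug in your own measurements. The function name is illustrative; the 3 ms and 30 fps inputs are the figures used above.

```python
def gpu_utilisation(ms_per_frame: float, fps: float) -> float:
    """Fraction of each second the GPU spends on the model."""
    return ms_per_frame * fps / 1000.0


ms_per_frame = 3.0
fps = 30.0

busy_ms_per_second = ms_per_frame * fps
utilisation = gpu_utilisation(ms_per_frame, fps)

print(f"GPU time per second of video: {busy_ms_per_second:.0f} ms")
print(f"GPU utilisation: {utilisation:.0%}")
```

If the measured per-frame time on your slowest supported device pushes utilisation close to 100%, the model no longer fits at hook 1 on that device.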

Option B: Server-side, per-stream (hook 2). The same model runs in a cloud GPU worker that pulls the raw frames out of the SFU, blurs them, and pushes them back. A modest L4 GPU rented at $0.65/hour costs roughly $0.011 per minute. Assume one stream uses the same 9% of one GPU at 720p30, so per stream:

GPU_cost_per_minute = $0.011 × 0.09 = $0.00099
extra_latency = ~80–120 ms added to the one-way media path (each frame detours through the GPU worker)
total_extra_latency ≈ 100 ms
total_cost_per_minute_per_stream = $0.001

That looks cheap until you multiply by users. A product with 100,000 simultaneous calls (200,000 streams) costs $200/minute or $12,000/hour to blur backgrounds server-side, every minute of every call. The same product running blur on-device costs $0.
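
The scaling step is easy to reproduce. The sketch below uses the rental price and the 9% GPU share from the text; the constant and function names are illustrative, and the result ignores networking, idle capacity, and autoscaling overhead.

```python
GPU_PRICE_PER_HOUR = 0.65     # USD, the L4 rental price used in the text
GPU_SHARE_PER_STREAM = 0.09   # fraction of one GPU per 720p30 stream


def server_blur_cost_per_minute(streams: int) -> float:
    """GPU rental cost in USD per minute for blurring `streams` streams."""
    gpu_cost_per_minute = GPU_PRICE_PER_HOUR / 60.0
    return streams * gpu_cost_per_minute * GPU_SHARE_PER_STREAM


per_stream = server_blur_cost_per_minute(1)
fleet_per_minute = server_blur_cost_per_minute(200_000)

print(f"per stream: ${per_stream:.6f} per minute")
print(f"200,000 streams: ${fleet_per_minute:,.0f} per minute, "
      f"${fleet_per_minute * 60:,.0f} per hour")
```

Without intermediate rounding the fleet comes to about $195 per minute, or about $11,700 per hour; the $200 and $12,000 figures in the text come from rounding the per-stream cost up to $0.001.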

Option C: Generative API per frame (hook 2, but a remote-API path). Suppose someone proposes calling a remote inpainting API per frame. At 30 frames per second and a typical $0.001 per call, the cost is $1.80/minute/stream. That is roughly three orders of magnitude worse than Option B, infinitely worse than the free Option A, and unusable at any scale. The architectural mistake here is treating background blur as a creative generation problem when it is a segmentation problem. The shape of the model decides the cost.
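
The per-frame API cost, and its distance from the server-side option, can be checked in a few lines. The function name is illustrative, the $0.001 price per call is the assumption from the paragraph above, and the Option B figure is the rounded $0.001 per minute per stream.

```python
import math


def per_frame_api_cost_per_minute(fps: float, price_per_call: float) -> float:
    """Cost in USD per minute per stream when every frame is one API call."""
    return fps * 60.0 * price_per_call


option_b_per_minute = 0.001   # rounded server-side figure, USD per minute per stream
option_c_per_minute = per_frame_api_cost_per_minute(fps=30, price_per_call=0.001)

ratio = option_c_per_minute / option_b_per_minute
print(f"Option C: ${option_c_per_minute:.2f} per minute per stream")
print(f"Option C / Option B: about {ratio:,.0f}x "
      f"({math.log10(ratio):.1f} orders of magnitude)")
```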

The point of the example is not the numbers (which shift quarterly) but the spread. The same feature, depending on hook choice, costs you nothing or twelve thousand dollars an hour. The choice is architectural, not budgetary, and you make it before the engineer opens an IDE.

Common mistake — putting the AI in the wrong layer

The most common AI-for-video integration mistake is not picking the wrong model. It is picking the right model but placing it at the wrong hook.

Three patterns repeat:

The first is server-side blur — running background blur at hook 2 when hook 1 was the right answer. The product team picked a server-side model because "the AI team works on the server" and the operations bill arrives a quarter later, when the company is already locked into the architecture.

The second is live summarisation — trying to summarise a meeting at hook 2 or hook 3 when hook 4 was the right answer. The live summary is always worse than the post-call one (because the input is partial), and the user experience suffers in proportion. The fix is almost always to push summarisation to hook 4 and let the live stream carry only the captions and the action-item bullets.

The third is archive-scale moderation at hook 4 — running a per-recording content-moderation pass at hook 4 when hook 5 was the right answer. A nightly batch over the corpus is cheaper, faster to iterate on, and gives the moderation team a global view that per-recording analysis cannot. The fix is to keep hook 4 for the speed-of-thought moderation that has to happen before the recording goes public, and use hook 5 for the corpus-level audit.

If you find yourself negotiating engineering effort with operations cost on every quarterly review, the diagnosis is almost always one of these three. The remedy is to redraw the map.

Where Fora Soft fits in

Fora Soft has built video products since 2005 — video conferencing, OTT and Internet TV, live streaming, surveillance, e-learning, telemedicine, and AR/VR — and we have done AI integration on all five hooks. In the conferencing space we ship on-device background blur and AI noise suppression at hook 1, live captions and AI translation at hook 2, action-item extraction at hook 3, and meeting summarisation at hook 4. In OTT, hook 4 powers our chapter-marker pipelines and hook 5 powers archive search. In surveillance, anomaly detection runs as a tiered architecture across hooks 1, 3, and 5. The map in this article is the map we draw on the whiteboard at the start of every AI engagement.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your AI integration plan.
See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the AI Hook Map for Video Products: a one-page reference with the five hook points, their latency budgets, and the AI capabilities that belong to each.

References

W3C, MediaStreamTrack Insertable Media Processing using Streams — the browser-standard primitive that exposes raw VideoFrame objects for capture-hook processing. https://www.w3.org/TR/mediacapture-transform/
W3C, WebRTC Encoded Transform — the spec that lets JavaScript intercept encoded frames between the encoder and the network, used by some hook-1 and hook-2 patterns. https://www.w3.org/TR/webrtc-encoded-transform/
IETF RFC 8825, Overview: Real-Time Protocols for Browser-Based Applications (WebRTC), January 2021 — the protocol context for hook 2. https://www.rfc-editor.org/rfc/rfc8825
Netflix Tech Blog, Foundation Model for Personalized Recommendation and the Media Data Lake / LanceDB writeups (2024–2026) — canonical hook-5 architecture at scale.
Google Developers, MediaPipe Selfie Segmentation model card — reference implementation for hook-1 background blur. https://ai.google.dev/edge/mediapipe/solutions/vision/image_segmenter
NVIDIA Maxine SDK documentation — production-grade reference for hook-1 and hook-2 noise suppression and video effects. https://developer.nvidia.com/maxine
Atlassian / Conviva / Bitmovin 2026 Video Developer Reports — adoption telemetry for AI features in video products.
EU AI Act (Regulation 2024/1689), Articles 50 and 52 — the regulatory frame for biometric and real-time AI in video products. Official journal text: https://eur-lex.europa.eu/eli/reg/2024/1689/oj
Krisp Technologies, "Background noise removal" technical documentation — production hook-1 / hook-2 noise suppression case study.
LiveKit Agents documentation — production reference architecture for hook-2 "AI as participant" agents. https://docs.livekit.io/agents/