WebRTC + AI: Insertable Streams, Encoded Transform, WebGPU, And The Video SDK Comparison

Why this matters

If you are a product manager, founder, or engineering lead choosing how to add video to your product — or choosing whether to add AI features like background blur, live captions, or an in-call agent on top of it — the video SDK decision quietly sets the ceiling on everything you can build later. Pick a closed managed platform and you ship a working call in a week, but you may discover six months on that you cannot insert your own model into the media path. Pick an open framework and you get full control at the cost of more engineering. This article gives you the standards that govern where AI can live in a WebRTC call, and a priced, AI-readiness-aware comparison of the major SDKs, so the build-versus-buy conversation happens before you are locked in, not after.

First, what a "video SDK" actually is

A video SDK is a packaged toolkit — code libraries plus a hosted backend — that lets a developer add live video calling to an app without building the hard parts from scratch. The hard parts are real: capturing camera and microphone, compressing the streams, punching through firewalls, routing media between many participants, recording, and handling the thousand edge cases of bad networks. A video SDK does all of that behind a few function calls.

It helps to separate two terms that get used interchangeably. A video SDK is the client-side library you embed in your web, iOS, or Android app. A video conferencing API is the server-side interface: the hosted rooms, tokens, recording, and webhooks the SDK talks to. In practice the major vendors sell both as one product, so when this article says "video SDK" it means the whole package: the libraries plus the cloud behind them.

Almost all of these products are built on the same foundation: WebRTC. WebRTC, short for Web Real-Time Communication, is the open standard that lets browsers and apps send audio, video, and data to each other directly, with low delay. The World Wide Web Consortium (W3C), the body that standardizes the web, first published WebRTC 1.0 as a full Recommendation in January 2021; the current edition of that Recommendation is dated 13 March 2025. Underneath the browser API sits a family of transport specifications from the Internet Engineering Task Force, RFC 8825 through RFC 8834, that define how the media is secured and moved. The point for a buyer is that every vendor below speaks the same underlying language; what differs is how much of it they expose to you.

The real question: can you get to the frames?

Here is the decision that does not appear on any pricing page but governs whether your product can ever be more than a plain video call. To run an AI feature inside a live call (to blur a background, suppress noise, transcribe speech for live captions, or feed frames to a model) your code has to reach the individual video or audio frames as they flow through the pipeline. Some SDKs hand you that access. Others seal the media path shut and only give you a finished video element on the screen.

Think of the media pipeline as an assembly line. The camera produces raw frames at one end; the network ships them; the far side reassembles and paints them on a screen. An AI feature is a worker you want to insert at a specific station on that line. The question is whether the SDK lets you stand at the line at all, and if so, at which station.

The browser itself now defines exactly two stations where your code is allowed to stand. Understanding them is the whole game.

[Figure: the WebRTC media pipeline from camera capture through encode, packetize, network, depacketize, decode, and render, with two labeled AI insertion points: the raw-frame point (before encode and after decode), where WebGPU on-device models run, and the encoded-frame point (between encoder and packetizer), where the Encoded Transform API applies encryption and metadata.]

Figure 1. The two standardized places AI can enter a WebRTC call. The raw-frame point handles on-device vision and audio; the encoded-frame point handles encryption and lightweight metadata. They solve different problems.

Insertion point one: raw frames, where on-device AI lives

The first station is the raw-frame point. Here the frame is an actual uncompressed image — a grid of pixels — either just after the camera captures it and before it is compressed, or just after it is decompressed on the receiving side and before it is drawn on screen. This is where vision AI belongs, because a model that segments a person from a background or detects a face needs to see real pixels, not compressed data.

The browser exposes this through a W3C specification called MediaStreamTrack Insertable Media Processing using Streams, usually shortened to "the insertable media processing" or "mediacapture-transform" API. It hands your code each camera frame as a VideoFrame object you can read, modify, and pass on. Paired with it is WebCodecs, another W3C specification still progressing through the standards process, which gives low-level access to the encoder and decoder so you can compress or decompress frames yourself when you need to.

The catch with raw-frame processing is speed. At 30 frames per second a new frame arrives roughly every 33 milliseconds (one second divided by 30 frames is 0.033 seconds), and your AI worker has to finish before the next one shows up or the video stutters. A model that takes 40 milliseconds does not just add delay: processing frames one at a time, it can sustain at most 25 frames per second, so at least one frame in six has to be dropped, and a pipeline that discards every frame arriving while the worker is busy loses every other frame. So raw-frame AI has to be fast, and that is where the graphics chip comes in.
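The frame-budget arithmetic can be sketched in a few lines. This is a minimal illustration, assuming frames are processed strictly one at a time and no backlog is allowed to build up; the function names are hypothetical and belong to no SDK.

```typescript
/** Milliseconds available to process one frame at a given capture rate. */
function frameBudgetMs(fps: number): number {
  return 1000 / fps;
}

/** Highest frame rate a strictly serial worker can sustain. */
function sustainableFps(processingMs: number): number {
  return 1000 / processingMs;
}

/**
 * Smallest fraction of captured frames that must be dropped when frames
 * are processed one at a time and the queue is not allowed to grow.
 */
function minDropFraction(captureFps: number, processingMs: number): number {
  const sustained = Math.min(captureFps, sustainableFps(processingMs));
  return 1 - sustained / captureFps;
}

const budget = frameBudgetMs(30);              // about 33.3 ms per frame
const dropFraction = minDropFraction(30, 40);  // about 0.167, roughly one frame in six
console.log(budget.toFixed(1), dropFraction.toFixed(3)); // prints "33.3 0.167"
```

A model that fits inside the budget, for example `minDropFraction(30, 3)`, yields zero: nothing has to be dropped.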

WebGPU: the engine that makes on-device AI fast enough

A modern AI model runs far faster on a graphics processing unit, or GPU — the chip originally built to draw game graphics — than on the general-purpose CPU, because a GPU does thousands of small calculations at once. Until recently, browsers had no clean way to use the GPU for general computation. That changed with WebGPU, a W3C standard that gives web code direct, modern access to the GPU for both graphics and raw number-crunching.

WebGPU's importance to this article is timing. As of late 2025 it reached what the web platform community calls "Baseline" status, meaning the major browsers ship it on by default. Chrome has had it since version 113 in 2023; Firefox enabled it on Windows in version 141 in July 2025 and on Apple Silicon Macs in version 145; and Safari shipped it in Safari 26 across macOS, iOS, and iPadOS. Coverage still varies by operating system and hardware, as the staged Firefox rollout shows, so production code should feature-detect WebGPU and keep a fallback. Even so, for the first time a developer can expect a fast on-device AI path in current browsers without a plugin or a flag.
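Because support arrived at different times on different platforms, a cautious client checks for WebGPU at runtime rather than assuming it. The sketch below relies only on `navigator.gpu` and its `requestAdapter()` method, which resolves to `null` when no suitable GPU is available. The `NavigatorLike` shape, the `Backend` labels, and the function names are assumptions made for illustration.

```typescript
// Local stand-ins for the parts of `navigator` this sketch touches, declared
// here so the snippet does not depend on any WebGPU type package.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical labels: "cpu-fallback" stands for whatever slower path the app keeps.
type Backend = "webgpu" | "cpu-fallback";

/** Synchronous check: does the WebGPU entry point exist at all? */
function hasWebGpuEntryPoint(nav: NavigatorLike | undefined): boolean {
  return nav !== undefined && nav.gpu !== undefined;
}

/** Full check: the entry point exists and the browser can hand back an adapter. */
async function chooseBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!hasWebGpuEntryPoint(nav)) return "cpu-fallback";
  try {
    const adapter = await nav!.gpu!.requestAdapter();
    return adapter ? "webgpu" : "cpu-fallback";
  } catch {
    return "cpu-fallback"; // treat any failure as "no usable GPU"
  }
}

// In a browser: chooseBackend(navigator as NavigatorLike).then(console.log);
```

Keeping the fallback explicit means the feature degrades to a slower path instead of failing on a device without a usable adapter.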

Put the pieces together and a clear capability emerges. The raw-frame API gives your code each camera frame; WebGPU runs a model on it in a couple of milliseconds; the modified frame goes back into the pipeline before encoding. This is exactly how production background blur and segmentation work — Google's MediaPipe Selfie Segmentation model finishes a browser frame in under three milliseconds on the GPU, comfortably inside the 33-millisecond budget. We cover that specific feature in depth in the background-blur engineering deep-dive; here the point is structural: raw frames plus WebGPU is the on-device AI path, and it lives entirely on the user's machine with zero network cost.
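The shape of that raw-frame pipeline can be sketched as a stream transformer. This is a simplified illustration, not any vendor's implementation: `process` stands in for the model step, and the browser wiring is shown only in comments because `MediaStreamTrackProcessor` and `MediaStreamTrackGenerator` are Chromium-style names whose availability varies by browser. The names `blurBackground` and `cameraTrack` are hypothetical.

```typescript
// A frame holds media memory and must be closed when no longer needed.
// The browser's VideoFrame has this method.
interface ClosableFrame {
  close(): void;
}
interface FrameSink<F> {
  enqueue(frame: F): void;
}

/**
 * Builds a transformer object that can be passed to `new TransformStream(...)`.
 * `process` is a placeholder for the model step (segmentation, blur, and so on).
 */
function makeFrameTransformer<F extends ClosableFrame>(process: (frame: F) => F) {
  return {
    transform(frame: F, controller: FrameSink<F>): void {
      const output = process(frame);
      if (output !== frame) frame.close(); // release the input if a new frame replaced it
      controller.enqueue(output);          // hand the result to the next pipeline stage
    },
  };
}

// Browser wiring, shown as comments because these APIs exist only in a browser:
//   const processor = new MediaStreamTrackProcessor({ track: cameraTrack });
//   const generator = new MediaStreamTrackGenerator({ kind: "video" });
//   processor.readable
//     .pipeThrough(new TransformStream(makeFrameTransformer(blurBackground)))
//     .pipeTo(generator.writable);
```

Closing the input frame matters in practice: the browser hands out a limited pool of frames, and a pipeline that never releases them tends to stall capture.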

Insertion point two: encoded frames, where encryption lives

The second station is the encoded-frame point, and it sits in a completely different place: after the video is compressed, between the encoder and the part of the pipeline that splits the compressed data into network packets. Here the frame is no longer an image you can look at — it is a compact block of encoded bytes. You cannot run vision AI on it, because there are no pixels to see. So why would anyone want access here?

Two reasons: encryption and metadata. The browser exposes this point through the WebRTC Encoded Transform specification, a W3C Working Draft on the standards track. Its main API object is called RTCRtpScriptTransform. It replaces an earlier, now-deprecated approach that was called Insertable Streams — if you read older articles or sample code referring to createEncodedStreams, that is the legacy name for the same idea, and new work should use the Encoded Transform API instead. Firefox has implemented it since version 117, and Chrome supports it.
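Since both the standard API and the legacy one are still found in the wild, a client can check which encoded-frame path a browser offers before wiring anything. The sketch below is one possible detection routine; `ScopeLike`, `EncodedPath`, and `pickEncodedPath` are names invented for this example, and the worker file name in the comments is hypothetical.

```typescript
type EncodedPath = "RTCRtpScriptTransform" | "legacy-createEncodedStreams" | "none";

// Local stand-in for the global scope, so the check can run anywhere.
interface ScopeLike {
  RTCRtpScriptTransform?: unknown;
  RTCRtpSender?: { prototype: object };
}

/** Prefers the standard Encoded Transform API, then the legacy method, then nothing. */
function pickEncodedPath(scope: ScopeLike): EncodedPath {
  if (typeof scope.RTCRtpScriptTransform === "function") {
    return "RTCRtpScriptTransform";
  }
  const senderPrototype = scope.RTCRtpSender?.prototype;
  if (senderPrototype && "createEncodedStreams" in senderPrototype) {
    return "legacy-createEncodedStreams";
  }
  return "none";
}

// In a browser: pickEncodedPath(globalThis as ScopeLike)
//
// Standard wiring once detected (browser only):
//   const worker = new Worker("e2ee-worker.js"); // hypothetical file name
//   sender.transform = new RTCRtpScriptTransform(worker, { side: "send" });
// and inside the worker:
//   self.onrtctransform = (event) => {
//     const { readable, writable } = event.transformer;
//     // pipe readable through a TransformStream into writable
//   };
```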

The headline use of the encoded-frame point is end-to-end encryption. In a normal group call, the media server in the middle — called a Selective Forwarding Unit, or SFU, because it forwards each person's video to everyone else — can technically see the media it relays. End-to-end encryption closes that gap by encrypting each frame on the sender's device so that only other participants can decrypt it, leaving the server able to route but not watch. The standard that defines this is IETF RFC 9605, "Secure Frame," published in August 2024, which encrypts whole media frames while leaving the small routing metadata the SFU needs in the clear. The Encoded Transform API is the hook where your code applies that encryption. So the encoded-frame point is not an AI-vision tool; it is a security and signaling tool, and confusing the two is a common and costly mistake.
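To make the idea concrete, the sketch below wraps one encoded frame so that a small header stays readable while the media payload is sealed. It is a toy layout chosen for illustration and is not the RFC 9605 wire format; `Seal` is a placeholder for an authenticated cipher (real code would use something like AES-GCM through WebCrypto), and all names are assumptions.

```typescript
/** Placeholder for an authenticated cipher; NOT a real encryption routine. */
type Seal = (plaintext: Uint8Array, keyId: number, counter: number) => Uint8Array;

/**
 * Wraps one encoded frame as:
 *   [ keyId (1 byte) | counter (4 bytes, big-endian) | sealed payload ]
 * Illustrative layout only; the real SFrame header is defined in RFC 9605.
 */
function protectFrame(payload: Uint8Array, keyId: number, counter: number, seal: Seal): Uint8Array {
  const sealed = seal(payload, keyId, counter);
  const out = new Uint8Array(5 + sealed.length);
  const view = new DataView(out.buffer);
  view.setUint8(0, keyId);
  view.setUint32(1, counter); // DataView writes big-endian by default
  out.set(sealed, 5);
  return out;
}

/** What anything without the key can still read: the clear header, not the media. */
function readClearHeader(frame: Uint8Array): { keyId: number; counter: number } {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  return { keyId: view.getUint8(0), counter: view.getUint32(1) };
}
```

In a real deployment this function would run inside the Encoded Transform worker, once per outgoing frame, with the receiver using the clear header to select the key and then reversing the seal.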

Common pitfall: teams hear "Insertable Streams let you process media" and assume that is where they plug in their background-blur or captioning model. It is not. The encoded-frame point handles compressed bytes and is for encryption and metadata; vision and audio AI must run at the raw-frame point on uncompressed data. Wiring a model to the wrong station means it has no pixels to work with and silently does nothing useful.

A two-column comparison table contrasting the raw-frame insertion point and the encoded-frame insertion point across the criteria: what the frame contains, the W3C API used, typical uses, where it runs, and the latency constraint. The raw-frame column is tinted to mark it as the AI-vision path; the encoded-frame column marks the encryption path. Figure 2. Two insertion points, two jobs. Raw frames carry pixels for AI to see; encoded frames carry compressed bytes for encryption and metadata. Choosing the wrong one is the most common WebRTC-AI mistake.

Why the SDK choice decides everything above

Now the two halves of this article join. Everything in the sections above assumes you can reach the frames. Whether you actually can depends entirely on the video SDK you chose, because each vendor decides how much of the WebRTC pipeline to expose.

Closed managed platforms hand you a polished video experience and, in exchange, keep the media pipeline private. You get a working call quickly, but your AI options are limited to whatever the vendor pre-built — their background blur, their captions — and you cannot insert your own model into the raw-frame path. Open frameworks take the opposite stance: they expose the WebRTC primitives, so you get the raw-frame and encoded-frame insertion points and can run any model you like, at the cost of building more yourself.

This is the build-versus-buy axis, and for an AI-driven video product it matters more than price. A cheaper SDK that seals the pipeline can be the most expensive choice you make if it blocks the feature your product depends on.

The video SDK landscape in 2026

The market has consolidated and shifted since the early WebRTC years, and a few changes are worth knowing before you compare line items.

100ms is a live-video infrastructure platform founded in 2020 by engineers who previously built large-scale streaming at Disney+ Hotstar and Meta. Its pitch is a single SDK that bundles WebRTC conferencing, HLS broadcast, chat, and recording, with prebuilt interface components so a team ships fast. Pricing is pay-as-you-grow at roughly $0.004 per participant-minute for video, with audio-only workloads discounted around 75%, a free monthly tier of about 10,000 minutes, and AI transcription available out of the box. It is a managed platform: fast to production, with the pipeline largely on their cloud.

Agora is one of the oldest pure-play real-time platforms. Its 2026 rates hold at about $0.99 per thousand audio minutes and $3.99 per thousand high-definition video minutes, with 10,000 free minutes a month shared across its real-time products and automatic volume discounts of five to ten percent at higher tiers. It is feature-rich and global, and also a managed platform.

Daily is known for developer experience and a clean pricing model: 10,000 free participant-minutes a month, then a flat $0.004 per participant-minute, with automatic volume discounts. It exposes more low-level control than most managed vendors and has invested heavily in AI-agent tooling.

Twilio Video deserves a clarification, because the internet is full of outdated claims that it is shutting down. Twilio announced in December 2023 that it would retire Programmable Video, then reversed that decision in October 2024: Twilio Video remains a standalone product with no action required from customers, now focused on one-to-one customer-engagement video. If you read a 2024 article declaring its death, that information is stale.

Vonage Video API is the descendant of TokBox OpenTok, a long-standing name in WebRTC. The platform is operational, but the original OpenTok SDK line is in maintenance mode and has been folded into a "Unified" Vonage Video API with new authentication (an Application ID and private key replacing the old API key and secret). Existing projects on OpenTok should plan a migration; treat that as a 2026 task, not an optional one.

Zoom Video SDK lets developers embed Zoom's media engine in their own apps, distinct from the Zoom meetings product. It runs about $0.0035 per participant-minute — $3.50 per thousand — sold through the Zoom Build Platform as credit packs ($100 for 100 credits, $450 for 500), without a true pay-as-you-go mode, which makes it awkward for bursty or seasonal traffic.

Amazon Chime SDK is AWS's building-block approach: pay-per-use with no minimums, audio at $0.0017 per attendee-minute, and video billed by bitrate, from $0.0015 per minute at low quality up to $0.017 per minute above two megabits per second. It integrates naturally with the rest of AWS and is a sensible choice for teams already deep in that ecosystem.

LiveKit is the standout open framework. Its server is open source, so you can self-host and pay only for your own cloud infrastructure, or use LiveKit Cloud with a free Build tier (5,000 WebRTC minutes, 50 gigabytes of egress, 1,000 AI-agent minutes monthly), a $50 Ship tier, and a $500 Scale tier. WebRTC connection time runs roughly $0.0004 to $0.0005 per minute plus bandwidth at $0.10 to $0.12 per gigabyte. Because it exposes the full pipeline and has built a dedicated AI-agents framework, it has become the default choice for teams whose product is the AI feature rather than a plain call. We cover its architecture and pricing in detail in the LiveKit meeting-assistant lesson.

The comparison that matters: price and AI-readiness together

Price alone is misleading, because the SDKs meter differently — some per participant-minute, some per thousand minutes, some by bandwidth — and because the cheapest sealed pipeline can block the feature you need. The table below puts the 2026 numbers next to the one column most comparisons omit: whether you can insert your own AI into the media path.

SDK	2026 price (approx.)	Free tier	Pipeline model	Insert your own AI?
100ms	~$0.004 / participant-min (video)	~10,000 min/mo	Managed	Limited — vendor AI features
Agora	$3.99 / 1k HD video min	10,000 min/mo (shared)	Managed	Limited — extensions framework
Daily	$0.004 / participant-min	10,000 participant-min/mo	Managed, lower-level	Partial — frame access + agents
Twilio Video	Usage-based (1:1 focus)	Trial credit	Managed	Limited
Vonage Video API	Usage-based (Unified)	Trial credit	Managed	Limited
Zoom Video SDK	~$0.0035 / participant-min	Credit-pack trial	Managed	Limited — vendor AI add-ons
Amazon Chime SDK	$0.0017 audio / $0.0015–0.017 video per min	Pay-per-use	Building blocks	Partial — you wire the media
LiveKit	$0.0004–0.0005 / WebRTC min + bandwidth	Build tier (5,000 min)	Open / self-hostable	Full — raw + encoded frames

The pattern is the story. Managed platforms cluster around $0.004 per participant-minute and trade pipeline control for speed of delivery. Building-block and open options (Amazon Chime SDK, LiveKit) can cost less per minute, though Chime's video rate climbs to $0.017 per minute at high bitrates and LiveKit bills bandwidth separately; they ask more of your engineers, and they are the ones that let you reach the frames AI needs. There is no single winner, only a fit for your priorities.
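One way to read the table is to put every listed rate on the same scale first. The sketch below converts the per-minute figures quoted above into dollars per 1,000 minutes. It is a rough normalization under stated assumptions: it treats "participant-minute", "attendee-minute", and "WebRTC minute" as comparable units, which vendors may define differently, and it leaves out anything billed separately. The `ListedRate` type and function name are invented for the example.

```typescript
interface ListedRate {
  vendor: string;
  minUsdPerMinute: number;
  maxUsdPerMinute: number;
  billedSeparately?: string; // charges NOT included in the per-minute figure
}

// Figures copied from the comparison table; treat them as approximate list prices.
const listedRates: ListedRate[] = [
  { vendor: "100ms (video)", minUsdPerMinute: 0.004, maxUsdPerMinute: 0.004 },
  { vendor: "Agora (HD video)", minUsdPerMinute: 3.99 / 1000, maxUsdPerMinute: 3.99 / 1000 },
  { vendor: "Daily", minUsdPerMinute: 0.004, maxUsdPerMinute: 0.004 },
  { vendor: "Zoom Video SDK", minUsdPerMinute: 0.0035, maxUsdPerMinute: 0.0035 },
  { vendor: "Amazon Chime SDK (video)", minUsdPerMinute: 0.0015, maxUsdPerMinute: 0.017 },
  { vendor: "LiveKit Cloud", minUsdPerMinute: 0.0004, maxUsdPerMinute: 0.0005, billedSeparately: "bandwidth" },
];

/** Converts a listed rate to a [low, high] range in USD per 1,000 minutes, rounded to cents. */
function usdPerThousandMinutes(rate: ListedRate): [number, number] {
  const toCents = (usdPerMinute: number) => Math.round(usdPerMinute * 1000 * 100) / 100;
  return [toCents(rate.minUsdPerMinute), toCents(rate.maxUsdPerMinute)];
}

for (const rate of listedRates) {
  const [low, high] = usdPerThousandMinutes(rate);
  const range = low === high ? `$${low.toFixed(2)}` : `$${low.toFixed(2)} to $${high.toFixed(2)}`;
  const note = rate.billedSeparately ? ` (plus ${rate.billedSeparately})` : "";
  console.log(`${rate.vendor}: ${range} per 1,000 min${note}`);
}
```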

A worked example: what 1,000 group calls actually cost

Numbers make the trade concrete. Suppose your product runs 1,000 group calls a month, each with four participants for 30 minutes. The billable unit most vendors use is the participant-minute, so first compute the volume:

1,000 calls
× 4 participants
× 30 minutes
= 120,000 participant-minutes per month

On a managed platform at $0.004 per participant-minute, before any free tier:

120,000 participant-minutes
× $0.004
= $480 per month

Subtract a 10,000-minute free allowance and the bill is $440 (110,000 billable minutes × $0.004). On LiveKit Cloud, the same usage is metered as WebRTC connection minutes at roughly $0.0005 per minute, or about $60 for 120,000 minutes, plus bandwidth. For standard-definition group video that typically lands lower per minute but adds a separate egress charge, so the comparison depends on your video bitrate, not just minutes. The lesson is not "X is cheapest." It is that you cannot compare two SDKs until you convert both to the same unit and add bandwidth where the model charges for it separately. A headline per-minute rate that ignores egress can understate the real bill by half. The downloadable worksheet at the end runs this conversion for every vendor in the table.
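The same arithmetic can be written as a small calculator. This is a sketch under stated assumptions, not a quote from any vendor: the 1.5 megabits per second of egress per participant is a made-up input chosen only to show how bandwidth enters the bill, gigabytes are counted in decimal units, and plan fees and free tiers on the metered side are ignored. All function names are hypothetical.

```typescript
interface Usage {
  calls: number;
  participants: number;
  minutesPerCall: number;
}

function participantMinutes(usage: Usage): number {
  return usage.calls * usage.participants * usage.minutesPerCall;
}

/** Flat per-participant-minute plan with an optional monthly free allowance. */
function flatRateCost(minutes: number, usdPerMinute: number, freeMinutes = 0): number {
  return Math.max(0, minutes - freeMinutes) * usdPerMinute;
}

/**
 * Connection-minute plan with separately billed bandwidth.
 * `avgEgressMbps` (megabits per second delivered to each participant) is an ASSUMED input.
 */
function meteredCost(minutes: number, usdPerMinute: number, avgEgressMbps: number, usdPerGb: number): number {
  const megabytesPerMinute = (avgEgressMbps / 8) * 60;        // Mbps -> MB/s -> MB/min
  const gigabytes = (megabytesPerMinute * minutes) / 1000;    // decimal gigabytes
  return minutes * usdPerMinute + gigabytes * usdPerGb;
}

const monthlyMinutes = participantMinutes({ calls: 1000, participants: 4, minutesPerCall: 30 });
console.log(monthlyMinutes);                                            // 120000
console.log(flatRateCost(monthlyMinutes, 0.004).toFixed(2));            // 480.00
console.log(flatRateCost(monthlyMinutes, 0.004, 10000).toFixed(2));     // 440.00
console.log(meteredCost(monthlyMinutes, 0.0005, 1.5, 0.12).toFixed(2)); // 222.00 under the assumed bitrate
```

Under the assumed bitrate, bandwidth ($162) is larger than the connection-minute charge ($60), which is why a per-minute rate alone says little about a metered plan.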

[Figure: a build-versus-buy decision tree for choosing a video SDK. The top question asks whether AI in the media path is core to the product; a yes branch leads toward open frameworks and self-hosting, a no branch toward managed platforms. Further questions cover engineering capacity, scale, and whether end-to-end encryption is required, ending in recommended categories.]

Figure 3. The decision is not "which SDK is cheapest" but "does my product need to own the media path." Answer that first; price and vendor follow.

Where AI features map onto the two insertion points

To close the loop, here is how the common AI video features land on the pipeline, so you can predict whether a given SDK supports them. Features that need raw media live at the raw-frame point: background blur, virtual backgrounds, beauty filters, and face detection need pixels, so they depend on the SDK exposing camera frames plus a WebGPU path, while on-device captions and noise suppression need the raw audio samples instead. Features that need security (end-to-end encryption, watermarking metadata) live at the encoded-frame point and depend on Encoded Transform support. Features that run on the server rather than in the browser (server-side transcription, recording analysis, an in-call agent that joins as a participant) do not need either insertion point in the client; they need an SDK that lets a bot subscribe to the media server, which is where open frameworks and building-block platforms shine.
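That mapping is simple enough to encode as a lookup, which makes it easy to check a feature roadmap against what a candidate SDK exposes. The sketch below is illustrative: the feature keys, the `InsertionPoint` labels, and `unsupportedFeatures` are names made up for this example, and the capabilities passed in are whatever you determine from a vendor's documentation.

```typescript
type InsertionPoint = "raw-frame" | "encoded-frame" | "server-side";

// Feature-to-station mapping taken from the paragraph above.
const featureMap: Record<string, InsertionPoint> = {
  "background-blur": "raw-frame",
  "virtual-background": "raw-frame",
  "beauty-filter": "raw-frame",
  "face-detection": "raw-frame",
  "on-device-captions": "raw-frame",
  "end-to-end-encryption": "encoded-frame",
  "watermark-metadata": "encoded-frame",
  "server-transcription": "server-side",
  "recording-analysis": "server-side",
  "in-call-agent": "server-side",
};

/**
 * Returns the wanted features an SDK cannot support, given the access it exposes.
 * Unknown feature names are reported as unsupported rather than silently accepted.
 */
function unsupportedFeatures(wanted: string[], exposed: InsertionPoint[]): string[] {
  return wanted.filter((feature) => {
    const needs = featureMap[feature];
    return needs === undefined || exposed.indexOf(needs) === -1;
  });
}

// Example: a roadmap checked against an SDK that exposes only server-side access.
const roadmap = ["background-blur", "in-call-agent", "end-to-end-encryption"];
console.log(unsupportedFeatures(roadmap, ["server-side"]));
// [ 'background-blur', 'end-to-end-encryption' ]
```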

This mapping is why the SDK choice and the AI architecture are one decision, not two. A managed platform with great prebuilt blur but no raw-frame access is perfect until your product needs a custom model, at which point you are rebuilding on a different SDK. Deciding which insertion points you will need in year two should shape the SDK you pick in week one.

Where Fora Soft fits in

We have built real-time video products since 2005 — video conferencing, WebRTC apps, e-learning, telemedicine, and live surveillance — and the build-versus-buy SDK question is the first architectural conversation on almost every project. We have shipped on managed platforms when speed to market mattered, and on open frameworks like LiveKit when the product's value was the AI running inside the call. Our engineering team publishes ongoing cost and migration analyses comparing these platforms, because the right answer changes with scale, feature roadmap, and how much of the media path a product must own. The work we care about is matching the SDK to the AI you intend to build, so the pipeline you choose in week one still supports the feature you dream up in year two.

Call to action

Talk to a video engineer: book a 30-minute scoping call to talk through your video SDK plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Video SDK & WebRTC AI Selection Worksheet — One-page planner: the two WebRTC AI insertion points (raw-frame for vision + WebGPU, encoded-frame for E2EE via Encoded Transform), the 2026 WebGPU browser-support matrix, the priced eight-SDK comparison (100ms, Agora, Daily, Twilio….

References

W3C, "WebRTC: Real-Time Communication in Browsers" (Recommendation, 13 March 2025) — the finished core WebRTC platform standard that every vendor below builds on. https://www.w3.org/TR/webrtc/
W3C, "WebRTC Encoded Transform" (Working Draft, standards track) — defines RTCRtpScriptTransform, the encoded-frame insertion point; supersedes the deprecated Insertable Streams (createEncodedStreams). Firefox implements from v117; Chrome supported. https://www.w3.org/TR/webrtc-encoded-transform/
W3C, "MediaStreamTrack Insertable Media Processing using Streams" (Working Draft) — the raw-frame insertion point exposing each camera frame as a VideoFrame; the place on-device vision AI runs. https://www.w3.org/TR/mediacapture-transform/
W3C, "WebCodecs" (Working Draft) — low-level encode/decode access that pairs with the raw-frame API and WebGPU. https://www.w3.org/TR/webcodecs/
W3C, "WebGPU" (GPU for the Web Working Group) — modern GPU compute for the browser; reached Baseline across Chrome, Edge, Firefox, and Safari 26 in late 2025. https://www.w3.org/TR/webgpu/
IETF RFC 9605 (August 2024), "Secure Frame (SFrame)" — end-to-end authenticated encryption of whole media frames; the SFU sees routing metadata but not media. Applied via the Encoded Transform API. https://www.rfc-editor.org/rfc/rfc9605
IETF RFC 8825/8826/8834 (January 2021), WebRTC overview, security, and media transport — the transport layer beneath the W3C browser API. https://www.rfc-editor.org/rfc/rfc8825
IETF RFC 6716 (September 2012), "Definition of the Opus Audio Codec" — the standard WebRTC audio codec; relevant to the audio path the SDKs carry. https://www.rfc-editor.org/rfc/rfc6716
100ms — Pricing and About — single SDK bundling WebRTC conferencing, HLS, and chat; ~$0.004/participant-min video, ~75% audio discount, ~10,000 free min/mo; founded 2020 by ex-Disney+ Hotstar / Meta engineers. https://www.100ms.live/pricing
Agora — Pricing — $0.99/1k audio min, $3.99/1k HD video min; 10,000 free min/mo shared; 5–10% volume discounts (2026). https://www.agora.io/en/pricing/
Daily — Video API Pricing — 10,000 free participant-min/mo, then $0.004/participant-min, automatic volume discounts. https://www.daily.co/pricing/video-sdk/
Twilio, "Twilio Video Will Remain a Standalone Product" (Changelog, 21 October 2024). Reverses the earlier end-of-life decision; Video remains a standalone product, no customer action required. https://www.twilio.com/en-us/changelog/-twilio-video-will-remain-a-standalone-product
Vonage — Migration FAQ: from OpenTok to Unified Video API — OpenTok SDK line in maintenance; auth moves from API key/secret to Application ID + private key. https://api.support.vonage.com/hc/en-us/articles/22304634772764-Migration-FAQ-from-OpenTok-to-Unified-Video-API
Amazon Chime SDK — Pricing — audio $0.0017/attendee-min; video $0.0015/min (0–200 kbps) up to $0.017/min (>2,000 kbps); pay-per-use. https://aws.amazon.com/chime/chime-sdk/pricing/
LiveKit — Pricing — open-source server (self-host) plus Cloud tiers Build ($0), Ship ($50), Scale ($500); WebRTC $0.0004–0.0005/min + bandwidth $0.10–0.12/GB. https://livekit.com/pricing
Zoom — Video SDK pricing (2026 analyses) — ~$0.0035/participant-min via Zoom Build Platform credit packs; no true pay-as-you-go. https://trtc.io/blog/details/zoom-video-sdk-pricing-2026