Published 2026-06-01 · 19 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you are a product manager, founder, or engineering lead choosing how to add video to your product — or choosing whether to add AI features like background blur, live captions, or an in-call agent on top of it — the video SDK decision quietly sets the ceiling on everything you can build later. Pick a closed managed platform and you ship a working call in a week, but you may discover six months on that you cannot insert your own model into the media path. Pick an open framework and you get full control at the cost of more engineering. This article gives you the standards that govern where AI can live in a WebRTC call, and a priced, AI-readiness-aware comparison of the major SDKs, so the build-versus-buy conversation happens before you are locked in, not after.
First, what a "video SDK" actually is
A video SDK is a packaged toolkit — code libraries plus a hosted backend — that lets a developer add live video calling to an app without building the hard parts from scratch. The hard parts are real: capturing camera and microphone, compressing the streams, punching through firewalls, routing media between many participants, recording, and handling the thousand edge cases of bad networks. A video SDK does all of that behind a few function calls.
It helps to separate two terms that get used interchangeably. A video SDK is the client-side library you embed in your web, iOS, or Android app. A video conferencing API is the server-side interface: the hosted rooms, tokens, recording, and webhooks the SDK talks to. In practice the major vendors sell both as one product, so when this article says "video SDK" it means the whole package: the libraries plus the cloud behind them.
Almost all of these products are built on the same foundation: WebRTC. WebRTC, short for Web Real-Time Communication, is the open standard that lets browsers and apps send audio, video, and data to each other directly, with low delay. The core platform is a finished standard: the World Wide Web Consortium, the body that standardizes the web, first published WebRTC 1.0 as a full Recommendation in January 2021, and the current edition of that Recommendation is dated 13 March 2025. Underneath the browser API sits a family of transport specifications from the Internet Engineering Task Force, RFC 8825 through RFC 8834, that define how the media is secured and moved. The point for a buyer is that every vendor below speaks the same underlying language; what differs is how much of it they expose to you.
The real question: can you get to the frames?
Here is the decision that does not appear on any pricing page but governs whether your product can ever be more than a plain video call. To run an AI feature inside a live call (to blur a background, suppress noise, transcribe speech for live captions, or feed frames to a model), your code has to reach the individual video or audio frames as they flow through the pipeline. Some SDKs hand you that access. Others seal the media path shut and only give you a finished video element on the screen.
Think of the media pipeline as an assembly line. The camera produces raw frames at one end; the network ships them; the far side reassembles and paints them on a screen. An AI feature is a worker you want to insert at a specific station on that line. The question is whether the SDK lets you stand at the line at all, and if so, at which station.
The browser itself now defines exactly two stations where your code is allowed to stand. Understanding them is the whole game.
Figure 1. The two standardized places AI can enter a WebRTC call. The raw-frame point handles on-device vision and audio; the encoded-frame point handles encryption and lightweight metadata. They solve different problems.
Insertion point one: raw frames, where on-device AI lives
The first station is the raw-frame point. Here the frame is an actual uncompressed image — a grid of pixels — either just after the camera captures it and before it is compressed, or just after it is decompressed on the receiving side and before it is drawn on screen. This is where vision AI belongs, because a model that segments a person from a background or detects a face needs to see real pixels, not compressed data.
The browser exposes this through a W3C specification called MediaStreamTrack Insertable Media Processing using Streams, usually shortened to "insertable media processing" or, after its spec short name, the "mediacapture-transform" API. It hands your code each camera frame as a VideoFrame object you can read, modify, and pass on. Paired with it is WebCodecs, another W3C specification still progressing through the standards process, which gives low-level access to the encoder and decoder so you can compress or decompress frames yourself when you need to.
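The read, modify, and pass-on flow described above can be sketched as a stream transform. This is a minimal sketch, not any vendor's implementation: `createFrameTransform` and `FrameLike` are hypothetical names, and the browser wiring in the trailing comment assumes a Chromium-style main-thread `MediaStreamTrackProcessor` and `MediaStreamTrackGenerator`. Availability and exact class names differ by browser, so treat that part as an assumption to verify.

```typescript
// Minimal shape this sketch needs from a WebCodecs VideoFrame:
// frames hold scarce memory and must be closed when no longer used.
interface FrameLike {
  close(): void;
}

// Builds a TransformStream that runs `processFrame` on every frame.
// `processFrame` is a placeholder for your model (blur, segmentation, ...).
function createFrameTransform<F extends FrameLike>(
  processFrame: (frame: F) => F | Promise<F>,
): TransformStream<F, F> {
  return new TransformStream<F, F>({
    async transform(frame, controller) {
      const output = await processFrame(frame);
      // If the model returned a new frame, release the original one.
      if (output !== frame) {
        frame.close();
      }
      controller.enqueue(output);
    },
  });
}

// Browser wiring (assumed, not executed here):
//   const processor = new MediaStreamTrackProcessor({ track: cameraTrack });
//   const generator = new MediaStreamTrackGenerator({ kind: "video" });
//   processor.readable
//     .pipeThrough(createFrameTransform(blurBackground)) // hypothetical model fn
//     .pipeTo(generator.writable);
//   // `generator` is then used like any other camera track.
```

The transform is generic over the frame type so the same function can sit on the capture side or the receive side; only the source and sink streams change.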
The catch with raw-frame processing is speed. At 30 frames per second a new frame arrives roughly every 33 milliseconds (one second divided by 30 frames is 0.0333 seconds), and your AI worker has to finish before the next one shows up or the video stutters. A model that takes 40 milliseconds per frame does not just add delay: it can sustain at most 25 frames per second (1,000 ÷ 40), so at least one frame in six is dropped, and more if the pipeline discards every frame that arrives while the worker is busy. So raw-frame AI has to be fast, and that is where the graphics chip comes in.
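The per-frame budget is simple enough to compute directly. The sketch below is illustrative only: `frameBudget` is a hypothetical helper, and it reports the best case, assuming frames can be buffered so the model runs back to back. A pipeline that discards any frame arriving while the model is busy will lose more than this lower bound.

```typescript
interface FrameBudget {
  budgetMs: number;        // time available before the next frame arrives
  sustainedFps: number;    // best-case output rate with this model
  droppedFraction: number; // minimum share of captured frames lost
}

// captureFps: camera frame rate; modelMs: time the model needs per frame.
function frameBudget(captureFps: number, modelMs: number): FrameBudget {
  const budgetMs = 1000 / captureFps;
  const maxModelFps = 1000 / modelMs;
  // The pipeline can never output faster than the camera captures.
  const sustainedFps = Math.min(captureFps, maxModelFps);
  const droppedFraction = 1 - sustainedFps / captureFps;
  return { budgetMs, sustainedFps, droppedFraction };
}

const slowModel = frameBudget(30, 40); // 40 ms model at 30 fps
const fastModel = frameBudget(30, 3);  // 3 ms model at 30 fps
console.log(slowModel, fastModel);
```

Any model time above the budget shows up as a non-zero `droppedFraction`; anything below it leaves the remainder of the budget for capture, encode, and rendering.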
WebGPU: the engine that makes on-device AI fast enough
A modern AI model runs far faster on a graphics processing unit, or GPU — the chip originally built to draw game graphics — than on the general-purpose CPU, because a GPU does thousands of small calculations at once. Until recently, browsers had no clean way to use the GPU for general computation. That changed with WebGPU, a W3C standard that gives web code direct, modern access to the GPU for both graphics and raw number-crunching.
WebGPU's importance to this article is timing. As of late 2025 it reached what the web platform community calls "Baseline" status — meaning every major browser ships it on by default. Chrome has had it since version 113 in 2023; Firefox enabled it on Windows in version 141 in July 2025 and on Apple Silicon Macs in version 145; and Safari shipped it in Safari 26 across macOS, iOS, and iPadOS. For the first time, a developer can assume a fast on-device AI path exists in the browser without a plugin or a flag.
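Even with Baseline support, a particular device can still lack a usable GPU adapter, so production code usually checks before committing to the GPU path. A minimal sketch of that check follows. `navigator.gpu.requestAdapter()` is the real WebGPU entry point; `NavigatorWithGpu`, `hasWebGpuEntryPoint`, `chooseInferenceBackend`, and the `"fallback"` label are names invented for this illustration.

```typescript
// Only the part of the WebGPU surface this sketch touches.
interface GpuEntryPoint {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorWithGpu {
  gpu?: GpuEntryPoint;
}

type InferenceBackend = "webgpu" | "fallback";

// Cheap synchronous check: does the browser expose WebGPU at all?
function hasWebGpuEntryPoint(nav: NavigatorWithGpu): boolean {
  return typeof nav.gpu !== "undefined";
}

// Full check: the entry point exists AND an adapter is actually available.
async function chooseInferenceBackend(
  nav: NavigatorWithGpu,
): Promise<InferenceBackend> {
  if (!nav.gpu) {
    return "fallback";
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    // requestAdapter resolves to null when no suitable GPU is found.
    return adapter ? "webgpu" : "fallback";
  } catch {
    return "fallback";
  }
}

// In a browser: chooseInferenceBackend(navigator as NavigatorWithGpu)
```

What the fallback actually is (a CPU or WebAssembly model, or simply disabling the feature) is a product decision this sketch leaves open.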
Put the pieces together and a clear capability emerges. The raw-frame API gives your code each camera frame; WebGPU runs a model on it in a couple of milliseconds; the modified frame goes back into the pipeline before encoding. This is exactly how production background blur and segmentation work — Google's MediaPipe Selfie Segmentation model finishes a browser frame in under three milliseconds on the GPU, comfortably inside the 33-millisecond budget. We cover that specific feature in depth in the background-blur engineering deep-dive; here the point is structural: raw frames plus WebGPU is the on-device AI path, and it lives entirely on the user's machine with zero network cost.
Insertion point two: encoded frames, where encryption lives
The second station is the encoded-frame point, and it sits in a completely different place: after the video is compressed, between the encoder and the part of the pipeline that splits the compressed data into network packets. Here the frame is no longer an image you can look at — it is a compact block of encoded bytes. You cannot run vision AI on it, because there are no pixels to see. So why would anyone want access here?
Two reasons: encryption and metadata. The browser exposes this point through the WebRTC Encoded Transform specification, a W3C Working Draft on the standards track. Its main API object is called RTCRtpScriptTransform. It replaces an earlier, now-deprecated approach that was called Insertable Streams — if you read older articles or sample code referring to createEncodedStreams, that is the legacy name for the same idea, and new work should use the Encoded Transform API instead. Firefox has implemented it since version 117, and Chrome supports it.
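Because the modern and legacy encoded-frame APIs coexist in the field, code often has to detect which one is present. The sketch below shows one way to do that. `RTCRtpScriptTransform` and `createEncodedStreams` are the real API names mentioned above; `ScopeLike`, `pickEncodedFrameApi`, and the worker options in the trailing comment are hypothetical.

```typescript
type EncodedFrameApi =
  | "script-transform"          // RTCRtpScriptTransform (Encoded Transform)
  | "legacy-insertable-streams" // createEncodedStreams (deprecated)
  | "none";

// Only the globals this sketch inspects.
interface ScopeLike {
  RTCRtpScriptTransform?: unknown;
  RTCRtpSender?: { prototype: object };
}

function pickEncodedFrameApi(scope: ScopeLike): EncodedFrameApi {
  // Prefer the standards-track API whenever the constructor exists.
  if (typeof scope.RTCRtpScriptTransform === "function") {
    return "script-transform";
  }
  // Otherwise look for the legacy method on the sender prototype.
  const senderProto = scope.RTCRtpSender?.prototype;
  if (senderProto && "createEncodedStreams" in senderProto) {
    return "legacy-insertable-streams";
  }
  return "none";
}

// In a browser (assumed wiring, not executed here):
//   if (pickEncodedFrameApi(globalThis) === "script-transform") {
//     const worker = new Worker("e2ee-worker.js"); // hypothetical file
//     sender.transform = new RTCRtpScriptTransform(worker, { side: "send" });
//   }
```

Checking for the standards-track API first keeps new code on the supported path and confines the legacy branch to older browsers.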
The headline use of the encoded-frame point is end-to-end encryption. In a normal group call, the media server in the middle — called a Selective Forwarding Unit, or SFU, because it forwards each person's video to everyone else — can technically see the media it relays. End-to-end encryption closes that gap by encrypting each frame on the sender's device so that only other participants can decrypt it, leaving the server able to route but not watch. The standard that defines this is IETF RFC 9605, "Secure Frame," published in August 2024, which encrypts whole media frames while leaving the small routing metadata the SFU needs in the clear. The Encoded Transform API is the hook where your code applies that encryption. So the encoded-frame point is not an AI-vision tool; it is a security and signaling tool, and confusing the two is a common and costly mistake.
Common pitfall: teams hear "Insertable Streams let you process media" and assume that is where they plug in their background-blur or captioning model. It is not. The encoded-frame point handles compressed bytes and is for encryption and metadata; vision and audio AI must run at the raw-frame point on uncompressed data. Wiring a model to the wrong station means it has no pixels to work with and silently does nothing useful.
Figure 2. Two insertion points, two jobs. Raw frames carry pixels for AI to see; encoded frames carry compressed bytes for encryption and metadata. Choosing the wrong one is the most common WebRTC-AI mistake.
Why the SDK choice decides everything above
Now the two halves of this article join. Everything in the sections above assumes you can reach the frames. Whether you actually can depends entirely on the video SDK you chose, because each vendor decides how much of the WebRTC pipeline to expose.
Closed managed platforms hand you a polished video experience and, in exchange, keep the media pipeline private. You get a working call quickly, but your AI options are limited to whatever the vendor pre-built — their background blur, their captions — and you cannot insert your own model into the raw-frame path. Open frameworks take the opposite stance: they expose the WebRTC primitives, so you get the raw-frame and encoded-frame insertion points and can run any model you like, at the cost of building more yourself.
This is the build-versus-buy axis, and for an AI-driven video product it matters more than price. A cheaper SDK that seals the pipeline can be the most expensive choice you make if it blocks the feature your product depends on.
The video SDK landscape in 2026
The market has consolidated and shifted since the early WebRTC years, and a few changes are worth knowing before you compare line items.
100ms is a live-video infrastructure platform founded in 2020 by engineers who previously built large-scale streaming at Disney+ Hotstar and Meta. Its pitch is a single SDK that bundles WebRTC conferencing, HLS broadcast, chat, and recording, with prebuilt interface components so a team ships fast. Pricing is pay-as-you-grow at roughly $0.004 per participant-minute for video, with audio-only workloads discounted around 75%, a free monthly tier of about 10,000 minutes, and AI transcription available out of the box. It is a managed platform: fast to production, with the pipeline largely on their cloud.
Agora is one of the oldest pure-play real-time platforms. Its 2026 rates hold at about $0.99 per thousand audio minutes and $3.99 per thousand high-definition video minutes, with 10,000 free minutes a month shared across its real-time products and automatic volume discounts of five to ten percent at higher tiers. It is feature-rich and global, and also a managed platform.
Daily is known for developer experience and a clean pricing model: 10,000 free participant-minutes a month, then a flat $0.004 per participant-minute, with automatic volume discounts. It exposes more low-level control than most managed vendors and has invested heavily in AI-agent tooling.
Twilio Video deserves a clarification, because the internet is full of outdated claims that it is shutting down. Twilio announced in March 2024 that it would retire Programmable Video, then reversed that decision in October 2024: Twilio Video remains a standalone product with no action required from customers, now focused on one-to-one customer-engagement video. If you read a 2024 article declaring its death, that information is stale.
Vonage Video API is the descendant of TokBox OpenTok, a long-standing name in WebRTC. The platform is operational, but the original OpenTok SDK line is in maintenance mode and has been folded into a "Unified" Vonage Video API with new authentication (an Application ID and private key replacing the old API key and secret). Existing projects on OpenTok should plan a migration; treat that as a 2026 task, not an optional one.
Zoom Video SDK lets developers embed Zoom's media engine in their own apps, distinct from the Zoom meetings product. It runs about $0.0035 per participant-minute — $3.50 per thousand — sold through the Zoom Build Platform as credit packs ($100 for 100 credits, $450 for 500), without a true pay-as-you-go mode, which makes it awkward for bursty or seasonal traffic.
Amazon Chime SDK is AWS's building-block approach: pay-per-use with no minimums, audio at $0.0017 per attendee-minute, and video billed by bitrate, from $0.0015 per minute at low quality up to $0.017 per minute above two megabits per second. It integrates naturally with the rest of AWS and is a sensible choice for teams already deep in that ecosystem.
LiveKit is the standout open framework. Its server is open source, so you can self-host and pay only for your own cloud infrastructure, or use LiveKit Cloud with a free Build tier (5,000 WebRTC minutes, 50 gigabytes of egress, 1,000 AI-agent minutes monthly), a $50 Ship tier, and a $500 Scale tier. WebRTC connection time runs roughly $0.0004 to $0.0005 per minute plus bandwidth at $0.10 to $0.12 per gigabyte. Because it exposes the full pipeline and has built a dedicated AI-agents framework, it has become the default choice for teams whose product is the AI feature rather than a plain call. We cover its architecture and pricing in detail in the LiveKit meeting-assistant lesson.
The comparison that matters: price and AI-readiness together
Price alone is misleading, because the SDKs meter differently — some per participant-minute, some per thousand minutes, some by bandwidth — and because the cheapest sealed pipeline can block the feature you need. The table below puts the 2026 numbers next to the one column most comparisons omit: whether you can insert your own AI into the media path.
| SDK | 2026 price (approx.) | Free tier | Pipeline model | Insert your own AI? |
|---|---|---|---|---|
| 100ms | ~$0.004 / participant-min (video) | ~10,000 min/mo | Managed | Limited — vendor AI features |
| Agora | $3.99 / 1k HD video min | 10,000 min/mo (shared) | Managed | Limited — extensions framework |
| Daily | $0.004 / participant-min | 10,000 participant-min/mo | Managed, lower-level | Partial — frame access + agents |
| Twilio Video | Usage-based (1:1 focus) | Trial credit | Managed | Limited |
| Vonage Video API | Usage-based (Unified) | Trial credit | Managed | Limited |
| Zoom Video SDK | ~$0.0035 / participant-min | Credit-pack trial | Managed | Limited — vendor AI add-ons |
| Amazon Chime SDK | $0.0017 audio / $0.0015–0.017 video per min | Pay-per-use | Building blocks | Partial — you wire the media |
| LiveKit | $0.0004–0.0005 / WebRTC min + bandwidth | Build tier (5,000 min) | Open / self-hostable | Full — raw + encoded frames |
The pattern is the story. Managed platforms cluster around $0.004 per participant-minute and trade pipeline control for speed of delivery. Building-block and open options (Amazon Chime SDK, LiveKit) cost less per minute but ask more of your engineers, and they are the ones that let you reach the frames AI needs. There is no single winner — only a fit for your priorities.
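Since the table mixes per-minute and per-thousand-minute rates, a first step in any comparison is to normalize them. The sketch below does only that, using four video rates copied from the table; `ListedRate` and `toPerMinuteUsd` are hypothetical names, and it deliberately ignores free tiers, volume discounts, and bandwidth charges.

```typescript
type BillingUnit = "per-minute" | "per-thousand-minutes";

interface ListedRate {
  vendor: string;
  amountUsd: number;
  unit: BillingUnit;
}

// Converts any listed rate to US dollars per participant-minute.
function toPerMinuteUsd(rate: ListedRate): number {
  return rate.unit === "per-thousand-minutes"
    ? rate.amountUsd / 1000
    : rate.amountUsd;
}

// Video rates as quoted in the table above.
const listedRates: ListedRate[] = [
  { vendor: "Agora (HD video)", amountUsd: 3.99, unit: "per-thousand-minutes" },
  { vendor: "Daily", amountUsd: 0.004, unit: "per-minute" },
  { vendor: "Zoom Video SDK", amountUsd: 0.0035, unit: "per-minute" },
  { vendor: "100ms (video)", amountUsd: 0.004, unit: "per-minute" },
];

// Cheapest headline rate first; this says nothing about pipeline access.
const rankedRates = listedRates
  .map((rate) => ({ vendor: rate.vendor, perMinuteUsd: toPerMinuteUsd(rate) }))
  .sort((a, b) => a.perMinuteUsd - b.perMinuteUsd);

console.log(rankedRates);
```

Once normalized, the managed vendors land within a fraction of a cent of each other, which is why the pipeline column carries more weight than the price column.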
A worked example: what 1,000 group calls actually cost
Numbers make the trade concrete. Suppose your product runs 1,000 group calls a month, each with four participants for 30 minutes. The billable unit most vendors use is the participant-minute, so first compute the volume:
1,000 calls
× 4 participants
× 30 minutes
= 120,000 participant-minutes per month
On a managed platform at $0.004 per participant-minute, before any free tier:
120,000 participant-minutes
× $0.004
= $480 per month
Subtract a 10,000-minute free allowance and the bill falls to $440 (110,000 billable minutes × $0.004). On LiveKit Cloud, the same usage is metered as WebRTC connection minutes at roughly $0.0005 per minute plus bandwidth. For standard-definition group video that typically lands lower per minute but adds a separate egress charge, so the comparison depends on your video bitrate, not just minutes. The lesson is not "X is cheapest." It is that you cannot compare two SDKs until you convert both to the same unit and add bandwidth where the model charges for it separately. A headline per-minute rate that ignores egress can understate the real bill by half. The downloadable worksheet at the end runs this conversion for every vendor in the table.
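The same arithmetic can be written as a small calculator. This is a sketch under stated assumptions, not a quote: all function and type names are hypothetical, the 500 kbps per received stream is an illustrative bitrate, connection minutes are treated as equal to participant-minutes, egress is modeled as every participant receiving every other participant's stream, and plan quotas on the metered side are ignored.

```typescript
interface CallProfile {
  callsPerMonth: number;
  participantsPerCall: number;
  minutesPerCall: number;
}

function participantMinutes(p: CallProfile): number {
  return p.callsPerMonth * p.participantsPerCall * p.minutesPerCall;
}

function roundToCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

// Flat per-participant-minute billing with an optional free allowance.
function managedCostUsd(minutes: number, ratePerMinute: number, freeMinutes = 0): number {
  const billableMinutes = Math.max(0, minutes - freeMinutes);
  return roundToCents(billableMinutes * ratePerMinute);
}

// Assumption: each participant downloads one stream per other participant.
function egressGigabytes(p: CallProfile, kbpsPerReceivedStream: number): number {
  const streamsReceived = p.participantsPerCall - 1;
  const receiveSeconds = participantMinutes(p) * 60;
  const bits = receiveSeconds * streamsReceived * kbpsPerReceivedStream * 1000;
  return bits / 8 / 1e9; // bits -> bytes -> gigabytes
}

interface MeteredPlan {
  connectionRatePerMinute: number;
  egressRatePerGb: number;
}

// Connection minutes plus separately billed bandwidth.
function meteredCostUsd(p: CallProfile, plan: MeteredPlan, kbpsPerReceivedStream: number): number {
  const connectionUsd = participantMinutes(p) * plan.connectionRatePerMinute;
  const egressUsd = egressGigabytes(p, kbpsPerReceivedStream) * plan.egressRatePerGb;
  return roundToCents(connectionUsd + egressUsd);
}

const workedExample: CallProfile = { callsPerMonth: 1000, participantsPerCall: 4, minutesPerCall: 30 };
console.log(managedCostUsd(participantMinutes(workedExample), 0.004, 10_000));
console.log(meteredCostUsd(workedExample, { connectionRatePerMinute: 0.0005, egressRatePerGb: 0.12 }, 500));
```

Changing the assumed bitrate moves the metered total far more than changing the connection rate does, which is the reason bandwidth has to be part of the comparison.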
Figure 3. The decision is not "which SDK is cheapest" but "does my product need to own the media path." Answer that first; price and vendor follow.
Where AI features map onto the two insertion points
To close the loop, here is how the common AI video features land on the pipeline, so you can predict whether a given SDK supports them. Features that need pixels or audio samples (background blur, virtual backgrounds, beauty filters, face detection, on-device captions) live at the raw-frame point and depend on the SDK exposing raw camera or microphone frames plus a WebGPU path. Features that need security (end-to-end encryption, watermarking metadata) live at the encoded-frame point and depend on Encoded Transform support. Features that run on the server rather than in the browser (server-side transcription, recording analysis, an in-call agent that joins as a participant) do not need either insertion point in the client; they need an SDK that lets a bot subscribe to the media server, which is where open frameworks and building-block platforms shine.
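The mapping above is small enough to keep as a lookup table and check a roadmap against. The sketch below encodes exactly the features named in this section; the identifiers (`Feature`, `FEATURE_PLACEMENT`, `requiredAccess`) are hypothetical, and the `"server-side"` label stands for features that need a bot subscribing to the media server rather than a client insertion point.

```typescript
type InsertionPoint = "raw-frame" | "encoded-frame" | "server-side";

type Feature =
  | "background-blur"
  | "virtual-background"
  | "beauty-filter"
  | "face-detection"
  | "on-device-captions"
  | "end-to-end-encryption"
  | "watermarking-metadata"
  | "server-side-transcription"
  | "recording-analysis"
  | "in-call-agent";

// Where each feature has to run, as described in the text.
const FEATURE_PLACEMENT: Record<Feature, InsertionPoint> = {
  "background-blur": "raw-frame",
  "virtual-background": "raw-frame",
  "beauty-filter": "raw-frame",
  "face-detection": "raw-frame",
  "on-device-captions": "raw-frame",
  "end-to-end-encryption": "encoded-frame",
  "watermarking-metadata": "encoded-frame",
  "server-side-transcription": "server-side",
  "recording-analysis": "server-side",
  "in-call-agent": "server-side",
};

// Returns the distinct kinds of access an SDK must offer for a roadmap.
function requiredAccess(roadmap: Feature[]): InsertionPoint[] {
  const needed: InsertionPoint[] = [];
  for (const feature of roadmap) {
    const point = FEATURE_PLACEMENT[feature];
    if (needed.indexOf(point) === -1) {
      needed.push(point);
    }
  }
  return needed;
}

console.log(requiredAccess(["background-blur", "end-to-end-encryption", "in-call-agent"]));
```

Running a year-two feature list through `requiredAccess` gives the short list of capabilities to ask each vendor about before signing.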
This mapping is why the SDK choice and the AI architecture are one decision, not two. A managed platform with great prebuilt blur but no raw-frame access is perfect until your product needs a custom model, at which point you are rebuilding on a different SDK. Deciding which insertion points you will need in year two should shape the SDK you pick in week one.
Where Fora Soft fits in
We have built real-time video products since 2005 — video conferencing, WebRTC apps, e-learning, telemedicine, and live surveillance — and the build-versus-buy SDK question is the first architectural conversation on almost every project. We have shipped on managed platforms when speed to market mattered, and on open frameworks like LiveKit when the product's value was the AI running inside the call. Our engineering team publishes ongoing cost and migration analyses comparing these platforms, because the right answer changes with scale, feature roadmap, and how much of the media path a product must own. The work we care about is matching the SDK to the AI you intend to build, so the pipeline you choose in week one still supports the feature you dream up in year two.
What to read next
- Sub-100ms Real-Time Latency Budget Engineered
- How To Blur Background On Zoom, Teams, And iPhone — Engineering Deep-Dive
- LiveKit Real-Time AI Meeting Assistant — Architecture And Pricing
Talk to us / See our work / Download
- Talk to a video engineer — bring us your product idea and we will help you choose the SDK that fits the AI you want to build: /services/webrtc-development
- See our work — real-time video products we have shipped since 2005: /portfolio
- Download the Video SDK & WebRTC AI Selection Worksheet — a one-page planner with the two insertion points, the WebGPU support matrix, the priced SDK comparison, and the build-vs-buy checklist: Download the worksheet
References
- W3C, "WebRTC: Real-Time Communication in Browsers" (Recommendation, 13 March 2025) — the finished core WebRTC platform standard that every vendor below builds on. https://www.w3.org/TR/webrtc/
- W3C, "WebRTC Encoded Transform" (Working Draft, standards track). Defines RTCRtpScriptTransform, the encoded-frame insertion point; supersedes the deprecated Insertable Streams (createEncodedStreams). Firefox implements from v117; Chrome supported. https://www.w3.org/TR/webrtc-encoded-transform/
- W3C, "MediaStreamTrack Insertable Media Processing using Streams" (Working Draft). The raw-frame insertion point exposing each camera frame as a VideoFrame; the place on-device vision AI runs. https://www.w3.org/TR/mediacapture-transform/
- W3C, "WebCodecs" (Working Draft). Low-level encode/decode access that pairs with the raw-frame API and WebGPU. https://www.w3.org/TR/webcodecs/
- W3C, "WebGPU" (GPU for the Web Working Group) — modern GPU compute for the browser; reached Baseline across Chrome, Edge, Firefox, and Safari 26 in late 2025. https://www.w3.org/TR/webgpu/
- IETF RFC 9605 (August 2024), "Secure Frame (SFrame)" — end-to-end authenticated encryption of whole media frames; the SFU sees routing metadata but not media. Applied via the Encoded Transform API. https://www.rfc-editor.org/rfc/rfc9605
- IETF RFC 8825/8826/8834 (January 2021), WebRTC overview, security, and media transport — the transport layer beneath the W3C browser API. https://www.rfc-editor.org/rfc/rfc8825
- IETF RFC 6716 (September 2012), "Definition of the Opus Audio Codec" — the standard WebRTC audio codec; relevant to the audio path the SDKs carry. https://www.rfc-editor.org/rfc/rfc6716
- 100ms — Pricing and About — single SDK bundling WebRTC conferencing, HLS, and chat; ~$0.004/participant-min video, ~75% audio discount, ~10,000 free min/mo; founded 2020 by ex-Disney+ Hotstar / Meta engineers. https://www.100ms.live/pricing
- Agora — Pricing — $0.99/1k audio min, $3.99/1k HD video min; 10,000 free min/mo shared; 5–10% volume discounts (2026). https://www.agora.io/en/pricing/
- Daily — Video API Pricing — 10,000 free participant-min/mo, then $0.004/participant-min, automatic volume discounts. https://www.daily.co/pricing/video-sdk/
- Twilio — "Twilio Video Will Remain a Standalone Product" (Changelog, 21 October 2024) — reverses the March 2024 end-of-life decision; Video remains a standalone product, no customer action required. https://www.twilio.com/en-us/changelog/-twilio-video-will-remain-a-standalone-product
- Vonage — Migration FAQ: from OpenTok to Unified Video API — OpenTok SDK line in maintenance; auth moves from API key/secret to Application ID + private key. https://api.support.vonage.com/hc/en-us/articles/22304634772764-Migration-FAQ-from-OpenTok-to-Unified-Video-API
- Amazon Chime SDK — Pricing — audio $0.0017/attendee-min; video $0.0015/min (0–200 kbps) up to $0.017/min (>2,000 kbps); pay-per-use. https://aws.amazon.com/chime/chime-sdk/pricing/
- LiveKit — Pricing — open-source server (self-host) plus Cloud tiers Build ($0), Ship ($50), Scale ($500); WebRTC $0.0004–0.0005/min + bandwidth $0.10–0.12/GB. https://livekit.com/pricing
- Zoom — Video SDK pricing (2026 analyses) — ~$0.0035/participant-min via Zoom Build Platform credit packs; no true pay-as-you-go. https://trtc.io/blog/details/zoom-video-sdk-pricing-2026


