Published 2026-06-04 · 35 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
This article is for the founder or product manager who has decided to build an "AI video meetings" product and now needs to know what the real thing costs, how long it takes, and which parts are bought versus built — and for the engineer who has read the individual lessons and wants them welded into one deployable system with named technologies and numbers. It assumes you have met the underlying ideas already, because a capstone assembles rather than re-derives; the cross-links point back to each foundational lesson when you need the detail. By the end you will be able to draw the production architecture on a whiteboard, name the exact 2026 technology in every box, defend the cost per meeting-hour to a finance team, and sequence the build so the first paying-customer-ready version ships in weeks. Read the Phase 6 capstone on the three-plane architecture first if you want the conceptual map; this article is its engineering counterpart.
What You Are Building, Stated Precisely
Fix the product before any technology. You are building a multi-party video meeting application — think a focused Zoom or Google Meet — with five AI features layered on top of a normal call: background blur and replacement, microphone noise suppression, live captions, live translation of those captions into another language, and an AI notetaker that produces a summary with action items after the meeting ends. None of those five is exotic in 2026; the engineering work is making all five run at once, on the same meeting, without the call feeling slow, and without the cloud bill scaling with the number of silent listeners.
"Real-time" is the constraint that shapes every later decision, and it has a measurable meaning. A conversation feels natural only while the gap between one person finishing and the other reacting stays short. Published research and production practice put comfortable two-way audio under roughly 200 milliseconds one way — a millisecond, written ms, is one-thousandth of a second, so 200 ms is one-fifth of a second — and the practical ceiling for an AI assistant to hear a question and begin answering is about 800 ms before it feels sluggish. Every feature you add spends part of one of those two budgets. The art of the build is adding intelligence while staying inside them.
A useful picture: a meeting is a relay race where the baton is the conversation, and every AI feature is another runner you put on the track. Add runners carelessly and the team falls behind — the call lags, people talk over each other, the assistant answers a beat too late. The architecture below exists to fit more runners on the track without dropping the baton.
The Spine: One Media Server And The Agent Pattern
Two ideas carry the entire build. Get them right and everything else is detail.
The first is the media server, specifically a kind called an SFU — a Selective Forwarding Unit. In a meeting with more than two or three people, participants do not send video straight to each other, because that forces every laptop to upload its camera to everyone at once, draining batteries and saturating home internet. Instead, each person sends one copy of their audio and video to a server in the middle, and that server forwards each stream to the others. "Selective forwarding" is the whole job: it relays streams and, when bandwidth is tight, chooses which quality layer to pass on — but it never decodes and re-encodes the video, which is the expensive part. Because it never transcodes, one modest SFU serves a large room cheaply. Every serious 2026 platform uses this model.
The second idea is the AI agent as a participant. Rather than wiring machine-learning models into the server's internals, you write a small program that joins the meeting room exactly like a person would: it subscribes to the audio and video, runs models, and publishes results — captions, a translated voice, a running summary — back into the room. This is the pattern the LiveKit Agents framework formalizes, defining an agent as any Python or Node.js program added to a room as a full real-time participant. The payoff is separation: the agent scales independently of the media server, it is written by your application team rather than your infrastructure team, and you can run exactly one per meeting without ever touching the media plumbing.
Hold those two ideas together and the platform has a clean shape. The SFU moves pixels and sound between humans. Agents add intelligence by joining rooms. Everything in the rest of this article is deciding what goes where, what you build versus buy, and how it scales. For the deeper reasoning behind this division — the "three planes" mental model of media, AI, and control — see the Phase 6 capstone; here we take it as settled and build on it.
The Production Architecture, Box By Box
A real deployment is more than an SFU and an agent. Six kinds of component show up in every conferencing platform we have shipped, and naming them precisely is the first hour of any project.
The client is the browser tab or mobile app each participant uses. It captures the camera and microphone, runs the on-device AI (background blur, noise suppression), encodes the media, and renders everyone else. This is where the privacy-sensitive, per-person work lives, because the raw camera frames and raw microphone sound exist only on the device that captured them.
The token server is a small backend service you write. Before anyone joins a room, your application decides whether they are allowed in and issues a short, signed pass — a token — that the media server will accept. This is your business logic: authentication, who can host, who can only watch. It carries no media, so it is cheap and easy to scale, and keeping it separate from the media server is what lets you change access rules without touching real-time infrastructure.
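The core of a token server fits in a few lines. The sketch below is a minimal illustration using only the Python standard library: it mints a short-lived HS256 JSON Web Token and verifies it the way a media server would. The function names (`mint_room_token`, `verify_room_token`) and the claim layout (`room`, `can_publish`) are assumptions made for the example; a real media server defines its own grant schema and usually ships an SDK helper for it, so treat this as the shape of the idea rather than a drop-in implementation.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_room_token(api_key: str, api_secret: str, identity: str, room: str,
                    can_publish: bool, ttl_seconds: int = 600,
                    now: Optional[int] = None) -> str:
    """Return a short-lived HS256 JWT granting access to one room."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,
        "sub": identity,
        "nbf": issued,
        "exp": issued + ttl_seconds,
        # Hypothetical grant layout; match your media server's schema.
        "room": room,
        "can_publish": can_publish,
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(api_secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_room_token(token: str, api_secret: str,
                      now: Optional[int] = None) -> dict:
    """Check signature and expiry; return the claims or raise ValueError."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(api_secret.encode("utf-8"),
                        f"{header_b64}.{claims_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(_b64url(expected), sig_b64):
        raise ValueError("bad signature")
    padded = claims_b64 + "=" * (-len(claims_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    current = int(time.time()) if now is None else now
    if current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The short default lifetime is the design choice that matters: a leaked token stops working within minutes, and changing who may host or publish means changing only this service.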
The SFU is the media server described above. In production it does not run as a single machine; it runs as a pool of instances, usually on Kubernetes — the open-source system for running and automatically scaling containers across many servers — so that a busy hour spins up more SFU capacity and a quiet night spins it back down.
The TURN server solves a networking problem that sinks roughly 15 to 20 percent of connections if you ignore it. Many corporate firewalls and mobile carrier networks refuse to let two devices connect directly. A TURN server (the name comes from "Traversal Using Relays around NAT") is a relay that both sides can always reach, so the media has a path even through a hostile network. The standard open-source implementation is coturn, which implements the TURN and STUN standards (IETF RFC 8656 and RFC 8489). Skip it and as many as one in five of your users simply cannot connect, a failure that looks random and is maddening to debug.
The agent workers are the pool of programs that join rooms to do AI. One worker handles speech-to-text for captions, translation, and the notetaker; you run them as their own auto-scaling pool because their cost and load profile differ completely from the SFU's. When a meeting starts, a dispatcher assigns an agent to it; when it ends, the agent leaves and its slot frees up.
The datastore and application backend hold everything that outlives the meeting: user accounts, room history, and the transcripts and summaries the notetaker produces. This is ordinary web-application territory — a database, an API, object storage for recordings — and it is deliberately boring, because the hard real-time problems are solved elsewhere.
Figure 1. The production deployment. Two auto-scaling pools — SFU media servers and AI agent workers — sit between thin clients and a boring, reliable backend. Each box maps to a concrete 2026 technology in the next section.
Build Versus Buy: The 2026 Verdict, Component By Component
A capable team does not write all of this from scratch, and does not buy all of it either. The line in 2026 sits in a fairly stable place, and getting it right is the difference between shipping in a quarter and burning a year. The rule of thumb: adopt the hard real-time infrastructure, buy the fast-moving models, and build only the part that is your actual product.
| Component | Build or buy | Concrete 2026 choice | Why |
|---|---|---|---|
| Media server (SFU) | Adopt open source or buy managed | LiveKit (Apache 2.0, self-host or Cloud), mediasoup, or a managed video SDK | Real-time media has a long tail of network edge cases; re-learning it has no business upside |
| Agent framework | Adopt | LiveKit Agents (built-in turn detection, noise cancellation) | Solves turn-taking and the streaming speech pipeline that is tedious and error-prone to build |
| Speech-to-text | Buy as a service | Deepgram, AssemblyAI, or OpenAI Whisper API | Models improve monthly; self-hosting pays off only at large, steady volume |
| Translation | Buy as a service | A frontier LLM or a dedicated translation API | Same logic as speech-to-text; quality moves fast |
| On-device blur | Build on a free model | MediaPipe Selfie Segmentation on WebGPU | Must run on raw frames on the device; the model is free and the integration is yours |
| Noise suppression | Build on a model, or buy | DeepFilterNet (open, 48 kHz) or Krisp (paid SDK since May 2026) | Runs on raw audio on the device; pick open-source or paid by your volume and budget |
| TURN relay | Adopt open source | coturn (RFC 8656 / 8489) | A solved, standardized problem; never write your own NAT traversal |
| Token server, app backend, UX | Build | Your stack | This is your product — access rules, vertical features, and the experience are what differentiate you |
Two cells deserve a note because they changed in 2026. Noise suppression got more expensive to buy: Krisp moved its background-voice-cancellation SDK to metered, paid usage on 1 May 2026, while RNNoise — the old free standby — has not been maintained since 2024 and struggles with modern noise. The open, free choice that still holds up is DeepFilterNet, a deep-learning suppressor that runs on the device at full 48 kHz audio quality. And the media layer is now a true commodity you adopt: LiveKit's server is Apache-2.0 licensed and self-hostable, so the question is no longer "build or buy" but "self-host or use the managed cloud" — a cost decision we return to below. The per-model build-versus-buy reasoning for the audio features is the subject of lesson 6.6 on Krisp, Maxine, and Dolby.
Figure 2. What to build and what to adopt. The pattern is consistent: adopt the infrastructure, buy the models, build the product. Anything you build that is not your differentiator is a liability.
Following One Meeting From Click To Summary
Numbers and boxes become concrete when you trace a single meeting through the system. Follow one participant, Maria, joining a six-person team standup.
Maria clicks the meeting link. Her browser asks your application backend whether she may join; it checks her login and calls the token server, which mints a signed token naming her room and her permissions. None of this touches media — it is ordinary web traffic, and it finishes in well under a second.
Her browser uses the token to connect to the SFU. If her office firewall blocks a direct path, the connection automatically falls back through the TURN relay; either way, a media path opens. Before her video leaves the laptop, the on-device pipeline runs: the background-blur model decides, pixel by pixel, which parts of the frame are Maria and which are the wall behind her, and blurs the wall; in parallel, the noise-suppression model strips the air-conditioner hum out of her microphone. Only the cleaned, compressed streams travel to the SFU, which forwards them to the other five participants. Crucially, the SFU also makes the streams available to one more "participant" that joined when the meeting started: the agent.
The agent worker subscribes to the audio. It runs streaming speech-to-text — the model emits words as they are spoken rather than waiting for full sentences, which is what makes captions feel live — and pushes the resulting caption text back into the room, where every client overlays it. The same transcript feeds two more steps: a translation model renders the captions in Spanish for the one remote teammate who prefers it, and the notetaker quietly accumulates the full transcript. When the meeting ends, the agent makes a final pass over that transcript to produce a summary and action items, and your backend stores the result against Maria's team so anyone can read it later.
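The "one transcript, three language features" flow can be sketched without any real speech service. In the example below every name is hypothetical: `fake_streaming_stt` stands in for a streaming speech-to-text client, `fake_translate` for a translation call, and `publish` for whatever mechanism your agent framework uses to send text into the room. It shows the control flow only, not a real agent.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def fake_streaming_stt(segments: List[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming STT client: yields caption segments as they arrive."""
    for segment in segments:
        await asyncio.sleep(0)  # hand control back, as a network stream would
        yield segment


def fake_translate(text: str) -> str:
    """Stand-in for a translation call; it only tags the text."""
    return f"[es] {text}"


async def run_caption_agent(stt_stream: AsyncIterator[str],
                            publish: Callable[[str, str], None]) -> str:
    """Publish live captions and translations; return the full transcript."""
    transcript: List[str] = []
    async for segment in stt_stream:
        transcript.append(segment)                       # notetaker accumulates
        publish("captions", segment)                     # live captions
        publish("captions-es", fake_translate(segment))  # translated track
    # After the stream ends, the summary pass would run over this transcript.
    return " ".join(transcript)
```

Captions and translation are published per segment while the meeting is live; the notetaker only needs the accumulated text once the stream closes, which is why it adds no latency to the call.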
Notice the symmetry. Heavy per-person pixel and audio work happens on the devices, where the raw signal lives and where each new participant brings their own processor. Language work happens in one agent, where a model can see the whole conversation. The SFU in the middle stays lean, forwarding without transcoding. That shape is not decoration; it is exactly what keeps the system inside its time and cost budgets.
Figure 3. One meeting, end to end. Authentication and tokens are ordinary web traffic; media flows through the SFU; the agent listens and publishes language results back into the same room.
The Two Budgets You Cannot Break
There is no single latency number for this platform, because two loops run at once with different ceilings. Teams that confuse them optimize the wrong thing.
The interactive media budget governs whether people talk over each other: how long Maria's voice takes to reach the other five participants' ears. The target is to keep one-way audio under about 200 ms. On-device features spend from this budget directly: a denoiser that adds 20 ms and a blur that adds 15 ms are both in the live path. The arithmetic is unforgiving. Suppose network transport costs 120 ms one way. Then the remaining budget is 200 − 120 = 80 ms, and that 80 ms must cover capture, on-device AI, encoding, and playout combined. That is why on-device models must be small, and why you never stack four of them without measuring.
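The same subtraction is worth keeping as a small check you can rerun whenever a feature is added. This is a minimal sketch: the function name `remaining_media_budget_ms` is made up for the example, and the stage costs are the illustrative figures from the paragraph above, not measurements.

```python
from typing import Dict


def remaining_media_budget_ms(network_ms: float,
                              on_device_costs_ms: Dict[str, float],
                              ceiling_ms: float = 200.0) -> float:
    """Milliseconds left for capture, encoding, and playout after the
    network and the on-device AI stages have taken their share."""
    spent = network_ms + sum(on_device_costs_ms.values())
    return ceiling_ms - spent


# Figures from the text: 120 ms network, 20 ms denoiser, 15 ms blur.
left_without_ai = remaining_media_budget_ms(120, {})
left_with_ai = remaining_media_budget_ms(120, {"denoiser": 20, "blur": 15})
```

With the article's numbers, `left_without_ai` is 80 ms and `left_with_ai` is 45 ms, which is all that remains for capture, encoding, and playout once both on-device models are running.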
The AI voice-loop budget governs whether an assistant feels responsive: how long it takes to hear a question and begin answering. Its practical ceiling is about 800 ms. A representative 2026 breakdown, drawn from production voice-agent platforms, is voice-activity detection and turn-taking 50 ms, final speech-to-text 150 ms, the language model's time to its first word 400 ms, text-to-speech first audio chunk 150 ms, and network 50 ms. Add them: 50 + 150 + 400 + 150 + 50 = 800 ms, the budget fully spent with nothing to spare. Above 800 ms the assistant feels slow; past 1,500 ms, callers report the exchange feels broken.
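Writing the breakdown down as data makes it obvious which stage to attack first. The sketch below uses the representative stage figures quoted above; the dictionary keys and the helper name `voice_loop_report` are illustrative, and real numbers should come from your own measurements.

```python
from typing import Dict

# Representative 2026 figures from the text, in milliseconds.
VOICE_LOOP_STAGES_MS: Dict[str, int] = {
    "vad_and_turn_taking": 50,
    "final_speech_to_text": 150,
    "llm_time_to_first_word": 400,
    "tts_first_audio_chunk": 150,
    "network": 50,
}


def voice_loop_report(stages_ms: Dict[str, int], ceiling_ms: int = 800) -> dict:
    """Sum the stages, report headroom against the ceiling, and name the
    single largest contributor."""
    total = sum(stages_ms.values())
    slowest = max(stages_ms, key=stages_ms.get)
    return {
        "total_ms": total,
        "headroom_ms": ceiling_ms - total,
        "slowest_stage": slowest,
    }


report = voice_loop_report(VOICE_LOOP_STAGES_MS)
```

Here the report shows a total of 800 ms, zero headroom, and the language model's time to first word as the largest stage, so that is where a faster model or earlier streaming buys the most.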
The lever that hides in plain sight is turn-taking — deciding the person has actually finished speaking, not just paused. A clumsy detector adds hundreds of milliseconds of delay that appear in no model benchmark, because the time is lost waiting, not computing. Modern frameworks replace the old fixed silence timer with a small model trained to predict the end of a turn from the words themselves: LiveKit's 2026 stack pairs an acoustic voice-activity detector with a roughly 135-million-parameter language model that runs locally and judges whether the sentence sounds finished. Streaming every stage — partial transcripts into the model, model words into the speech synthesizer before the sentence is done — is what keeps a good loop under one second. Lesson 6.1 on the sub-100 ms latency budget breaks these numbers down stage by stage.
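The difference between a fixed silence timer and a model-assisted detector is easiest to see side by side. The sketch below is an illustration of the general idea only, not LiveKit's implementation: the function names, the 700 ms timer, the silence bounds, and the 0.6 threshold are all assumptions, and `eot_probability` stands in for whatever score an end-of-turn model would produce.

```python
def fixed_timer_end(silence_ms: float, timeout_ms: float = 700.0) -> bool:
    """Old approach: the turn ends only after a fixed stretch of silence."""
    return silence_ms >= timeout_ms


def should_end_turn(silence_ms: float, eot_probability: float,
                    min_silence_ms: float = 200.0,
                    max_silence_ms: float = 1500.0,
                    threshold: float = 0.6) -> bool:
    """Model-assisted approach (hypothetical thresholds).

    - Below min_silence_ms, never end: the speaker is mid-breath.
    - Above max_silence_ms, always end: a safety net if the model is unsure.
    - In between, trust the model's judgment that the sentence is finished.
    """
    if silence_ms < min_silence_ms:
        return False
    if silence_ms >= max_silence_ms:
        return True
    return eot_probability >= threshold
```

With these example settings, a finished sentence followed by 300 ms of silence ends the turn under the model-assisted rule but not under the fixed timer, which is where the recovered time comes from. An unfinished sentence with the same pause keeps the turn open, so the assistant does not interrupt.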
Scaling: Three Tiers, Three Different Builds
How you deploy depends almost entirely on how many people are in a room at once, and the jump between tiers is where naive designs fall over.
Below 100 participants, a single SFU instance handles the room. This covers the overwhelming majority of business meetings, classes, and consultations. You still run a pool of SFU instances for reliability and to host many simultaneous rooms, but no single room needs more than one. This is the tier to launch in; do not engineer for the others until you have customers in them.
From 100 to about 1,000 participants, the room outgrows a single forwarding path for video, and you lean on two techniques. Simulcast means each camera sends several quality versions at once, so the SFU can forward a low-resolution copy to people who only need a thumbnail and a sharp copy to those who need detail. SVC, scalable video coding, packs those layers into one stream more efficiently. Together they let one SFU serve a large audience of mostly-watching participants while only a handful publish video. A well-tuned open-source SFU node handles roughly 500 to 800 concurrent video participants at this tier.
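The selection an SFU makes per subscriber can be sketched as a simple rule: forward the best simulcast layer that fits both the viewer's available bandwidth and the size at which the video will actually be shown. The layer names, resolutions, and bitrates below are illustrative assumptions, and `select_layer` is a hypothetical helper, not any SFU's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    height: int         # vertical resolution in pixels
    bitrate_kbps: int   # approximate bandwidth the layer needs


# Illustrative simulcast ladder; real encoders pick their own values.
LAYERS: List[Layer] = [
    Layer("low", 180, 150),
    Layer("medium", 360, 500),
    Layer("high", 720, 1700),
]


def select_layer(layers: List[Layer], available_kbps: int,
                 max_height: int) -> Layer:
    """Pick the highest-bitrate layer that fits the subscriber's bandwidth
    and does not exceed the height the tile is rendered at. If nothing
    fits, fall back to the cheapest layer instead of sending nothing."""
    candidates = [
        layer for layer in layers
        if layer.bitrate_kbps <= available_kbps and layer.height <= max_height
    ]
    if candidates:
        return max(candidates, key=lambda layer: layer.bitrate_kbps)
    return min(layers, key=lambda layer: layer.bitrate_kbps)
```

A thumbnail viewer on a fast connection still gets the low layer, because `max_height` caps the choice. That cap is what lets one server carry a large, mostly-watching audience.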
Above 1,000 participants, you cross into distributed territory: the room spans multiple SFU instances, often in different regions, linked into a mesh so that media enters the server nearest the speaker and is relayed to the servers nearest each audience member. This "cascading" keeps distant viewers from paying the full network penalty on every frame, and LiveKit's design treats a relay link as just another participant. For the very largest broadcast-style events, teams often switch the watching majority to a streaming protocol entirely and keep WebRTC only for the active speakers, a hybrid covered in the conferencing playbook.
The agent pool scales on a different axis: by minutes of speech processed, not by head count. In a 200-person all-hands with one person speaking at a time, you transcribe one stream, not 200. That asymmetry is the key to the cost model, which is where many business plans quietly break.
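The two scaling axes can be put next to each other in a few lines. This is a sketch under the tier boundaries stated above; the tier labels and the helper names `deployment_tier` and `load_profile` are invented for the example.

```python
def deployment_tier(participants: int) -> str:
    """Map room size to the three tiers described in the text."""
    if participants < 100:
        return "single-sfu"
    if participants <= 1000:
        return "simulcast-svc"
    return "cascaded-mesh"


def load_profile(participants: int, minutes: int,
                 speakers_at_once: int = 1) -> dict:
    """Media load grows with head count; transcription load grows only
    with how many people are speaking at the same time."""
    return {
        "tier": deployment_tier(participants),
        "media_participant_minutes": participants * minutes,
        "transcribed_minutes": speakers_at_once * minutes,
    }


all_hands = load_profile(participants=200, minutes=60)
```

For the 200-person, one-hour all-hands, `all_hands` holds 12,000 media participant-minutes but only 60 transcribed minutes, which is the asymmetry the cost model depends on.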
Figure 4. Scale by room size on top; cost by where the work runs on the bottom. On-device work is free to the operator, SFU minutes scale per participant, and agent cost tracks minutes of speech — not attendee count.
A Cost Model With The Arithmetic Shown
Pricing this platform correctly means understanding that costs come in three shapes, and only one of them scales with the number of attendees. Walk through a concrete example: a one-hour, 25-person company meeting with all five AI features on.
The on-device features — background blur and noise suppression — run on each of the 25 laptops. That is 25 models running, but each runs on its owner's hardware and costs you, the operator, nothing. This is the quiet superpower of on-device placement: it scales for free with attendance, because every new participant brings their own processor.
The media minutes are what the SFU costs. On a managed platform in 2026, ordinary human-to-human connection time runs roughly $0.0004 to $0.0005 per participant-minute. For our meeting: 25 participants × 60 minutes × $0.0005 = $0.75 of media for the hour. Self-hosting the SFU replaces this with raw server and bandwidth cost, which is cheaper at high steady volume and more expensive once you count the engineers who keep it running.
The agent and model minutes are the AI cost, and they track speech, not attendance. With one person speaking at a time, you run about 60 minutes of streaming speech-to-text, 60 minutes of translation, and a summary pass. Streaming transcription in 2026 runs around $0.0077 per minute (Deepgram Nova-3), so 60 × $0.0077 = $0.46. The agent session itself, on a managed platform, runs about $0.01 per minute: 60 × $0.01 = $0.60. Those two lines sum to $1.06. Translation and the final summary through a frontier model add a few cents more, so round the whole AI layer up to roughly $1.20 for the hour.
Add them: about $0 on-device + $0.75 media + $1.20 AI ≈ $1.95, so just under $2.00 for a fully AI-enhanced 25-person meeting-hour on managed infrastructure, before your own margins and the application backend. The shape matters more than the exact figure: a product that runs everything in the cloud would pay for 25 GPU slots when it needed one agent. Understand the shape and you price and scale correctly. The full set of cost levers, from self-hosting break-even to model selection, is the subject of lesson 8.4 on video-AI cost optimization.
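The three cost shapes translate directly into a small calculator. The sketch below uses the per-minute rates quoted in this section as defaults. The `translation_and_summary_usd` default of $0.14 is an assumption chosen only so the AI layer lands near the rounded $1.20 used above, and the function name `meeting_cost_usd` is made up for the example.

```python
def meeting_cost_usd(participants: int, minutes: int,
                     speakers_at_once: int = 1,
                     media_rate: float = 0.0005,    # per participant-minute
                     stt_rate: float = 0.0077,      # per transcribed minute
                     agent_rate: float = 0.01,      # per agent session minute
                     translation_and_summary_usd: float = 0.14  # assumption
                     ) -> dict:
    """Split one meeting's operator cost into the three shapes from the text."""
    on_device = 0.0  # blur and denoise run on participants' own hardware
    media = participants * minutes * media_rate
    ai = (speakers_at_once * minutes * stt_rate
          + minutes * agent_rate
          + translation_and_summary_usd)
    return {
        "on_device": on_device,
        "media": media,
        "ai": ai,
        "total": on_device + media + ai,
    }


team_meeting = meeting_cost_usd(participants=25, minutes=60)
large_meeting = meeting_cost_usd(participants=250, minutes=60)
```

Comparing the two results shows the shape: going from 25 to 250 participants multiplies the media line by ten (from about $0.75 to about $7.50) while the AI line stays at about $1.20, because there is still one stream of speech to process.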
One break-even worth remembering: self-hosting the media server starts to beat managed cloud at roughly 500 or more concurrent sessions sustained, the point where the saved per-minute fees outweigh the cost of a team running Kubernetes and on-call for media servers. Below that, managed cloud is cheaper once engineering time is counted. Decide this deliberately, not by reflex.
Common Mistake: Building The Cloud Bill Into The Architecture
The failure we are called to fix most often on conferencing products is not a bad model — it is a good design that put the work in the wrong place and now scales its own cost. Three versions recur, and all three are decided at architecture time, not in tuning.
The first is running per-person AI in the cloud instead of on the device. A team puts background blur on a server "to support old phones", then discovers that every participant now needs a slice of cloud GPU for the whole meeting, and the bill scales with attendance. Blur and noise suppression belong on the device that captured the raw signal; that placement is free to the operator and is the strongest privacy position, because the unmodified camera feed never leaves the room.
The second is making the SFU decode video to run a model inline. The SFU's entire efficiency comes from forwarding without transcoding. The moment you force it to decode every camera to run, say, a moderation model, you have rebuilt the expensive mixing server the SFU was designed to replace, and your per-room cost multiplies. The fix is the agent pattern: fan a copy of the stream to a separate worker that decodes and runs the model, leaving the SFU lean. Real-time moderation specifically belongs in that fan-out, the subject of lesson 6.12.
The third is running two noise suppressors at once. Browsers ship a free built-in suppressor; if you add your own without turning the built-in one off, the two fight and produce a robotic, underwater voice. Exactly one suppressor in the chain, on raw audio, before the audio is compressed — this is the single most common audio bug we see, and it is a one-line configuration fix that teams spend days chasing. The integration detail is in lesson 6.5 on noise suppression.
The Build Plan: Five Milestones, Value At Every Step
You do not build this all at once, and you do not build it in the order the diagram is drawn. You build it so that a working product exists after the first milestone and every later milestone ships independently, which keeps the project fundable and the team motivated.
Milestone 1 — a plain, reliable call. Stand up the SFU, the token server, and a TURN relay, and ship a meeting that just works: solid audio and video, reconnection after a dropped network, stability on the devices your users actually have. No AI yet. Everything else attaches to this foundation, and if it is shaky, no amount of AI will save the product. This is also where you make the self-host-versus-managed decision for the media layer.
Milestone 2 — on-device cleanup. Add background blur and noise suppression in the client. These are the features users notice in the first ten seconds, they cost the operator nothing per participant, and they need no new backend service. They are the highest value-to-effort ratio in the whole system, which is why they come before anything cloud-based. The how-to crowd searching "how to blur my background" converts here; lesson 6.3 is the deep dive.
Milestone 3 — the caption agent. Stand up one agent that joins the room and produces live captions with streaming speech-to-text. This is the moment the product becomes truly "AI-enhanced", and it establishes the agent pattern that every later language feature reuses. Lesson 6.9 on SFU-side caption fan-out covers the delivery pattern.
Milestone 4 — translation and notes, as extensions of the same agent. Once captions work, translation is a small addition: the transcript already exists, so you add a translation step and publish a second caption track (lesson 6.11). The notetaker reuses that same accumulated transcript to produce a post-meeting summary, the territory of lesson 6.14 on the LiveKit meeting assistant and lesson 6.18 on how the commercial notetakers are built. One agent, three language features, one transcript.
Milestone 5 — the regulated and the specialized. Content moderation, in-call avatars, and domain-specific assistants come last, because they are higher effort, narrower in appeal, or carry compliance weight. Moderation in particular touches both the trust model and, in regulated markets, the law. Build it deliberately rather than rushing it.
Figure 5. The build staircase. Each step ships a working product, so value compounds instead of waiting for one big launch — and funding and morale survive the project.
Production Concerns: Observability, Security, And The Law
A meeting that works in a demo is not a product. Three cross-cutting concerns separate a prototype from something you can sell.
Observability means you can see why a call went wrong after it ended. Real-time media fails in ways ordinary web apps do not — a participant on a bad network, a region with packet loss, an agent that fell behind. You want per-session quality metrics (connection time, packet loss, audio level, agent latency) collected centrally so support can answer "why was my meeting choppy?" without guessing. Build this in from Milestone 1; retrofitting telemetry into a live media system is painful.
Security starts with the token server you already built. Media is encrypted in transit by default in WebRTC, but access control is yours to enforce: signed, short-lived tokens, server-side checks on who can host or record, and time-limited credentials on the TURN relay so it cannot be abused as an open proxy. For meetings that demand it — telemedicine, finance — you add end-to-end encryption so that even the SFU cannot read the media, and you choose self-hosting so data never leaves your infrastructure. Compliance regimes like SOC 2 and, for health data, HIPAA are achievable on this architecture; they are configuration and process, not a different design.
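Time-limited TURN credentials are typically derived, not stored. The sketch below assumes coturn is running with its shared-secret mode enabled (`use-auth-secret` with a `static-auth-secret`), where the username carries an expiry timestamp and the password is an HMAC-SHA1 of that username. The function name `turn_credentials` and the parameter names are illustrative; check your relay's configuration before relying on the exact format.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def turn_credentials(shared_secret: str, user_id: str,
                     ttl_seconds: int = 3600,
                     now: Optional[int] = None) -> dict:
    """Derive an ephemeral TURN username and password.

    The username is "<expiry-unix-time>:<user id>"; the relay recomputes
    the HMAC with the same shared secret and rejects expired usernames,
    so no per-user password ever needs to be stored.
    """
    issued = int(time.time()) if now is None else now
    username = f"{issued + ttl_seconds}:{user_id}"
    digest = hmac.new(shared_secret.encode("utf-8"),
                      username.encode("utf-8"),
                      hashlib.sha1).digest()
    return {
        "username": username,
        "credential": base64.b64encode(digest).decode("ascii"),
    }
```

Your backend would hand these values to the client alongside the room token, so a leaked credential stops working after the lifetime you chose and the relay cannot be used as an open proxy.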
The law has a new, dated deadline you must plan around. The European Union's AI Act, Regulation (EU) 2024/1689, makes its transparency rules under Article 50 apply from 2 August 2026. The rule that touches this platform: where your product generates or substantially alters a person's image or voice — an AI avatar, a cloned or synthesized voice, a translation spoken in the user's own voice — participants must be told the content is artificially generated. Build that disclosure in from the start; it is a small feature early and an expensive retrofit late. The full regulatory picture, including the biometric rules, is in lesson 8.5 on EU AI Act engineering. This is engineering context, not legal advice — confirm the specifics with counsel for your market.
Where Fora Soft Fits In
Fora Soft has built real-time video software since 2005, and the platform described here — an SFU for media, an agent pattern for AI, on-device cleanup, and a boring reliable backend — is the backbone of the conferencing, e-learning, and telemedicine products we ship. We run production LiveKit deployments that pair the SFU with the Agents framework at sub-300-millisecond latency, scale them horizontally on Kubernetes, and stand them up to SOC 2 and, where health data is involved, HIPAA expectations. The build order and the build-versus-buy verdicts in this article are not theory for us; they are the checklist we apply when scoping a conferencing product, because they are the difference between a feature that ships and one that quietly makes every meeting worse or every invoice larger. Our verticals here are video conferencing, e-learning, telemedicine, and surveillance, where real-time AI on live video is the core of the product rather than a decoration.
What To Read Next
- Phase 6 Capstone — the three-plane architecture of an AI call
- Lesson 6.14 — LiveKit real-time AI meeting assistant: architecture and pricing
- Lesson 9.1 — AI in video conferencing: the engineering playbook
References
- W3C — WebRTC 1.0: Real-Time Communication Between Browsers (W3C Recommendation, 26 January 2021; the SFU/peer media model the platform is built on). https://www.w3.org/TR/webrtc/
- IETF — RFC 8825: Overview — Real-Time Protocols for Browser-Based Applications (the WebRTC architecture overview). https://www.rfc-editor.org/rfc/rfc8825
- IETF — RFC 6716: Definition of the Opus Audio Codec (September 2012; the codec every WebRTC endpoint speaks, so audio AI must run on raw audio before it). https://www.rfc-editor.org/rfc/rfc6716
- IETF — RFC 8656: Traversal Using Relays around NAT (TURN) (the relay standard coturn implements; required for ~15–20% of networks). https://www.rfc-editor.org/rfc/rfc8656
- IETF — RFC 8489: Session Traversal Utilities for NAT (STUN) (the companion standard for discovering a device's public address). https://www.rfc-editor.org/rfc/rfc8489
- European Union — Regulation (EU) 2024/1689 (AI Act), Article 50 — Transparency obligations (disclosure of AI-generated or manipulated audio/video; transparency rules apply from 2 August 2026). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- LiveKit — Agents documentation (an agent is any Python/Node program joining a room as a full real-time participant; built-in turn detection and noise cancellation). https://docs.livekit.io/agents/
- LiveKit — Self-hosting overview and Pricing (Apache-2.0 server; managed WebRTC minutes ~$0.0004–0.0005/min and agent session minutes ~$0.01/min in 2026). https://docs.livekit.io/home/self-hosting/deployment/
- LiveKit — How we built a globally distributed mesh network to scale WebRTC (cascaded SFUs treat a relay link as another participant; tier guidance for 100/1K/10K). https://blog.livekit.io/scaling-webrtc-with-distributed-mesh/
- Deepgram — Pricing (Nova-3 streaming speech-to-text ≈ $0.0077/min in 2026; Flux tuned for end-of-speech detection latency). https://deepgram.com/pricing
- Smallest.ai — Designing Voice Assistants: STT, LLM, TTS, Tools, and Latency Budget (2026 800 ms voice-loop breakdown: VAD 50, STT 150, LLM TTFT 400, TTS 150, network 50). https://smallest.ai/blog/designing-voice-assistants-stt-llm-tts-tools-and-latency-budget
- Rikorose — DeepFilterNet (open-source full-band 48 kHz deep-learning noise suppression for telephony and WebRTC). https://github.com/Rikorose/DeepFilterNet
- Google AI Edge — MediaPipe Image Segmenter / Selfie Segmentation (the on-device segmentation model behind real-time background blur in the browser). https://ai.google.dev/edge/mediapipe/solutions/vision/image_segmenter
- Google — WebGPU now supported in major browsers (web.dev; the on-device graphics API that makes live in-browser blur fast enough). https://web.dev/blog/webgpu-supported-major-browsers


