Published 2026-05-24 · 22 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If your product is a video conferencing tool, a telemedicine consultation app, an asynchronous video messaging product, an online tutoring platform, or anything that puts a webcam on screen, your users expect to blur the room behind them. They expect it to work without installing software, without giving up frame rate, without paying for it, and without making them look like a green-screen TV anchor with a fuzzy halo around their hair. The companies that get this wrong (and several large ones do) lose users to the ones that get it right. The pipeline described here is the same architecture that ships in Google Meet, in the Zoom Web SDK, in 100ms and Daily and LiveKit room widgets, and in every white-label conferencing tool we have built since 2024. This article assumes you have read lesson 2.1 on computer-vision preprocessing and lesson 2.4 on SAM 2, because background blur is a special case of semantic segmentation, and the segmentation primitives in those lessons sit at the foundation of every effect we are about to describe. By the end you will be able to specify the pipeline, pick the right rendering backend, plan the cross-browser fallback, and avoid the three pitfalls that account for almost every blur-quality complaint we see in production.

What "Background Blur" Actually Is — Pixels, Not Magic

The effect that everyone calls background blur is a two-stage image processing pipeline. The first stage is semantic segmentation — for every pixel of the input frame, a small neural network emits a probability between 0 and 1 that the pixel belongs to a person rather than to the room. The second stage is compositing — using that probability mask, the renderer mixes the sharp original frame (kept for the person pixels) with a Gaussian-blurred copy of the same frame (used for the room pixels). The result looks like depth-of-field on a real camera, but no camera was involved — the blur is software, the depth is inferred, and the boundary between sharp and blurred is dictated by a 256×256 grayscale mask.
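The compositing stage reduces to a per-pixel linear blend. The sketch below is a CPU illustration only, assuming RGBA byte buffers of equal size and a mask that has already been resampled to frame resolution; the function name compositePixels is hypothetical, and a production pipeline would run the same arithmetic in a shader.

```typescript
// Per-pixel blend: out = m * sharp + (1 - m) * blurred, where m is the person probability.
// Illustrative CPU version; buffer layout (RGBA bytes, one mask float per pixel) is an assumption.
function compositePixels(
  sharp: Uint8ClampedArray,    // original frame, 4 bytes per pixel
  blurred: Uint8ClampedArray,  // blurred copy, same dimensions as `sharp`
  mask: Float32Array,          // one value in [0, 1] per pixel
): Uint8ClampedArray {
  if (sharp.length !== blurred.length || sharp.length !== mask.length * 4) {
    throw new Error("buffer sizes do not match");
  }
  const out = new Uint8ClampedArray(sharp.length);
  for (let p = 0; p < mask.length; p++) {
    const m = Math.min(1, Math.max(0, mask[p])); // clamp defensively
    for (let c = 0; c < 3; c++) {
      const i = p * 4 + c;
      out[i] = m * sharp[i] + (1 - m) * blurred[i];
    }
    out[p * 4 + 3] = 255; // keep the output opaque
  }
  return out;
}
```

A mask value of 1 keeps the sharp pixel, 0 keeps the blurred pixel, and anything in between mixes the two, which is why the quality of the mask edge dominates the quality of the effect.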

A useful analogy is rotoscoping in traditional animation. The mask is the same idea as a painter cutting a silhouette out of a sheet of paper and laying it over a different background. The neural network is the painter; the rest of the rendering pipeline is the act of laying the silhouette over a blurred copy of the room. The model is the only "intelligent" component; everything around it is plumbing.

The reason the effect feels modern is the model. Five years ago, segmenting a person from the background in real time required either a depth camera (Microsoft Kinect, iPhone TrueDepth) or a server-side GPU. The smartphone-grade mobile architectures that landed between 2019 and 2022 — MobileNetV3 and its derivatives — made it possible to run real-time segmentation on the same CPU that decodes your webcam feed. The model used by MediaPipe's selfie segmentation, and by ML Kit, and by a variant of Google Meet's blur, is one of those mobile architectures.

The MediaPipe Selfie Segmentation Model — Numbers That Matter

The MediaPipe selfie segmenter is a binary semantic segmentation network with 106,000 parameters and a file size of 447 KB in float32, dropping to around 230 KB in float16. The architecture is a custom MobileNetV3 backbone modified with the encoder-decoder structure used in U-Net-style segmenters: the backbone produces multi-scale features that get up-sampled and combined to produce a single-channel output the same height and width as the input. The model card describes two variants: a general model that takes a 256×256 RGB input and emits a 256×256 mask, and a landscape model that takes a 144×256 input and emits a 144×256 mask. The landscape model has fewer FLOPs than the general model (its input has about 56% as many pixels) and runs correspondingly faster; it is the model that powers a variant of Google Meet's background blur.

Latency, measured by Google on a Pixel 6 with the GPU backend, is around 2 milliseconds per frame for the general model and around 1 millisecond per frame for the landscape model. On a modern laptop CPU (Apple M2, 12th-gen Intel Core i7) running through the WebAssembly SIMD backend that ships with the @mediapipe/tasks-vision package, the same model runs in 5–10 milliseconds per frame — fast enough for 30 fps webcam input with budget left over for the compositing step. On the WebGPU backend, the model runs in 1–3 milliseconds per frame on the same hardware, leaving ~30 milliseconds of the 33 ms-per-frame budget for everything else.

The selfie segmenter itself emits a single person-probability channel. MediaPipe also ships a separate multiclass selfie model that emits six classes (background, hair, body skin, face skin, clothes, others), but for background blur we only care about one distinction: person versus not person. Pipelines built on the multiclass model collapse its output to a binary mask either with a per-pixel argmax followed by a not-background test, or by adding the five non-background channels together and thresholding the sum.
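The sum-and-threshold collapse can be sketched as follows. This is a minimal illustration, assuming the per-class probabilities arrive as one Float32Array per class with channel 0 as background; the function name collapseToPersonMask and the 0.5 threshold are illustrative choices, not MediaPipe API.

```typescript
// Collapse per-class probabilities into a binary person mask.
// Assumption: channels[0] is background; channels[1..] are the person-related classes.
function collapseToPersonMask(channels: Float32Array[], threshold = 0.5): Uint8Array {
  if (channels.length < 2) {
    throw new Error("need background plus at least one person class");
  }
  const pixelCount = channels[0].length;
  const mask = new Uint8Array(pixelCount);
  for (let p = 0; p < pixelCount; p++) {
    let personProbability = 0;
    for (let c = 1; c < channels.length; c++) {
      personProbability += channels[c][p];
    }
    mask[p] = personProbability >= threshold ? 1 : 0; // 1 = person, 0 = room
  }
  return mask;
}
```

For a soft-edged blend you would keep the summed probability instead of thresholding it; the binary form is shown because it is the one the paragraph describes.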

[Figure 1: architecture diagram of the MediaPipe selfie-segmentation pipeline. A 256×256 webcam frame enters the MobileNetV3 encoder, passes through skip-connected decoder stages, and emits a 256×256 probability mask. The mask is then up-sampled to the original frame resolution and used by the WebGPU compositor to mix the sharp foreground with a blurred copy of the original frame.]

Figure 1. The end-to-end pipeline. The model itself is tiny; most of the engineering work lives in the input preprocessing, mask upscaling, and compositing stages.

The model has two known weaknesses, and every production deployment has to engineer around them. First, the boundary is only as sharp as the 256-pixel resolution allows: a 1080p webcam feed gets its mask up-sampled by roughly 7.5× horizontally and 4× vertically, and a naive bilinear up-sample produces a soft edge that looks fine on the cheek but bad on individual strands of hair. Second, the model was trained on single-person, head-and-shoulders, indoor footage taken at consumer-camera distance; it generalises poorly to multiple people, to extreme close-ups, to back-lit subjects, and to scenes shot at a wide-angle distance from the camera. Both weaknesses are addressable, and we cover the fixes below.
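To see where the soft edge comes from, here is a small sketch of bilinear mask sampling. It assumes a row-major Float32Array mask, texel centres at half-integer coordinates, and clamp-to-edge addressing; the function name sampleMaskBilinear is hypothetical, and on the GPU the texture sampler does this work for you.

```typescript
// Bilinear sample of a low-resolution mask at a full-resolution output pixel (x, y).
// Assumptions: row-major mask, texel centres at +0.5, clamp-to-edge at the borders.
function sampleMaskBilinear(
  mask: Float32Array, maskW: number, maskH: number,
  x: number, y: number, outW: number, outH: number,
): number {
  // Map the output pixel centre into mask texel space.
  const u = ((x + 0.5) / outW) * maskW - 0.5;
  const v = ((y + 0.5) / outH) * maskH - 0.5;
  const x0 = Math.min(maskW - 1, Math.max(0, Math.floor(u)));
  const y0 = Math.min(maskH - 1, Math.max(0, Math.floor(v)));
  const x1 = Math.min(maskW - 1, x0 + 1);
  const y1 = Math.min(maskH - 1, y0 + 1);
  const fx = Math.min(1, Math.max(0, u - x0));
  const fy = Math.min(1, Math.max(0, v - y0));
  const top = mask[y0 * maskW + x0] * (1 - fx) + mask[y0 * maskW + x1] * fx;
  const bottom = mask[y1 * maskW + x0] * (1 - fx) + mask[y1 * maskW + x1] * fx;
  return top * (1 - fy) + bottom * fy;
}
```

A hard 0-to-1 step between two neighbouring mask texels becomes a ramp spread over every output pixel that falls between them, so the larger the up-sampling factor, the wider the band of in-between values around the silhouette.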

Why WebGPU Changes The Math

The model used to run on WebGL. It still can, and as of 2026 WebGL is still the right backend for users on Safari versions that predate default WebGPU support, on Firefox versions older than the 2025 WebGPU-enabled releases, and for the roughly 8% of Chrome users globally whose hardware is on the WebGPU blocklist. WebGL works because the segmentation network is a stack of convolutions, and a fragment shader can be coerced into multiplying matrices by encoding tensors as textures and treating each draw call as a compute dispatch. This is the same kind of workaround that the TensorFlow.js team built and that the original MediaPipe web port shipped with, and it ranges from "fast enough" to "painful" depending on driver and GPU.

WebGPU was designed with general-purpose compute as a first-class workload, not only graphics. A compute shader has direct buffer access, shared workgroup memory, and explicit synchronisation; it can dispatch 16×16×1 workgroups of threads against an arbitrary tensor without ever pretending to be a triangle pipeline. The practical effect, on the selfie segmentation network, is a 3–5× speed-up over WebGL on the same hardware and a 30–40% reduction in GPU memory transfer overhead. Vendor benchmarks from the SitePoint 2026 WebGPU survey (Google, NVIDIA, Intel) report that for AI inference more broadly, WebGPU delivers 15–30× speed-ups over WebGL on transformer architectures. Selfie segmentation sits well below that range because it is a pure-CNN model with small per-layer activations, but the gain is still substantial.
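As a small illustration of what dispatching 16×16 workgroups against a tensor means, the sketch below computes how many workgroups cover a tensor. It assumes one thread per output element, which real inference kernels often do not use; the helper name workgroupCounts is hypothetical. The two numbers it returns are what you would pass to dispatchWorkgroups on a WebGPU compute pass.

```typescript
// Number of workgroups needed to cover a width x height tensor when each
// workgroup handles a tileSize x tileSize block (one thread per element assumed).
function workgroupCounts(width: number, height: number, tileSize = 16): [number, number] {
  if (width <= 0 || height <= 0 || tileSize <= 0) {
    throw new Error("dimensions must be positive");
  }
  // Round up so partially filled tiles at the right and bottom edges are still covered.
  return [Math.ceil(width / tileSize), Math.ceil(height / tileSize)];
}

const generalModelGrid = workgroupCounts(256, 256);   // general model output
const landscapeModelGrid = workgroupCounts(256, 144); // landscape model output
```

Because the counts round up, the shader itself has to bounds-check its thread coordinates whenever the tensor size is not a multiple of the tile size.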

The other reason WebGPU matters is browser support. Chrome and Edge enabled WebGPU by default on desktop in May 2023 (version 113) and on Android in early 2024 (version 121). Firefox 141, released in mid-2025, enabled it by default on Windows. Safari 26, released in September 2025, enabled it by default on macOS, iPadOS, and iOS. As of May 2026, somewhere between 92% and 96% of global Chrome users, 100% of recent-Safari users, and the majority of recent-Firefox users have WebGPU available. That's enough coverage to make WebGPU the primary backend and WebGL the fallback, which is the inversion that motivated rewriting the pipeline.

[Figure 2: bar chart comparing model inference latency across three backends on the same MacBook Pro M2 hardware: WebAssembly SIMD CPU at around 8 ms per frame, WebGL GPU at around 4 ms per frame, and WebGPU compute at around 1.5 ms per frame. Each bar is labelled with the percentage of the 33 ms 30-fps frame budget consumed.]

Figure 2. The selfie-segmentation model on three browser backends. WebGPU is the only backend that leaves the full compositing budget intact on a 30 fps feed.

The Full Pipeline — From getUserMedia To WebRTC

Production background blur is not a single API call. It is a four-stage pipeline that starts at the camera and ends at the WebRTC peer connection.

Stage one is capture. The navigator.mediaDevices.getUserMedia() call returns a MediaStream containing a single MediaStreamTrack of video. That track is the source for every downstream stage. For 720p input at 30 fps, reading the track through a MediaStreamTrackProcessor yields 1280×720 VideoFrame objects, the frame type defined by the WebCodecs API.

Stage two is inference. Each frame is handed to MediaPipe Image Segmenter via the segmentForVideo(videoFrame, timestamp) method, which down-samples the input to 256×256, runs the model on the WebGPU backend, and returns a result whose mask is an MPMask object. The mask is a GPU texture by default on WebGPU; it never has to round-trip back to CPU memory, which is half the reason the pipeline is fast.

Stage three is compositing. A WebGPU fragment-shader pipeline takes three inputs: the original 1280×720 frame, a Gaussian-blurred copy of the same frame at the same resolution, and the 256×256 mask. For each output pixel, the shader samples the mask (with bilinear filtering), uses the mask value as the blend weight between the sharp and blurred frames, and writes the result to an offscreen canvas. The Gaussian blur itself is two separable passes (horizontal then vertical) of a 1-D 21-tap kernel with a sigma chosen to match the visual effect the product wants — sigma 8 for a Zoom-like medium blur, sigma 16 for a heavy blur, sigma 4 for the "barely there" professional look.
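The weights for one separable pass can be computed as in the sketch below. It assumes taps spaced one pixel apart and an odd tap count; the function name gaussianKernel1D is hypothetical. Note that under those assumptions a 21-tap kernel reaches only ±10 pixels, which is ±1.25σ at sigma 8, so the tails of the Gaussian are truncated; wider tap spacing or blurring a down-scaled copy are two ways to cover more of the curve with the same tap count.

```typescript
// Normalised 1-D Gaussian weights for one pass of a separable blur.
// Assumes taps one pixel apart, centred on the output pixel.
function gaussianKernel1D(taps: number, sigma: number): number[] {
  if (taps < 1 || taps % 2 === 0) {
    throw new Error("taps must be a positive odd number");
  }
  if (sigma <= 0) {
    throw new Error("sigma must be positive");
  }
  const radius = (taps - 1) / 2;
  const weights: number[] = [];
  let sum = 0;
  for (let i = -radius; i <= radius; i++) {
    const w = Math.exp(-(i * i) / (2 * sigma * sigma));
    weights.push(w);
    sum += w;
  }
  // Dividing by the sum keeps overall brightness unchanged after truncation.
  return weights.map((w) => w / sum);
}

const mediumBlurKernel = gaussianKernel1D(21, 8); // the "medium" preset from the text
```

The same array is used for the horizontal and the vertical pass, which is what makes the blur cost proportional to the tap count rather than its square.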

Stage four is encoding. Each composited frame from the offscreen canvas is written into a MediaStreamTrackGenerator (Chrome) or a VideoTrackGenerator (Safari and Firefox, via the spec API), which exposes a MediaStreamTrack that looks to WebRTC exactly like a camera track. That track is added to the RTCPeerConnection with pc.addTrack(processedTrack, processedStream), and from the recipient's perspective there is no difference between a blurred and an unblurred sender.

The Insertable Streams machinery is the part that ties stages two through four together. The MediaStreamTrackProcessor exposes the camera track as a ReadableStream of VideoFrame objects; you transform those frames through stages two and three; and you write the results into the WritableStream of VideoFrame objects exposed by a VideoTrackGenerator. The transformation happens entirely off the main thread inside a Worker, which keeps the UI responsive even on a 4-year-old laptop.

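// Skeleton: runs inside a module Worker, WebGPU backend assumed.
// cameraTrack and blurSigma are assumed to be supplied by the main thread;
// compositorRun is a placeholder for the WebGPU shader and must return a new VideoFrame.
import { ImageSegmenter, FilesetResolver } from "@mediapipe/tasks-vision";

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm",
);
const segmenter = await ImageSegmenter.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: "selfie_segmenter.tflite",
    delegate: "GPU",          // GPU delegate (the WebGPU path assumed in this article)
  },
  runningMode: "VIDEO",
  outputCategoryMask: false,
  outputConfidenceMasks: true, // soft 0..1 mask, used as the blend weight
});

const processor = new MediaStreamTrackProcessor({ track: cameraTrack });
// Prefer the spec API; fall back to Chrome's pre-spec generator.
const generator = "VideoTrackGenerator" in self
  ? new VideoTrackGenerator()
  : new MediaStreamTrackGenerator({ kind: "video" });
const reader = processor.readable.getReader();
const writer = generator.writable.getWriter();

while (true) {
  const { value: frame, done } = await reader.read();
  if (done) break;
  const result = segmenter.segmentForVideo(frame, performance.now());
  try {
    // Index 0 is assumed to be the person channel; verify for your model.
    const mask = result.confidenceMasks[0];
    const composited = compositorRun(frame, mask, blurSigma);  // WebGPU shader
    await writer.write(composited);  // the generator takes ownership of the frame
  } finally {
    result.close();  // releases the masks held by the result
    frame.close();   // unclosed VideoFrames eventually stall the capture pipeline
  }
}
await writer.close();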

The full implementation is roughly 350 lines of TypeScript including the WebGPU shader. We keep an open-source reference in the Fora Soft GitHub organisation, and we ship the same pattern as the default in our LiveKit and 100ms client wrappers.

Cross-Browser Reality — A Capability Detection Ladder

Insertable Streams is not yet a uniformly supported API. Chrome and Edge ship Google's pre-spec MediaStreamTrackProcessor and MediaStreamTrackGenerator on the main thread and in workers, with both audio and video tracks. Safari 18 and Firefox 141+ ship the W3C-standard variants, MediaStreamTrackProcessor and VideoTrackGenerator, but only inside a worker and only for video tracks. The standard variant is what Chrome will eventually align to, but the pre-spec variant is still the most widely deployed.

The right engineering response is a capability-detection ladder that picks the best available pipeline at runtime, rather than a hard browser-version check. The ladder looks like this, top to bottom:

| Tier | Capability check | Pipeline |
|------|------------------|----------|
| 1 | WebGPU + Insertable Streams + Worker | WebGPU model + WebGPU compositor + Worker; 1–3 ms model, 30+ fps |
| 2 | WebGL + Insertable Streams + Worker | WebGL model + WebGL compositor + Worker; 4–8 ms model, 30 fps |
| 3 | WebGPU + canvas.captureStream() | WebGPU model + canvas re-capture; 30 fps but adds 16 ms latency |
| 4 | WebAssembly CPU + canvas.captureStream() | WASM SIMD model + 2D canvas blur; 15 fps acceptable, no GPU at all |
| 5 | None | Disable the feature, surface a "your browser doesn't support background blur" toast |

The reason the ladder matters is that hitting tier 4 on a Safari 17 desktop user gives a visibly worse result than hitting tier 1 on a Chrome user on the same hardware, and users on tier 4 will report it as a bug. We have learned to surface the active tier to the user in a small unobtrusive UI hint — "background blur is using your CPU; quality may be reduced" — because users who understand the trade-off file fewer support tickets than users who don't.

A common production mistake is to pick tier 1 without verifying that the user's GPU isn't on Chrome's WebGPU blocklist. Around 8% of Chrome desktop users have WebGPU available in the API but disabled at runtime by the GPU blocklist (typically older Intel HD Graphics on Windows). Capability detection means checking that the WebGPU adapter request actually succeeds, not just that navigator.gpu is defined.
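The ladder can be expressed as a pure function over capability flags, as in the sketch below. The interface and function names (BlurCapabilities, pickBlurTier) are hypothetical, and how each flag is probed is left to the application. The key point from the paragraph above is encoded in the webgpuAdapter flag: it should be set only after navigator.gpu.requestAdapter() has resolved to a non-null adapter, not merely because navigator.gpu exists.

```typescript
// Capability flags gathered at runtime (names are illustrative).
interface BlurCapabilities {
  webgpuAdapter: boolean;             // requestAdapter() returned a real adapter
  webgl: boolean;                     // a WebGL context could be created
  insertableStreamsInWorker: boolean; // track processor + generator usable in a Worker
  canvasCaptureStream: boolean;       // canvas.captureStream() is available
  wasm: boolean;                      // WebAssembly CPU inference is available
}

type BlurTier = 1 | 2 | 3 | 4 | 5;

// Walk the ladder top to bottom and return the first tier whose checks all pass.
function pickBlurTier(caps: BlurCapabilities): BlurTier {
  if (caps.insertableStreamsInWorker && caps.webgpuAdapter) return 1;
  if (caps.insertableStreamsInWorker && caps.webgl) return 2;
  if (caps.canvasCaptureStream && caps.webgpuAdapter) return 3;
  if (caps.canvasCaptureStream && caps.wasm) return 4;
  return 5; // nothing usable: disable the feature and tell the user
}
```

Keeping the decision separate from the probing makes the ladder easy to unit-test and lets the UI read the chosen tier to show the CPU-fallback hint described above.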

A Numeric Example — Frame Budget On A 30 Fps Webcam

A 30-fps feed gives you 33.33 milliseconds per frame to do everything: read the frame, downsample it, run the model, upsample the mask, blur the frame, composite, and hand it back to WebRTC. Anything that overruns the budget either drops frames or adds latency, both of which the user notices.

Worked example on a MacBook Pro M2 with Chrome 138 and a 1280×720 webcam input. The model runs on WebGPU and reports 1.5 ms median inference time over a 1-minute capture. The down-sample from 1280×720 to 256×256 takes 0.3 ms (a single GPU draw). The Gaussian blur, two separable passes at sigma 8, takes 1.8 ms total. The composite shader takes 0.4 ms. The WebCodecs encode-back-into-a-VideoFrame step takes around 0.6 ms. Total: 4.6 ms per frame. The browser's compositor and the WebRTC encoder consume another ~5 ms before the frame leaves the box, and the budget still has 23 ms of headroom.

On a 6-year-old Intel Core i5 laptop without a discrete GPU, the same pipeline runs on WebGL at tier 2: 7 ms model inference, 1 ms down-sample, 4 ms blur, 1 ms composite, 1 ms encode — 14 ms total. Still under budget. On Safari 17 on the same hardware, tier 4 kicks in: 18 ms model on WASM SIMD CPU, 6 ms blur on 2D canvas, 2 ms composite — 26 ms total, leaving almost no headroom and a visibly soft mask edge.
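The arithmetic in the worked examples above can be reproduced with a few lines. This sketch only sums the per-stage pipeline costs quoted in the text; it does not include the roughly 5 ms of browser compositor and WebRTC encoder time, and the helper name frameBudgetReport is hypothetical.

```typescript
interface BudgetReport {
  budgetMs: number;   // time available per frame at the given frame rate
  totalMs: number;    // sum of the pipeline stage costs
  headroomMs: number; // what is left before frames drop
}

// Sum per-stage costs and compare them with the per-frame budget.
function frameBudgetReport(stageCostsMs: number[], fps = 30): BudgetReport {
  const budgetMs = 1000 / fps;
  let totalMs = 0;
  for (let i = 0; i < stageCostsMs.length; i++) {
    totalMs += stageCostsMs[i];
  }
  return { budgetMs, totalMs, headroomMs: budgetMs - totalMs };
}

// Stage costs in milliseconds, copied from the worked examples.
const tier1Budget = frameBudgetReport([1.5, 0.3, 1.8, 0.4, 0.6]); // model, down-sample, blur, composite, encode
const tier2Budget = frameBudgetReport([7, 1, 4, 1, 1]);           // same stages on WebGL
const tier4Budget = frameBudgetReport([18, 6, 2]);                // model, blur, composite on CPU
```

Logging these per-stage numbers from real sessions, rather than trusting a single lab measurement, is a cheap way to find out which tier your users actually land on.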

The takeaway is that the WebGPU pipeline is not just faster; it also leaves the most headroom for the compositor to do clean work. Tier 2 still fits within the budget. Tier 4 runs, but with little headroom and meaningfully lower visual quality.

Three Pitfalls We See Teams Ship

These are the three failure modes we have walked into in the field and that we now check for in every code review before a background-blur feature ships.

The first pitfall is the soft-edge halo. The 256×256 mask, naively bilinear-upsampled to 1280×720, produces a 3-to-4-pixel soft band around the silhouette where the mask value is somewhere between 0 and 1. In that band, the compositor mixes sharp and blurred pixels, and the result reads as a bright halo on a dark background or a dark halo on a bright one. The fix is a two-step post-process on the mask itself: a small Gaussian blur with sigma 2 (to remove the staircase artefacts), followed by a steep sigmoid remap that pushes mid-mask values toward 0 or 1 (mask = 1 / (1 + exp(-12 * (mask - 0.5)))). Both steps live in the WebGPU compositor shader; they cost nothing measurable. The result is a one-pixel-thick transition band that reads as natural depth of field rather than a fluorescent outline.
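The sigmoid remap from the formula above is a one-liner. The sketch below is a scalar CPU version for illustration, with the steepness of 12 and midpoint of 0.5 taken from the text; the function name sharpenMaskValue is hypothetical, and in the pipeline described here the same expression would live in the compositor shader.

```typescript
// Steep sigmoid that pushes mid-range mask values toward 0 or 1.
// Defaults (steepness 12, midpoint 0.5) are the values quoted in the text.
function sharpenMaskValue(mask: number, steepness = 12, midpoint = 0.5): number {
  return 1 / (1 + Math.exp(-steepness * (mask - midpoint)));
}

// A soft edge ramp before and after the remap.
const softRamp = [0, 0.25, 0.5, 0.75, 1];
const sharpRamp = softRamp.map((m) => sharpenMaskValue(m));
```

With these defaults an input of 0.25 maps to roughly 0.05 and 0.75 to roughly 0.95, so most of the band collapses to one side or the other. The endpoints land very close to, but not exactly on, 0 and 1, which is invisible after 8-bit quantisation.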

The second pitfall is hair flicker. The model was trained at 256×256 resolution, and the hair-versus-background boundary at that resolution is on the order of a single pixel wide. Frame-to-frame the model emits slightly different boundary masks even on perfectly stationary footage — the result reads as the hair "shimmering" when the user holds still. The fix is temporal smoothing: keep a small ring buffer of the last 3 frames' masks, and emit the average. Three frames is enough to remove the high-frequency flicker; more than five frames introduces visible lag when the user moves. The cost is one extra texture sample per output pixel per buffered frame; on WebGPU it is invisible in the budget.
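The ring-buffer average can be sketched on the CPU as follows. This is a minimal illustration, assuming masks arrive as equal-length Float32Array buffers; the class name MaskSmoother is hypothetical, and a GPU version would keep the previous masks as textures and average them in the shader instead.

```typescript
// Averages the most recent `capacity` masks to suppress frame-to-frame flicker.
class MaskSmoother {
  private readonly history: Float32Array[] = [];
  private next = 0; // slot that the next mask will occupy

  constructor(private readonly capacity = 3) {}

  push(mask: Float32Array): Float32Array {
    const copy = new Float32Array(mask); // keep our own copy; the caller may reuse its buffer
    if (this.history.length < this.capacity) {
      this.history.push(copy);           // still filling the ring
    } else {
      this.history[this.next] = copy;    // overwrite the oldest mask
    }
    this.next = (this.next + 1) % this.capacity;

    const averaged = new Float32Array(mask.length);
    for (let f = 0; f < this.history.length; f++) {
      const past = this.history[f];
      for (let p = 0; p < averaged.length; p++) {
        averaged[p] += past[p];
      }
    }
    for (let p = 0; p < averaged.length; p++) {
      averaged[p] /= this.history.length;
    }
    return averaged;
  }
}
```

Until the buffer is full the average is taken over however many masks have arrived, so the first frames are not dragged toward zero. Raising the capacity trades flicker for lag, which is the three-versus-five-frame trade-off described above.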

The third pitfall is multi-person scenes. A meeting room with two people sitting side by side at one camera reveals that the model treats both people as "person" pixels: there is no instance segmentation, no left-person-versus-right-person separation. For background blur this is usually fine; users want both faces sharp. But the moment a product feature requires single-person segmentation (a beauty filter that should only apply to the user, an avatar that should only replace one face), the right answer is not to fix MediaPipe; it is to upgrade to SAM 2 (lesson 2.4) with the user's face as the prompt. MediaPipe is for the room-versus-people boundary; SAM 2 is for the this-specific-person-versus-everyone-else boundary. Using the wrong tool for the job is the single most common architectural mistake we see in this space.

[Figure 3: a side-by-side comparison of three rendered outputs: a naive mask-upsample with visible halo, a sigmoid-remapped mask with clean edge, and a temporal-smoothed mask on hair detail. Each panel is labelled with the engineering technique that produced it.]

Figure 3. The three pitfalls and their fixes. The middle and right panels are the production-quality references; the left panel is what ships when the pipeline is taken from the MediaPipe README without further work.

Build Versus Buy — When MediaPipe Is The Wrong Choice

For most web-based video products, MediaPipe Image Segmenter on WebGPU is the right primitive. It is free, it is permissively licensed (Apache 2.0), it runs entirely on the user's device (which means zero per-user cloud cost and zero PHI / GDPR data exfiltration), and the quality is high enough that the average user cannot tell it from Zoom or Meet. We ship it as the default in our standard web conferencing reference architecture.

There are three situations where buying a commercial SDK (Krisp, NVIDIA Maxine, Banuba, Persona, agora.io background-blur extension) makes more sense than building on MediaPipe.

The first is when you also need real-time noise suppression, voice activity detection, and AI-based echo cancellation in the same product, and you want one vendor to support all of them. Krisp ships background blur as part of the same SDK that ships its noise-suppression model; the integration cost saves engineering time even if the per-feature quality is comparable to MediaPipe.

The second is when you need a hardware-accelerated effect on a native desktop app rather than the browser. NVIDIA Maxine runs on the user's RTX GPU via CUDA and delivers higher-quality results on supported hardware — but it requires an RTX-class GPU on the user's machine, which excludes the long tail of laptops and Macs. For a Windows-only enterprise desktop product targeting recent RTX hardware, Maxine is a legitimate choice. For a cross-platform web product, it is not.

The third is when you need signed, audited, AVA-certified background-blur on a regulated product (healthcare, defence, finance). Commercial vendors sell SOC 2 reports, HIPAA BAAs, EU AI Act conformance documents, and indemnification on the model's behaviour. MediaPipe is Apache-licensed code; the audit story is on you. For a telemedicine product where you negotiate vendor agreements with hospital procurement, the BAA from a commercial vendor is sometimes worth the per-user fee.

For everything else, and that is the vast majority of products, MediaPipe on WebGPU is the right answer in 2026. See lesson 6.6 on Krisp / Maxine / Dolby build-versus-buy for a more detailed vendor-by-vendor breakdown.

Where Fora Soft Fits In

Fora Soft has shipped background blur as part of WebRTC-based conferencing products since 2021, and we maintain the WebGPU compositor described in this article as a reusable component across our video conferencing, telemedicine, e-learning, and AR/VR projects. We have rolled out the same pipeline behind whitelabel conferencing tools for European telehealth platforms (where the on-device processing satisfies GDPR data-locality requirements), for online tutoring products serving children (where the room background can include other family members that should be blurred), and for OTT post-production workflows where editors review raw interviews and want a clean preview without renting a studio. If you are building any of those, the build pattern is the same; the difference is the verticalisation around the rest of the product.

References

  1. Google AI Edge. Image segmentation guide — MediaPipe Image Segmenter task documentation. https://ai.google.dev/edge/mediapipe/solutions/vision/image_segmenter (accessed 2026-05-24). The canonical task documentation for the Image Segmenter and the selfie-segmenter model.
  2. google-ai-edge/mediapipe. Selfie Segmentation solution — model card and documentation. https://github.com/google-ai-edge/mediapipe/blob/master/docs/solutions/selfie_segmentation.md (accessed 2026-05-24). Source of the 106K parameter count, 447 KB model size, 256×256 and 144×256 input resolutions.
  3. W3C Media Capture and Streams WG. Insertable Streams for MediaStreamTrack — W3C draft specification. https://www.w3.org/TR/mediacapture-transform/ (accessed 2026-05-24). The standardised MediaStreamTrackProcessor / VideoTrackGenerator API surface used by Safari 18 and Firefox 141+.
  4. MDN Web Docs. Insertable Streams for MediaStreamTrack API. https://developer.mozilla.org/en-US/docs/Web/API/Insertable_Streams_for_MediaStreamTrack_API (accessed 2026-05-24). Browser-support matrix and worker-vs-main-thread differences.
  5. W3C. WebGPU — W3C Candidate Recommendation. https://www.w3.org/TR/webgpu/ (accessed 2026-05-24). Authoritative source for the WebGPU compute pipeline used by the segmenter and compositor.
  6. W3C. WebCodecs — W3C Candidate Recommendation. https://www.w3.org/TR/webcodecs/ (accessed 2026-05-24). The VideoFrame API that wraps each camera frame as a GPU-backed handle.
  7. IETF. RFC 8829 — JavaScript Session Establishment Protocol (JSEP). https://www.rfc-editor.org/rfc/rfc8829 (accessed 2026-05-24). The WebRTC offer/answer machinery into which the processed MediaStreamTrack is wired.
  8. Mozilla Hacks. Unbundling MediaStreamTrackProcessor and VideoTrackGenerator — Firefox 141 ship note (Nov 2025). https://blog.mozilla.org/webrtc/unbundling-mediastreamtrackprocessor-and-videotrackgenerator/ (accessed 2026-05-24). Worker-only video-only standard implementation in Firefox.
  9. Chrome Platform Status. MediaStreamTrack Insertable Streams (Breakout Box). https://chromestatus.com/feature/5499415634640896 (accessed 2026-05-24). Chrome's pre-spec implementation, shipped in 2021.
  10. WebRTC Hacks / WebRTC.Ventures. Background Removal Using Insertable Streams (2023). https://webrtc.ventures/2023/02/background-removal-using-insertable-streams/ (accessed 2026-05-24). Detailed walkthrough of the Insertable Streams + segmentation pattern; the architectural ancestor of the pipeline described in this article.
  11. SitePoint. WebGPU vs. WebGL: Performance Benchmarks for Client-Side Inference (2026). https://www.sitepoint.com/webgpu-vs-webgl-inference-benchmarks/ (accessed 2026-05-24). Source for the 15–30× WebGPU-over-WebGL speed-up on transformer architectures and 3–5× on CNN architectures.
  12. Howard, A., et al. Searching for MobileNetV3. arXiv:1905.02244 (2019). https://arxiv.org/abs/1905.02244 . The architecture family from which the selfie-segmenter backbone derives.
  13. Qualcomm AI Hub. MediaPipe-Selfie-Segmentation model card. https://aihub.qualcomm.com/models/mediapipe_selfie (accessed 2026-05-24). Independent third-party model card with cross-platform latency benchmarks (e.g., 0.733 ms on Galaxy S23 Ultra).
  14. WebRTC Project. getUserMedia — Media Capture and Streams W3C Recommendation. https://www.w3.org/TR/mediacapture-streams/ (accessed 2026-05-24). The standard for the camera-track input side of the pipeline.
  15. NVIDIA Maxine. Background Blur — Video Effects (VFX) SDK User Guide v1.1.0. https://docs.nvidia.com/maxine/vfx/1.1.0/Filters/BackgroundBlur.html (accessed 2026-05-24). Reference for the commercial-vendor build-versus-buy comparison.