Beauty Filters, Gaze Correction, AR Effects

Why This Matters

If your product shows a person's face on camera — video conferencing, telemedicine, a virtual classroom, a dating app, live shopping — users now expect the same effects they get on Snapchat and TikTok: a flattering filter, eyes that meet the camera, a virtual hat or pair of glasses they can try on. These features feel like polish, but they decide whether people turn their camera on at all, which is the difference between an engaged call and a grid of black tiles. This article is for the product manager, founder, or engineering lead who has to decide whether to build these effects, buy them, or skip them, and who needs to talk to engineers about cost, performance, and legal risk without getting lost. By the end you will understand the one technology underneath all three effects, where it runs in a real-time call, what it costs in milliseconds and licence fees, and the specific regulation that turns a harmless beauty filter into a compliance problem the moment it tries to read a mood. For the related problem of replacing the background behind the face, see our companion articles on background blur and virtual backgrounds; this one is about effects on the face itself.

One Map, Three Jobs

Start with the idea that ties everything together, because it makes the rest simple. A beauty filter, a gaze corrector, and a virtual-hat effect are not three unrelated technologies. Each one first answers the same question — where exactly is this face, and where is every part of it right now — and only then does its own specific job. Build or buy that one answer well, and all three effects become variations on a theme rather than three separate projects.

The answer is a set of tracked points called landmarks. A facial landmark is a single labelled point on the face — the inner corner of the left eye, the tip of the nose, a spot on the jaw — that a model finds and follows in every video frame. Connect those points and you get a mesh: a fine net stretched over the face that bends and turns as the real face moves. That mesh is the shared map. Think of it like the wireframe an animator rigs before adding a character's skin — once the wireframe tracks the face, anything you attach to it tracks too.

Each effect reads the same map differently. A beauty filter uses the mesh to know which pixels are skin (smooth those) and which points form the jaw or nose (nudge those). Gaze correction uses the map to find the small rectangle around the eyes, then redraws only that patch. An AR effect uses the map's 3D shape and orientation to place a virtual object so it sits on the face at the right angle. Same map, three jobs.

Figure 1. The shared foundation: one real-time face-landmark map feeds beauty filters, gaze correction, and AR effects alike.

The Map Itself: MediaPipe Face Landmarker

The most common way to build that map for free is a Google tool called the MediaPipe Face Landmarker. MediaPipe is Google's open-source kit of ready-made, on-device machine-learning features for video; the Face Landmarker is the one that finds and tracks faces. It runs on phones, in browsers, and on servers, and it is the default starting point for teams that build face effects without paying a vendor.

The Face Landmarker produces three things from each frame, and knowing them tells you exactly what your effects have to work with. The first is the mesh: an estimate of 478 three-dimensional points on the face. That number is worth pausing on: 468 of those points cover the face surface, and the remaining 10 sit on the eyes, five per eye, tracking the iris, the colored ring that gaze effects depend on. The original 468-point face mesh came out of a 2019 Google Research paper that showed a single small network could infer a face shape this dense in real time on a phone; the iris points were added later, bringing the total to 478.

The second output is a set of 52 blendshape scores. A blendshape is a number from 0 to 1 that measures how much one specific expression is happening right now — how open the jaw is, how raised the left eyebrow is, how much the mouth is smiling. Fifty-two of these together describe the face's current expression as a list of dials. This is the output that drives a cartoon avatar that mimics your face, and, as we will see, the output that draws regulatory attention.

The third output is a transformation matrix, which is a compact block of numbers describing how the face is positioned in space — turned left, tilted up, leaning toward or away from the camera. An AR effect needs this to place a virtual object at the correct angle. Together these three outputs — the 478-point mesh, the 52 expression dials, and the position matrix — are everything the three effects read.
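To make the three outputs concrete, here is a minimal Python sketch of a container an effect might receive once per frame. The class name FaceMap and its methods are hypothetical and are not part of MediaPipe. The sketch also assumes the 10 iris points are stored after the 468 face-surface points and that the position matrix is a 4x4 grid of numbers; check both against the landmarker's own documentation before relying on them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]

FACE_SURFACE_POINTS = 468  # dense face mesh from the 2019 paper
IRIS_POINTS = 10           # five per eye, added later
TOTAL_POINTS = FACE_SURFACE_POINTS + IRIS_POINTS  # 478
BLENDSHAPE_COUNT = 52


@dataclass
class FaceMap:
    """Hypothetical per-frame container for the three landmarker outputs."""
    landmarks: List[Point3D]       # 478 (x, y, z) points
    blendshapes: Dict[str, float]  # 52 expression scores, each from 0 to 1
    transform: List[List[float]]   # head position matrix, assumed 4x4

    def validate(self) -> None:
        """Raise ValueError if the frame's data does not have the expected shape."""
        if len(self.landmarks) != TOTAL_POINTS:
            raise ValueError(f"expected {TOTAL_POINTS} landmarks, got {len(self.landmarks)}")
        if len(self.blendshapes) != BLENDSHAPE_COUNT:
            raise ValueError(f"expected {BLENDSHAPE_COUNT} blendshapes, got {len(self.blendshapes)}")
        for name, score in self.blendshapes.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"blendshape {name!r} out of range: {score}")
        if len(self.transform) != 4 or any(len(row) != 4 for row in self.transform):
            raise ValueError("expected a 4x4 transformation matrix")

    def face_surface(self) -> List[Point3D]:
        """Points covering the face surface (assumed to come first)."""
        return self.landmarks[:FACE_SURFACE_POINTS]

    def iris(self) -> List[Point3D]:
        """Iris points used by gaze effects (assumed to come last)."""
        return self.landmarks[FACE_SURFACE_POINTS:]
```

A beauty filter would mostly read face_surface(), a gaze effect would read iris(), and an AR effect would read transform, which is the "one map, three jobs" idea expressed as a data structure.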

Figure 2. Inside the MediaPipe Face Landmarker: a detector finds the face, a mesh model outputs 478 landmarks, and two more outputs give expression scores and head position.

How A Beauty Filter Actually Works

A beauty filter feels like magic and is mostly two plain operations anchored to the mesh: smoothing the skin and reshaping a few features. Neither is mysterious once you see how the map guides them.

Skin smoothing is the headline effect, and it uses a specific tool called a bilateral filter. An ordinary blur softens everything, which would erase eyelashes and turn the face to mush. A bilateral filter is a smarter blur: it averages a pixel only with neighbors of similar brightness, so it smooths an even patch of cheek while leaving sharp edges — the line of an eyelash, the border of a lip — untouched. The mesh tells the filter where the skin is, so it smooths cheeks and forehead but skips the eyes, mouth, and hair. The result keeps detail where the eye expects detail and softens only the flat skin areas.
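The smoothing idea can be sketched in a few lines of plain Python. This is a teaching version on a small grayscale grid, not production code: real filters run on the graphics card or through an image library. The function name, the parameter defaults, and the boolean skin_mask (standing in for the region the mesh marks as skin) are all illustrative assumptions.

```python
import math
from typing import List, Optional

Image = List[List[float]]  # grayscale pixel values, 0 to 255


def bilateral_filter(
    img: Image,
    radius: int = 2,
    sigma_space: float = 2.0,
    sigma_range: float = 25.0,
    skin_mask: Optional[List[List[bool]]] = None,
) -> Image:
    """Smooth each pixel using nearby pixels of similar brightness only.

    skin_mask is a hypothetical stand-in for the mesh-derived skin region:
    pixels where the mask is False are copied through untouched.
    """
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if skin_mask is not None and not skin_mask[y][x]:
                continue  # not skin (eyes, mouth, hair): leave as is
            center = img[y][x]
            total = 0.0
            weight_sum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    neighbor = img[ny][nx]
                    # Closer pixels count more (the ordinary blur part).
                    spatial = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
                    # Pixels of similar brightness count more (the edge-keeping part).
                    similar = math.exp(-((neighbor - center) ** 2) / (2 * sigma_range ** 2))
                    weight = spatial * similar
                    total += weight * neighbor
                    weight_sum += weight
            out[y][x] = total / weight_sum  # weight_sum >= 1: the center pixel always counts
    return out
```

On a patch of slightly noisy "cheek" next to a much brighter region, the noise is averaged away while the boundary between the two regions stays sharp, because pixels across the boundary receive almost no weight.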

Reshaping is the second operation, and it is geometry. Because the filter holds the face as a mesh of points, it can move chosen points a little and stretch the image to follow. To slim a jaw, it pulls the jawline points inward; to enlarge eyes, it pushes the points around each eye outward. The movements are small, and that smallness is the whole craft. Suppose the face is 400 pixels wide and the filter slims the jaw by three percent of that width. The math is one line:

jaw shift = 3% × 400 px
jaw shift = 0.03 × 400
jaw shift = 12 px inward on each side

Twelve pixels is enough to read as "slimmer" and small enough that the person still looks like themselves. Push it to thirty percent and you get the uncanny, melted look that draws complaints. A good beauty filter lives in that narrow band, and the band is why "subtle" is an engineering target, not just a taste.
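The same arithmetic can be written as a short Python sketch that moves jawline points toward the face's vertical center line. The function names and the 5 percent safety cap are hypothetical choices for illustration; the sketch only moves the points and leaves out the image-warping step that stretches pixels to follow them.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels


def jaw_shift_px(face_width_px: float, slim_fraction: float) -> float:
    """Shift per side in pixels, e.g. 0.03 * 400 px = 12 px."""
    return slim_fraction * face_width_px


def slim_jaw(
    jaw_points: List[Point],
    center_x: float,
    face_width_px: float,
    slim_fraction: float = 0.03,
    max_fraction: float = 0.05,  # hypothetical cap to keep the effect subtle
) -> List[Point]:
    """Pull jawline points inward toward the vertical center line of the face."""
    if not 0.0 <= slim_fraction <= max_fraction:
        raise ValueError(f"slim_fraction must be between 0 and {max_fraction}")
    shift = jaw_shift_px(face_width_px, slim_fraction)
    moved: List[Point] = []
    for x, y in jaw_points:
        if x < center_x:
            moved.append((min(x + shift, center_x), y))  # left side moves right
        elif x > center_x:
            moved.append((max(x - shift, center_x), y))  # right side moves left
        else:
            moved.append((x, y))  # chin point on the center line stays put
    return moved
```

Rejecting large values in code, rather than trusting a slider, is one way to turn "subtle" into an enforced engineering target.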

The third, lightest touch is color: gently brightening the eyes, whitening teeth, or warming skin tone, each one masked to the right region by the mesh. Stack these — smooth, reshape, recolor — and you have the full effect, all of it riding on the same landmark map.

Gaze Correction: Making Eyes Meet The Camera

Gaze correction solves a problem built into every video call. Your camera sits above your screen, but you look at the face on the screen, not at the lens — so to the other person you always appear to be looking slightly down. Over a long call this quietly drains the sense of contact. Gaze correction fixes it by redrawing your eyes so they appear to meet the camera while you keep looking at the screen.

There are two ways this ships, and the difference matters for what you build. The first is a cloud or server feature, the clearest example being NVIDIA's Maxine Eye Contact. It works by cropping the small rectangle around your eyes — the "eye patch" — using a face-tracking step that finds your eye landmarks and head angle. A neural network then estimates where your eyes are actually pointing, synthesizes a version pointing at the camera, and blends that patch back into the frame. A nice detail shows the care in it: when your gaze drifts too far away, the effect eases back to your real eyes instead of snapping, so it never looks like a glitch. Maxine runs on NVIDIA graphics cards, which means you either ship it on a user's NVIDIA-equipped machine or run it on your own servers — the same build-versus-buy question we cover in the Krisp, Maxine, and Dolby article.
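Two pieces of that pipeline are simple enough to sketch in Python: cropping the eye patch from landmark positions, and easing the correction off as the gaze drifts. This is not NVIDIA's code. The function names, the padding value, and the 15 and 25 degree thresholds are assumptions chosen only to illustrate the behaviour, and the eye landmarks are assumed to be in pixel coordinates already.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def eye_patch_box(eye_points: List[Point], frame_w: int, frame_h: int,
                  padding: float = 0.5) -> Box:
    """Padded bounding rectangle around a set of eye landmarks, clamped to the frame."""
    xs = [p[0] for p in eye_points]
    ys = [p[1] for p in eye_points]
    pad_x = (max(xs) - min(xs)) * padding
    pad_y = (max(ys) - min(ys)) * padding
    left = max(0, math.floor(min(xs) - pad_x))
    top = max(0, math.floor(min(ys) - pad_y))
    right = min(frame_w, math.ceil(max(xs) + pad_x))
    bottom = min(frame_h, math.ceil(max(ys) + pad_y))
    return (left, top, right, bottom)


def correction_strength(gaze_angle_deg: float,
                        full_until_deg: float = 15.0,  # hypothetical threshold
                        off_after_deg: float = 25.0    # hypothetical threshold
                        ) -> float:
    """1.0 means fully corrected eyes, 0.0 means the real eyes, with a smooth ramp between."""
    angle = abs(gaze_angle_deg)
    if angle <= full_until_deg:
        return 1.0
    if angle >= off_after_deg:
        return 0.0
    t = (angle - full_until_deg) / (off_after_deg - full_until_deg)
    return 1.0 - t * t * (3.0 - 2.0 * t)  # smoothstep, so there is no visible snap


def blend_pixel(real: float, corrected: float, strength: float) -> float:
    """Mix one pixel of the synthesized patch back over the real frame."""
    return real + (corrected - real) * strength
```

Blending by a strength that changes gradually from frame to frame is what makes the effect fade back to the real eyes instead of switching off abruptly.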

The second way is the phone doing it for you. On an iPhone, Apple's FaceTime feature called Eye Contact uses the front TrueDepth camera — the same depth sensor behind Face ID — to build a 3D map of your face, locate the eyes, and adjust them in real time. It has been on by default on recent iPhones for years. The catch for a product builder is that this only helps inside FaceTime and only on Apple hardware; you cannot call it from your own web app. So the practical rule is simple: if your users are on the web or on mixed devices and you need gaze correction everywhere, you reach for a model like Maxine; if they happen to be in FaceTime on a recent iPhone, Apple already did it and you do nothing.

AR Effects: Anchoring Objects To The Face

The third effect is the one people mean by "filters" in the playful sense — virtual glasses, a hat, dog ears, a tried-on lipstick shade. Inside the W3C standards that govern web media, this whole category has a charming official name: the "Funny Hats" use case. The engineering is anchoring: keeping a virtual object glued to the face as it moves and turns.

This is where the mesh's 3D shape and the transformation matrix earn their place. To place virtual sunglasses, the effect reads the landmark points for the bridge of the nose and the temples, reads the head's angle from the transformation matrix, and draws the glasses model at that position and tilt. When you turn your head, the matrix updates, and the glasses turn with you because they are pinned to the map, not painted onto a fixed spot on the screen. Blendshapes add life on top: if the effect makes your avatar's mouth open when yours does, it is reading the jaw-open blendshape score and feeding it to the avatar's mouth.
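The anchoring step reduces to multiplying each point of the virtual object by the head's transformation matrix. The Python sketch below assumes a 4x4 matrix stored as nested row-major lists and applied to column vectors, which may differ from the layout a given library returns. The helper names are hypothetical, head_pose_matrix exists only to produce test input, and the "jawOpen" key assumes ARKit-style blendshape names.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Matrix4 = List[List[float]]  # assumed row-major, applied to column vectors


def transform_point(m: Matrix4, p: Vec3) -> Vec3:
    """Apply rotation and translation from a 4x4 matrix to one 3D point."""
    x, y, z = p
    rows = [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(3)]
    return (rows[0], rows[1], rows[2])


def place_object(model_points: List[Vec3], face_transform: Matrix4) -> List[Vec3]:
    """Move every vertex of a virtual object (glasses, hat) into the head's pose."""
    return [transform_point(face_transform, p) for p in model_points]


def head_pose_matrix(yaw_deg: float, tx: float = 0.0, ty: float = 0.0,
                     tz: float = 0.0) -> Matrix4:
    """Hypothetical helper: a head turned left or right by yaw_deg, then translated."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, 0.0, s, tx],
        [0.0, 1.0, 0.0, ty],
        [-s, 0.0, c, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def avatar_mouth_open_deg(blendshapes: Dict[str, float],
                          max_open_deg: float = 30.0) -> float:
    """Drive an avatar's jaw angle from the jaw-open score (key name assumed)."""
    score = min(max(blendshapes.get("jawOpen", 0.0), 0.0), 1.0)
    return score * max_open_deg
```

Because the object's vertices are recomputed from the matrix every frame, a head turn moves the glasses automatically; nothing in the effect stores a screen position.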

Rendering the object is a graphics job, and on the web it runs on a technology called WebGPU — a browser standard, reaching wide support across Chrome, Safari, and Firefox in 2026, that lets a web page use the computer's graphics card directly. The face map says where and at what angle; WebGPU draws the object fast enough to keep up with live video. The same machinery powers the "virtual try-on" features in beauty and eyewear e-commerce, which are AR face effects wearing a commerce hat.

Where All This Runs In A WebRTC Call

Deciding which effect you want is the easy half. Wiring it into a live browser call is the half that trips teams up, and it comes down to one modern browser capability. WebRTC, the technology that lets browsers send audio and video to each other directly, historically handed your camera straight to the network with no place for your code to touch the picture. The fix is a standard called Insertable Streams, defined in the W3C document "MediaStreamTrack Insertable Media Processing using Streams."

The pattern has three parts, and it is worth understanding even at a high level because it shapes cost and quality. A piece called the MediaStreamTrackProcessor takes your camera feed and exposes it as a stream of individual frames your code can read — each one a VideoFrame object, the standard container for a single picture. Your code processes each frame: run the Face Landmarker on it, then smooth, redraw, or draw your object. A second piece, the VideoTrackGenerator, takes your finished frames and turns them back into a normal camera track that WebRTC can send. Camera goes in as frames, your effect runs, frames go back out — and the far end sees the effect, not the raw camera.

One detail in the standard is a quiet rule that decides whether your app stays smooth: this processing is meant to run in a worker, a background thread separate from the one that draws your app's buttons and menus. Heavy per-frame work on the main thread freezes the interface; the same work in a worker leaves the app responsive. The standard's own examples put the processing in a worker for exactly this reason. The broader set of these browser-AI hooks — Insertable Streams, Encoded Transform, WebGPU — is the subject of our WebRTC AI integration article; here the point is narrower: this is the slot your face effect lives in.
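The worker rule can be modelled outside the browser. The Python sketch below is an analogy built on threads and a queue, not the Insertable Streams API: the "main" thread hands frames to a background worker and never waits for the effect, dropping a frame if the worker is still busy. Frames are plain dictionaries standing in for VideoFrame objects, and every name here is hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple

Frame = Dict[str, object]  # hypothetical stand-in for a VideoFrame


def run_effect_in_worker(
    frames: List[Frame],
    effect: Callable[[Frame], Frame],
    max_pending: int = 2,
) -> Tuple[List[Frame], int, int]:
    """Process frames on a background thread.

    Returns (processed frames, number of dropped frames, worker thread id).
    """
    inbox: "queue.Queue" = queue.Queue(maxsize=max_pending)
    processed: List[Frame] = []
    failed: List[Exception] = []
    worker_ids: List[int] = []

    def worker() -> None:
        worker_ids.append(threading.get_ident())
        while True:
            frame = inbox.get()
            if frame is None:  # sentinel: the camera track has ended
                break
            try:
                processed.append(effect(frame))
            except Exception as exc:  # a bad frame must not kill the worker
                failed.append(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    dropped = 0
    for frame in frames:
        try:
            inbox.put_nowait(frame)  # never block the interface thread
        except queue.Full:
            dropped += 1  # worker is behind: skip this frame rather than stall
    inbox.put(None)  # blocking put is fine here: shutdown, not per-frame work
    thread.join()
    return processed, dropped + len(failed), worker_ids[0]
```

The design choice worth noticing is the drop-on-full policy: when the effect falls behind, the call loses a frame instead of freezing the interface, which is the trade a live video product usually wants.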

Figure 3. Where a face effect attaches in WebRTC: frames are pulled with MediaStreamTrackProcessor, transformed in a worker, and rebuilt into a sendable track with VideoTrackGenerator.

The One Number That Decides Everything: The Frame Budget

Every face effect lives or dies on a single piece of arithmetic, and it is arithmetic a non-technical reader can do. Live video runs at a frame rate — a number of pictures shown per second, usually written as fps. At a common 30 frames per second, each frame is on screen for a fixed slice of time, and your effect must finish all its work inside that slice or frames pile up and the video stutters. The slice is one line:

time per frame = 1000 ms ÷ 30 fps
time per frame = 33.3 ms

So you have about 33 milliseconds, one-thirtieth of a second, to read the frame, find the face, apply the effect, and hand the frame back. Everything you add spends from that budget. Finding the landmarks is the biggest cost, and here the hardware choice is decisive: run the Face Landmarker on the computer's main processor and a browser typically manages only 10 to 15 frames per second, which blows the budget; move it to the graphics card and it clears 60 frames per second comfortably, leaving room to spare. The 2019 research behind the face mesh reported the bare model running at hundreds of frames per second on mobile graphics chips, so the model is rarely the bottleneck; the bottleneck is whether you let it use the graphics card.
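The budget check is easy to automate. In the Python sketch below, the function names and the per-stage costs in the usage note are hypothetical; only the 30 fps budget and the 10 to 15 fps main-processor range come from the text above.

```python
from typing import Dict, Union


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, e.g. 1000 ms / 30 fps = 33.3 ms."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def achievable_fps(cost_per_frame_ms: float) -> float:
    """Frame rate a pipeline can sustain if each frame costs this long."""
    if cost_per_frame_ms <= 0:
        raise ValueError("cost must be positive")
    return 1000.0 / cost_per_frame_ms


def budget_report(fps: float, stage_costs_ms: Dict[str, float]) -> Dict[str, Union[float, bool]]:
    """Add up per-stage costs and compare them with the frame budget."""
    budget = frame_budget_ms(fps)
    spent = sum(stage_costs_ms.values())
    return {
        "budget_ms": budget,
        "spent_ms": spent,
        "headroom_ms": budget - spent,  # negative means frames will pile up
        "fits": spent <= budget,
    }
```

With illustrative costs of 2 ms to read the frame, 12 ms for landmarks on the graphics card, 6 ms for the effect, and 3 ms to hand the frame back, the report shows about 10 ms of headroom at 30 fps. Swap the landmark cost for 80 ms, which corresponds to 12.5 frames per second on the main processor, and the same report shows the budget blown several times over.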

This is why the earlier WebGPU detail is not a footnote. Doing the face math on the graphics card is what keeps you inside the 33-millisecond budget with time left for rendering and encoding. A team that runs the landmarker on the main processor "to keep it simple" ships a stuttering call; the same team on the graphics card ships a smooth one. For the full picture of how every millisecond in a live call is spent, see the sub-100-millisecond latency budget article — face effects are one line item in that budget, and a hungry one.

Figure 4. The 33-millisecond frame budget at 30 fps: finding the face dominates the cost, and putting it on the graphics card is what keeps the whole effect inside the deadline.

Build With MediaPipe Or Buy An SDK?

Once you know the effects share one map, the real decision is who builds that map and the effects on top of it: you, with free tools, or a vendor, for a fee. Both are legitimate; the right call depends on how custom your effects are and how fast you need to ship.

Building with MediaPipe means you get the landmark map for free and write the beauty, gaze, or AR logic yourself. You own the result, pay no per-user fee, and keep all video on the user's device, which is a real privacy advantage for telemedicine and similar fields. The cost is engineering time and the absence of a content library — MediaPipe gives you the map, not a catalogue of polished lenses. Buying an SDK from a vendor like Banuba, Snap Camera Kit, or DeepAR means you get the map plus a library of ready-made filters, a tool to author new ones, and cross-platform support, in exchange for a licence fee and a dependency on that vendor.

The vendors differ in ways that matter. Snap's Camera Kit brings the actual Snapchat lens technology and its huge creator ecosystem to your app, at the price of an approval process and less control. Banuba targets fast commercial deployment with predictable per-platform pricing that does not climb as your user count grows. DeepAR offers usage-based pricing that starts low. Gaze correction sits slightly apart from this group: it is less a "filter" and more a single specialized model, which is why it usually comes from NVIDIA Maxine or, for free on Apple hardware, from the phone itself.

Option	What you get	What it costs	Best fit	Watch out for
Build on MediaPipe Face Landmarker	Free 478-point map; you write the effects	Engineering time; no filter library	On-device privacy, custom effects, no per-user fee	You build and maintain the effect logic
Snap Camera Kit	Snapchat lenses + Lens Studio authoring	Licence; approval process	Consumer apps wanting proven viral lenses	Approval gate; less low-level control
Banuba Face AR SDK	Map + filter library, cross-platform	Per-platform licence	Fast production launch across iOS/Android/Web	Licence cost vs free MediaPipe
DeepAR	Map + face/body filters, web + mobile	Usage-based fee	Lean teams starting small	Vendor roadmap certainty
NVIDIA Maxine Eye Contact	Gaze-correction model	GPU + licence	Web/cross-device gaze correction	Needs NVIDIA GPUs (client or server)
Apple FaceTime Eye Contact	Built-in gaze correction	Free, built in	Native FaceTime on recent iPhones	FaceTime-only; you can't call it

Read the table by your constraints. If your effects are custom and privacy matters, building on MediaPipe is the honest default. If you want a library of lenses live next week, an SDK earns its fee. If you only need gaze correction, treat it as its own small decision between Maxine and "the phone already does it."

A Common Mistake: Processing On The Main Thread

The single most frequent way teams ship a broken face effect is not a bad model — it is running the per-frame work on the wrong thread. The browser has one main thread that both draws your app's interface and, if you let it, runs your effect. Pile heavy face-tracking and rendering onto that thread and it has no time left to redraw buttons, so the whole app stutters, clicks lag, and the video judders even though the model itself is fast.

The symptom is distinctive and easy to misdiagnose. The effect "works" in a quick test on a powerful laptop, then falls apart on a normal user's machine, and the team blames the model and starts shopping for a faster one. The real fix is structural: move the processing into a worker, the background thread the Insertable Streams standard expects you to use. The model does not need to be faster; the work needs to be off the thread that draws the screen. Teams that learn this once never unlearn it, and teams that skip it ship a feature that demos well and fails in the field.

The Regulatory Line You Cannot Cross

There is a place where a face effect stops being a cosmetic feature and becomes a regulated one, and crossing it by accident is a real risk. The European Union's AI Act — the bloc's comprehensive AI law, formally Regulation (EU) 2024/1689 — draws a sharp line between changing how a face looks and reading what a face feels.

Beautifying, gaze-correcting, and putting a hat on a face are not regulated as risky AI; they change appearance and infer nothing about the person. But remember the 52 blendshape scores — the dials that measure how much the face is smiling, frowning, or raising an eyebrow. The moment a system uses those signals to infer an emotional state — "this user seems happy", "this student looks disengaged" — it becomes an emotion recognition system in the law's terms, and that category carries hard rules. Using emotion recognition in workplaces and schools is prohibited outright, with narrow exceptions, and that ban has been in force since February 2025. Even where it is allowed, from August 2026 anyone exposed to such a system must be told it is running, before it runs.

The practical takeaway is a boundary to design against, not a reason to avoid the technology. Read the face to redraw it, and you are in safe territory. Read the face to judge the person's feelings — engagement scoring in an e-learning product, mood detection in a hiring call — and you have walked into the AI Act's strictest zone. For the related rules on detecting and recognizing faces, our face detection under the EU AI Act article covers the identity side of the same law. None of this is legal advice; it is the map of where to send your own lawyer before you ship an effect that touches expression.

Where Fora Soft Fits In

We build the live-video products where these effects show up in real use — video conferencing platforms, telemedicine consultations, e-learning classrooms, dating and social apps, and AR/VR experiences — and we treat beauty, gaze, and AR as one capability with three faces rather than three features. The pattern we apply is the one this article argues for: build the landmark map once, keep the per-frame work in a worker on the graphics card so the call stays smooth, and pick build-versus-buy per product — free MediaPipe when effects are custom and privacy matters, a vendor SDK when a launch needs a lens library next week. On gaze correction we decide by platform, trusting the phone where it already does the work and reaching for a server-side model where users are on the web. And we draw the emotion-recognition line early, because in telemedicine and e-learning the temptation to "measure engagement" is exactly where a helpful feature can become a regulated one.


References

Google AI Edge. Face landmark detection guide — MediaPipe Solutions, last updated 2026-01-29, accessed 2026-06-02. https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker. First-party source for the MediaPipe Face Landmarker outputs (478 3D landmarks, 52 blendshape scores, facial transformation matrix), the three bundled models (BlazeFace short-range detector 192×192, FaceMesh-V2 256×256, Blendshape 1×146×2), the IMAGE/VIDEO/LIVE_STREAM running modes, and the note that smoothing is applied only when num_faces = 1.
Kartynnik, Y., Ablavatski, A., Grishchenko, I., Grundmann, M. (Google Research). Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs, arXiv:1907.06724, 2019, accessed 2026-06-02. https://arxiv.org/abs/1907.06724. Peer-reviewed/academic source for the original 468-vertex face mesh model, its design for face-based AR effects, and the reported super-real-time inference (100–1000+ FPS) on mobile GPUs. Establishes the academic origin of the mesh that the iris model later extends to 478 points.
W3C. MediaStreamTrack Insertable Media Processing using Streams (mediacapture-transform), Working Draft 15 January 2026, accessed 2026-06-02. https://www.w3.org/TR/mediacapture-transform/. Primary standards source for the MediaStreamTrackProcessor / VideoTrackGenerator API, the VideoFrame-based per-frame processing model, the "Funny Hats" use case framing, and the worker-thread processing examples — the slot where a WebRTC face effect attaches.
W3C. Media Capture and Streams (mediacapture-streams), Candidate Recommendation Draft 9 October 2025, accessed 2026-06-02. https://www.w3.org/TR/mediacapture-streams/. Primary standards source for getUserMedia, the MediaStreamTrack model (sources, sinks, constraints), and the camera-track foundation that Insertable Streams processes.
W3C. WebCodecs (VideoFrame), Working Draft 24 November 2025, accessed 2026-06-02. https://www.w3.org/TR/webcodecs/. Primary standards source for the VideoFrame object — the single-picture container the effect reads and writes per frame — and its explicit close() resource-management requirement.
W3C. WebGPU, Candidate Recommendation 2026, accessed 2026-06-02. https://www.w3.org/TR/webgpu/. Primary standards source for the browser graphics-card API used to render AR objects and accelerate inference; cited for its 2026 Candidate Recommendation status and cross-browser (Chrome, Safari, Firefox) support that makes GPU-accelerated face effects viable on the web.
European Union. Regulation (EU) 2024/1689 (Artificial Intelligence Act) — Article 5 (Prohibited AI practices) and Article 50 (Transparency obligations), accessed 2026-06-02. https://artificialintelligenceact.eu/article/5/. Primary legal source for the prohibition of emotion-recognition systems in workplaces and education (in force from 2 February 2025), the biometric-categorization limits, and the Article 50 transparency duty for emotion-recognition / biometric-categorization systems (applicable from 2 August 2026). Defines the line between altering a face (unregulated) and inferring emotion from it (regulated).
NVIDIA. Overview — NVIDIA NIM Maxine Eye Contact, accessed 2026-06-02. https://docs.nvidia.com/nim/maxine/eye-contact/latest/overview.html. First-party source for the Maxine Eye Contact gaze-redirection pipeline: eye-patch extraction via a face-tracking step (2D landmarks + 6DOF head pose), latent encode / redirect / blend-back, the natural-transition behaviour when gaze drifts, and GPU-based deployment.
NVIDIA Developer Blog. Improve Human Connection in Video Conferences with NVIDIA Maxine Eye Contact, accessed 2026-06-02. https://developer.nvidia.com/blog/improve-human-connection-in-video-conferences-with-nvidia-maxine-eye-contact/. First-party engineering source detailing the eye-contact problem (camera-above-screen gaze offset) and the redirection method.
Apple. Use FaceTime Eye Contact / Attention Correction and WWDC face-tracking sessions, accessed 2026-06-02. https://support.apple.com/guide/iphone/facetime-settings/ios. First-party source for Apple's FaceTime Eye Contact feature using the TrueDepth camera and ARKit to build a 3D face/depth map and adjust the eyes in real time, on by default on recent iPhones — FaceTime-only and not callable from third-party web apps.
TRTC / Tencent RTC Engineering. Beauty Filters Explained: The Digital Technology Behind Perfect Selfies, accessed 2026-06-02. https://trtc.io/blog/details/beauty-filters-explained. Vendor engineering source for the beauty-filter pipeline: face-landmark mesh and Delaunay triangulation, bilateral-filter skin smoothing with frequency separation, mesh-vertex reshaping (jawline, eyes, nose), and region-masked color correction.
Banuba. 5 Best Face Filter SDKs (Tested in 2026) and Best Face Tracking SDKs for Real-Time Video Conferencing in 2026, accessed 2026-06-02. https://www.banuba.com/blog/best-face-filters-sdks-comparison. Vendor comparison source for the commercial Face AR SDK landscape (Banuba, Snap Camera Kit, DeepAR, BytePlus Effects, FaceUnity), per-platform vs usage-based pricing models, and integration-speed claims. Treated as a vendor source; product claims attributed, not asserted as neutral fact.
Snap Inc. Camera Kit — Snap AR, accessed 2026-06-02. https://ar.snap.com/camera-kit. First-party source for Snap Camera Kit bringing Snapchat lens technology and Lens Studio authoring to third-party apps, and the scale of the Snapchat lens ecosystem.
Chrome for Developers. Insertable streams for MediaStreamTrack, accessed 2026-06-02. https://developer.chrome.com/docs/capabilities/web-apis/mediastreamtrack-insertable-media-processing. First-party browser-vendor source for the MediaStreamTrackProcessor/Generator availability (Chrome, Edge; experimental, polyfills for Firefox/Safari) and the background-effects processing pattern.
