Real-Time Noise Suppression In WebRTC — RNNoise WASM, DeepFilterNet, And Krisp AI Integration

Why This Matters

If your product runs live calls in the browser — video conferencing, telemedicine, an AI voice agent, a sales-coaching tool — background noise is the fastest way to make it feel cheap, and noise suppression is the fix. But in WebRTC the hard part is not which model to pick; it is where to put it, because the same model produces clean audio in one slot and broken, robotic audio in another. This article is for the product manager, founder, or engineering lead who has to decide how clean audio gets built into a WebRTC product and who has to talk to engineers about it without being bluffed by a demo. By the end you will understand the WebRTC audio path well enough to know why noise suppression belongs before the encoder, how RNNoise and Krisp AI actually attach to a live call, why running two suppressors at once makes things worse, and how to choose between building it yourself and buying a managed engine. If you want the deep comparison of how the models differ on speed, quality, and licensing, that lives in our companion article on real-time noise suppression models; this one is about plumbing them into WebRTC.

The WebRTC Audio Path, And The Three Homes For A Suppressor

To place a suppressor correctly you first need a mental map of where sound travels in a WebRTC call. The path is always the same. Your microphone captures sound. The browser cleans it a little. The audio is then compressed by a codec — almost always Opus, the codec WebRTC requires every browser to support (IETF, RFC 7874). The compressed audio crosses the network, often through a server called an SFU — a Selective Forwarding Unit, the relay that copies each person's stream to everyone else. At the far end the audio is decompressed and played to the listener's ear.

"Compressed" is the word that matters most here, so let us anchor it. Compressing audio means throwing away detail to make the data small enough to send — like saving a photo as a small JPEG instead of the full-size original. Once audio is compressed by Opus, the individual sound samples are gone; what remains is a dense package of bytes that a decoder can turn back into approximate sound. Hold that thought, because it decides everything below.

A noise suppressor can be inserted in exactly three places along that path. The first is the built-in suppressor the browser already runs at capture, before anything else. The second is a custom model you insert yourself, after capture but before the Opus encoder compresses the sound. The third is on the server — inside the SFU, inside an AI agent that receives the call, or at the phone-line gateway for calls that come in over the traditional telephone network. Each home has different costs, and most of this article is about choosing among them.

Figure 1. A WebRTC call has exactly three places to suppress noise: the built-in browser suppressor (A), a custom model before the encoder (B), or a model on the server (C).

Start With What You Already Have: The Built-In Suppressor

Before you integrate anything, check whether the suppressor you already get for free is enough. Every WebRTC browser ships an audio-cleanup stage at capture that includes noise suppression, echo cancellation, and automatic gain control. You switch the noise part on with a single option when you ask for the microphone.

// Ask for the mic with the browser's built-in noise suppression and echo cancellation on.
navigator.mediaDevices.getUserMedia({
  audio: { noiseSuppression: true, echoCancellation: true },
  video: false
});

This noiseSuppression option is part of the W3C Media Capture and Streams standard — the specification that defines how a web page gets a microphone (W3C, Media Capture and Streams). When you set it to true, the browser applies its own built-in suppressor. It is free, it adds no integration work, and the audio never leaves the device.
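A constraint written as a bare value is a request, and the browser may or may not honour it, so it can help to read back what was actually applied. The sketch below uses the standard MediaStreamTrack.getSettings() method; the helper name describeBuiltInProcessing is hypothetical and exists only for illustration.

```javascript
// Hypothetical helper: report which built-in cleanup stages the browser
// says it applied to a captured microphone track.
function describeBuiltInProcessing(track) {
  const settings = track.getSettings(); // standard MediaStreamTrack method
  return {
    // `undefined` means the browser does not report this property at all.
    noiseSuppression: settings.noiseSuppression,
    echoCancellation: settings.echoCancellation,
    autoGainControl: settings.autoGainControl,
  };
}

// Browser usage (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia({
//     audio: { noiseSuppression: true, echoCancellation: true },
//     video: false,
//   });
//   const [micTrack] = stream.getAudioTracks();
//   console.log(describeBuiltInProcessing(micTrack));
```

If noiseSuppression comes back false or undefined, the free baseline is not actually running on that device, which is worth knowing before you judge its quality.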

The catch is quality. The built-in suppressor is tuned for ordinary two-way conversation in a moderate room. It handles a steady fan or hum well, but it is weaker than a modern dedicated model on hard cases — overlapping background voices, sudden clatter, a busy café. For many products the built-in baseline is genuinely enough, and the correct engineering decision is to ship it and move on. You reach past it only when users complain about noise it misses, or when you need the same behaviour across every browser instead of each browser's own version. (For the rule of trying the built-in baseline first and the quality differences between models, see the models article.)

Why Noise Suppression Cannot Live In An Encoded Transform

When teams decide to add a custom suppressor, the first wrong turn is reaching for the wrong API. WebRTC exposes a feature called Encoded Transform that lets you run your own code on the media as it flows through a call, and it sounds like the perfect hook. It is the wrong one for audio cleanup, and understanding why teaches you the core rule.

The WebRTC Encoded Transform standard runs your code on encoded frames — the audio after the Opus codec has already compressed it, sitting between the encoder and the part that packages it for the network (W3C, WebRTC Encoded Transform). Remember the JPEG analogy: by the time audio reaches this stage, the individual sound samples have been thrown away and replaced with compressed bytes. A noise suppressor works by examining the actual sound — which frequencies are voice, which are noise — and turning the noise down. It cannot do that to a package of compressed bytes any more than you can erase a person from a photo by editing the file's compression data. Encoded Transform is built for things that operate on the compressed package as a whole, such as end-to-end encryption. It is the wrong layer for noise suppression.

The right layer is raw audio — the uncompressed stream of sound samples, before the encoder touches them. The browser gives you two ways to reach raw audio. The older, widely supported way is the Web Audio API's AudioWorklet, a small audio-processing unit that the browser runs in a dedicated audio thread (W3C, Web Audio API). The newer way is a MediaStreamTrackProcessor, which exposes each chunk of microphone audio as a raw AudioData object you can read and modify (W3C, MediaStreamTrack Insertable Media Processing). Both put your code in the right place: after capture, on raw samples, before the Opus encoder. That is insertion point B from Figure 1, and it is where every custom suppressor belongs.

Figure 2. Noise suppression needs raw samples, so it belongs in an AudioWorklet before the encoder — never in an Encoded Transform, which only sees compressed bytes.
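For the MediaStreamTrackProcessor route, a minimal sketch looks like the following. It assumes the Chromium-style API names (MediaStreamTrackProcessor, MediaStreamTrackGenerator) and the WebCodecs AudioData interface; browser support varies, so treat availability as something to verify. The helper names cleanAudioData and createCleanTrack and the cleanSamples callback are hypothetical stand-ins for your own model, and the sketch processes only the first channel.

```javascript
// Hypothetical helper: clean one chunk of raw microphone audio.
// `cleanSamples` stands in for your model; it edits a Float32Array in place.
// Assumes mono processing: only the first channel is read and sent onward.
function cleanAudioData(audioData, cleanSamples, AudioDataCtor = globalThis.AudioData) {
  const samples = new Float32Array(audioData.numberOfFrames);
  audioData.copyTo(samples, { planeIndex: 0, format: 'f32-planar' });
  cleanSamples(samples);
  const cleaned = new AudioDataCtor({
    format: 'f32-planar',
    sampleRate: audioData.sampleRate,
    numberOfFrames: audioData.numberOfFrames,
    numberOfChannels: 1,
    timestamp: audioData.timestamp,
    data: samples,
  });
  audioData.close(); // release the original chunk
  return cleaned;
}

// Browser wiring (assumed Chromium-style API; not executed here).
function createCleanTrack(micTrack, cleanSamples) {
  const processor = new MediaStreamTrackProcessor({ track: micTrack });
  const generator = new MediaStreamTrackGenerator({ kind: 'audio' });
  const transformer = new TransformStream({
    transform(audioData, controller) {
      controller.enqueue(cleanAudioData(audioData, cleanSamples));
    },
  });
  processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
  return generator; // a track carrying the cleaned audio, still raw, still pre-encoder
}
```

Either way the code sees uncompressed samples. The sketch makes no assumption about chunk size, so a model with a fixed frame length still needs its own buffering.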

Wiring RNNoise Into The Browser: The AudioWorklet Pattern

The cleanest example of building a custom suppressor into a WebRTC call is RNNoise running in an AudioWorklet, and it is not theoretical — it is exactly how the open-source Jitsi Meet conferencing platform ships noise suppression. Walking through their approach shows every moving part.

RNNoise is a tiny open-source noise suppressor written in the C programming language. To run it in a browser, you compile it to WebAssembly — usually shortened to WASM — a format that lets non-browser code like C run at near-native speed inside a web page (Jitsi, 2022). The compiled module is a few hundred kilobytes, small enough to load instantly.

The reason it goes in an AudioWorklet rather than older audio code is responsiveness. The browser's earlier audio-processing hook ran on the main thread — the same thread that draws the page and handles clicks — so any hiccup in the interface could distort the audio. An AudioWorklet runs on a separate, dedicated audio thread, so the sound is processed without interference from the rest of the page (W3C, Web Audio API). Jitsi moved from the old main-thread approach to an AudioWorklet precisely because, once you are modifying the audio rather than just measuring it, any delay or stutter becomes audible (Jitsi, 2022).

There is one piece of arithmetic every team hits here, and it is worth showing because it explains a real chunk of integration work. An AudioWorklet hands your code audio in fixed blocks of 128 samples at a time — a size you cannot change (W3C, Web Audio API). RNNoise, however, wants to process 480 samples per call (Jitsi, 2022). At WebRTC's standard 48,000 samples per second, those two numbers translate to time like this:

128 samples ÷ 48,000 samples/sec = 0.00267 sec ≈ 2.67 ms  (what the worklet delivers)
480 samples ÷ 48,000 samples/sec = 0.01000 sec = 10.0 ms   (what RNNoise wants)

The block the browser delivers (2.67 ms) is smaller than the block the model needs (10 ms), so you cannot simply hand each block straight to RNNoise. The fix is a holding area called a circular buffer: you accumulate the small 128-sample blocks until you have 480, send that to RNNoise to clean, then pass the cleaned audio onward — keeping any leftover samples for the next round (Jitsi, 2022). Once the worklet is producing clean audio, you swap the cleaned audio in for the original microphone track, and from there the call proceeds normally: Opus encodes the clean audio, and everyone hears the difference.
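The buffering step can be sketched as a small adapter class. This is an illustration of the idea, not Jitsi's actual code: the class name FrameAdapter and the processFrame callback are hypothetical, and the ring size of 1,920 samples is chosen here because it is the lowest common multiple of 128 and 480, so no block ever straddles the end of the buffer.

```javascript
const RENDER_QUANTUM = 128; // samples per AudioWorklet block
const MODEL_FRAME = 480;    // samples per model call
const RING_SIZE = 1920;     // lowest common multiple of 128 and 480

class FrameAdapter {
  constructor(processFrame) {
    this.processFrame = processFrame; // hypothetical: cleans a 480-sample Float32Array in place
    this.ring = new Float32Array(RING_SIZE);
    this.written = 0; // total samples received
    this.cleaned = 0; // total samples passed through the model
    this.read = 0;    // total samples emitted
  }

  // Takes one 128-sample input block and fills one 128-sample output block.
  // Returns false, and outputs silence, while cleaned audio is still short.
  process(input, output) {
    this.ring.set(input, this.written % RING_SIZE);
    this.written += input.length;

    // Clean every complete 480-sample frame that has accumulated.
    while (this.written - this.cleaned >= MODEL_FRAME) {
      const start = this.cleaned % RING_SIZE;
      this.processFrame(this.ring.subarray(start, start + MODEL_FRAME));
      this.cleaned += MODEL_FRAME;
    }

    if (this.cleaned - this.read < output.length) {
      output.fill(0);
      return false;
    }
    const start = this.read % RING_SIZE;
    output.set(this.ring.subarray(start, start + output.length));
    this.read += output.length;
    return true;
  }
}
```

Inside an AudioWorkletProcessor you would call adapter.process(inputs[0][0], outputs[0][0]) from the process() callback. In this sketch the output lags the input by 512 samples once it settles, roughly 10.7 ms at 48 kHz, after a few silent blocks at start-up. That delay is the price of the size mismatch and counts against the call's latency budget.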

// Sketch: route the mic through a noise-suppression AudioWorklet, then send the cleaned track.
const ctx = new AudioContext({ sampleRate: 48000 });
await ctx.audioWorklet.addModule('rnnoise-worklet.js'); // your compiled WASM + worklet
const mic = await navigator.mediaDevices.getUserMedia({
  audio: { noiseSuppression: false, echoCancellation: true } // built-in NS OFF — see below
});
const source = ctx.createMediaStreamSource(mic);
const denoiser = new AudioWorkletNode(ctx, 'rnnoise-processor');
const sink = ctx.createMediaStreamDestination();
source.connect(denoiser).connect(sink);
const cleanTrack = sink.stream.getAudioTracks()[0]; // send this track over the peer connection

A Common Mistake: Running Two Suppressors At Once

Look closely at the code above and you will see noiseSuppression: false. That is not a typo — it is the single most important line, and getting it wrong is the most frequent mistake in this whole topic.

The trap is double suppression. If you leave the browser's built-in suppressor on and run your own model, the audio passes through two suppressors in a row. This is worse than either one alone for two reasons. The first is wasted compute: you are paying processor cost twice to do the same job, which drains laptop batteries and stutters on phones. The second is damaged audio: a modern suppressor is trained on raw, untouched sound, and feeding it audio that another suppressor has already altered confuses it and can produce strange, processed-sounding results. LiveKit, which licenses noise-cancellation models into its platform, states the rule plainly in its own documentation — do not enable a noise-cancellation model on top of audio that another model has already processed, because the models are trained on raw audio and may misbehave otherwise (LiveKit, 2026).

The fix is simple discipline: exactly one noise suppressor in the path. When you ship a custom model, set noiseSuppression: false so the browser's built-in one steps aside. When you rely on the built-in one, do not also add a custom model. Echo cancellation and automatic gain control are separate jobs and can usually stay on — the rule is one noise suppressor, not one audio process total. (Echo cancellation, a neighbouring but distinct problem, has its own deep dive in our echo cancellation in WebRTC article.)
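One way to keep that discipline is to generate the microphone constraints from a single flag, and to check the captured track before a custom model is attached. The sketch below is an assumption about how you might structure it; buildMicConstraints and hasDoubleSuppression are hypothetical names, while getSettings() is the standard track method.

```javascript
// Hypothetical helper: build getUserMedia constraints so that exactly one
// noise suppressor is ever in the path.
function buildMicConstraints({ customSuppressor }) {
  return {
    audio: {
      noiseSuppression: !customSuppressor, // built-in NS steps aside for a custom model
      echoCancellation: true,              // separate job, stays on
      autoGainControl: true,               // separate job, stays on
    },
    video: false,
  };
}

// Hypothetical guard: true when the browser reports its own suppressor as
// active while a custom model is also about to run on the same track.
function hasDoubleSuppression(micTrack, customSuppressorActive) {
  return customSuppressorActive && micTrack.getSettings().noiseSuppression === true;
}

// Browser usage (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia(
//     buildMicConstraints({ customSuppressor: true })
//   );
//   const [micTrack] = stream.getAudioTracks();
//   if (hasDoubleSuppression(micTrack, true)) console.warn('Two suppressors in the path');
```

The guard matters because a browser may not honour the request to switch its suppressor off; reading the settings back tells you what actually happened on that device.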

Figure 3. Leaving the built-in suppressor on while running a custom model double-processes the audio: wasted compute and distorted voice. Switch the built-in one off.

DeepFilterNet In The Browser: The Weight Problem

RNNoise is light enough to run in any browser. DeepFilterNet — the open-source model that delivers noticeably higher quality through a heavier two-stage design — is a harder fit, and the reason is size and tooling.

The usual way to run DeepFilterNet in a browser is through a machine-learning runtime called ONNX Runtime Web, and that runtime alone weighs about 11.8 megabytes before you add the model — a heavy download compared with RNNoise's few hundred kilobytes. It has also historically had trouble running inside an AudioWorklet, the very place real-time audio needs to live. Teams have worked around this with hand-built WebAssembly versions that are far smaller, and there are working DeepFilterNet integrations for WebRTC clients, but it remains more engineering effort than RNNoise for a browser deployment. The practical pattern in 2026: reach for RNNoise WASM when you want a free model fully in the browser, and consider DeepFilterNet when its higher quality justifies the extra weight — or run it on the server instead, where its size stops mattering. For how the two models compare on raw quality and latency, see the models article.

Client-Side Or Server-Side: Choose By Who You Control

The custom-model paths above all run in the user's browser — the client side. The third home, the server, is the right choice in a set of situations worth naming clearly, because the decision turns on a single question: do you control the device the audio comes from?

Running on the client, in the browser, has real advantages. The audio is cleaned before the lossy Opus encoder touches it, which is the best possible moment — you are denoising the original sound, not a compressed copy. The processing cost spreads across every user's own device, so it scales for free as you add participants, and the server stays cheap. This is the default for browser-to-browser conferencing.

Running on the server becomes necessary when you do not control the source device. The clearest case is the traditional phone network: when someone dials into your product from an ordinary phone over what engineers call SIP — the protocol that bridges internet calls to phone lines — there is no browser on their end to run a model, so the only place to clean that audio is your server. The same logic applies to AI voice agents: a bot that joins a call to transcribe or respond receives audio from many sources, and cleaning it centrally, on the way in, is simplest — and cleaner audio directly lifts the accuracy of any live captions or transcription built on top. Server-side also gives you one place to control quality and update models, at the cost of processor time you now pay for on every stream. Heavy GPU-based engines such as NVIDIA's Maxine audio effects live here, cleaning audio on server graphics cards before it moves to transcription or playback (NVIDIA, 2026).

Question	Client-side (browser)	Server-side (SFU / agent / SIP)
Where it runs	Each user's own device	Your servers
Relative to Opus encoder	Before encoding (cleans the original)	After decoding (cleans a compressed copy)
Who pays the compute	The user's device	You, per stream
Scales with users	Yes, automatically	No — your cost grows
Works for phone-in (SIP) callers	No (no browser on their end)	Yes
Best fit	Browser-to-browser conferencing	AI voice agents, phone bridges, central control

The honest summary: clean on the client when every participant is in a browser you control, and clean on the server when audio arrives from devices or networks you do not. Many production systems do both — client-side for browser users, server-side for phone-in callers.
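That rule is simple enough to write down as a routing decision. The sketch below is only an illustration; the participant shape and the function name chooseSuppressionSite are hypothetical, and a real system would derive the source from its own signalling data.

```javascript
// Hypothetical participant shape: { source: 'browser' | 'sip' | 'other' }.
// Returns where noise suppression should run for that participant.
function chooseSuppressionSite(participant) {
  if (participant.source === 'browser') {
    // We control the device: clean raw audio before the Opus encoder.
    return 'client';
  }
  // Phone-in callers and any source we do not control: clean on the way in.
  return 'server';
}

// A mixed call ends up using both sites.
const plan = [{ source: 'browser' }, { source: 'sip' }].map(chooseSuppressionSite);
// plan is ['client', 'server']
```

The point of making the decision per participant rather than per product is that one call can contain both kinds of caller.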

Krisp AI: The Managed Path Into WebRTC

So far the custom models have been ones you wire in and operate yourself. Krisp AI is the alternative: a managed, commercial engine you license and embed, where the vendor owns the model and the platform coverage. Because "Krisp" is the name most teams hear first, it is worth being precise about what it is and how it attaches to a WebRTC call.

Krisp is a Voice AI company, founded in 2017 and based in Berkeley, California, whose noise-cancellation software runs entirely on the device — by the company's own statement, audio is never uploaded to its servers (Krisp, 2026). Its technology powers noise cancellation inside Discord, RingCentral, Zoho, and, through partnerships, the calling platforms Twilio and Daily; in 2025 the company reported around $37.7 million in annual revenue (getLatka, 2026). Krisp ships as a software development kit — an SDK, a pre-built code library a developer drops into an app — for Windows, macOS, Linux, Android, iOS, and browsers, the last via a WebAssembly build. In a live call it adds roughly 25 milliseconds of delay for its standard model and about 15 for a lighter one, which keeps it inside a conversation's latency budget (Krisp, 2026).

How it attaches to WebRTC depends on the slot. In the browser, Krisp installs as an audio processor on the microphone track before the audio is sent — the same insertion point B as RNNoise, just managed for you. LiveKit, for example, exposes Krisp through a one-line track processor in its web SDK:

// LiveKit's managed Krisp filter: attach it to the local microphone track before publishing.
const { KrispNoiseFilter } = await import('@livekit/krisp-noise-filter');
const krisp = KrispNoiseFilter();
await localAudioTrack.setProcessor(krisp);   // cleans outbound audio in the browser
await krisp.setEnabled(true);

Twilio attaches at the same client-side slot: it loads the Krisp SDK alongside its Voice JS SDK and runs it as a pre-processing step between the microphone and the audio encoder, entirely on the device (Twilio, 2026). On the server, Krisp runs inside the platform: LiveKit and others run it on inbound audio for AI agents and switch it on at the phone gateway with a single krisp_enabled: true flag for SIP calls (LiveKit, 2026).

Krisp's most useful 2026 capability for conferencing is a second model called background voice cancellation, or BVC. Ordinary noise cancellation removes non-speech sound — fans, traffic, keyboards — while keeping every voice. BVC goes further and removes other people's voices, keeping only the main speaker. That matters in open offices and for AI voice agents, where a colleague talking nearby otherwise gets transcribed as if it were the speaker. The improvement is measurable: in LiveKit's published test, raw noisy audio produced a transcription word-error rate of 117.6 percent (worse than useless — the transcript had more errors than words), while Krisp's BVC model cut that to 23.5 percent and a competing engine, ai-coustics, reached 7.1 percent (LiveKit, 2026). The lesson is both that these models work and that there is now more than one serious vendor — Krisp is the best-known, but it is no longer the only managed option.
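A word-error rate above 100 percent looks impossible until you see how the metric is defined: substitutions, deletions, and insertions are all counted as errors and divided by the number of words in the reference transcript, so a transcript that picks up many extra words from background talkers can score more errors than there are reference words. A minimal sketch of the standard calculation follows; the function name is ours, the example sentences are invented, and none of this reproduces LiveKit's test data or tooling.

```javascript
// Word-error rate: word-level edit distance divided by reference length.
// Assumes a non-empty reference transcript.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);

  // prev[j] = edits needed to turn the first i-1 reference words
  // into the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Invented example: the speaker said three words, but a nearby voice
// added four more to the transcript. Four insertions over three words.
const noisy = wordErrorRate('turn it off', 'turn it off now please okay yes');
```

Here noisy comes out at 4 / 3, about 133 percent, which is the same effect behind the raw-audio figure above: most of the damage is extra words, and removing the other voices removes them.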

RNNoise WASM Versus Krisp AI: The WebRTC Build-Vs-Buy

Stripped to its essentials, the decision for a WebRTC product is a three-way fork, and it is different from the generic model comparison because it is about operations, not just quality.

Option	What you ship	Cost	Quality on hard noise	Cross-platform	Best for
RNNoise WASM	A model you compile and wire into an AudioWorklet	Free (BSD licence)	Good on steady noise, weaker on hard cases	You build each platform	Browser-first, tight budget, ordinary noise
DeepFilterNet	A heavier open model, browser or server	Free (MIT / Apache)	Higher, at more weight and latency	You build each platform	Quality-sensitive, can absorb the weight
Krisp AI (or ai-coustics)	A licensed managed SDK + processor	Paid	Production-grade; BVC removes other voices	All major platforms out of the box	Cross-platform, zero ops, voice agents, SIP

Read it as operations, not a leaderboard. RNNoise WASM is free and fully yours, but you compile it, wire the AudioWorklet, manage the buffering, and accept its quality ceiling on hard noise. Krisp (and now ai-coustics) costs money but hands you production quality across every platform, the background-voice-cancellation model, and zero maintenance — the vendor updates the model, not you. The brand-by-brand comparison of Krisp against NVIDIA Maxine and Dolby for a build-versus-buy decision is its own topic, covered in our Krisp vs Maxine vs Dolby article; the rule here is narrower — if you ship in the browser only and your noise is ordinary, RNNoise WASM is the cheapest correct answer, and the moment you need every platform, phone-in support, or other-voice removal, a managed engine earns its price.

Figure 4. The WebRTC fork: compile and wire an open model yourself (RNNoise/DeepFilterNet), or license a managed engine (Krisp/ai-coustics) and buy the operations away.

Where Fora Soft Fits In

We build the WebRTC products that depend on clean live audio — video conferencing platforms, telemedicine consults where a doctor must catch every word, e-learning classrooms, and AI meeting tools. In that work the integration choice matters more than the model name. For a browser-first conferencing feature on a tight budget, the right answer is often the built-in baseline first, then RNNoise in an AudioWorklet where users hit its limits — wired before the encoder, with the built-in suppressor switched off to avoid double processing. For products that take phone-in callers or run AI voice agents, server-side suppression on inbound audio is the only option that covers devices we do not control, and a managed engine with background-voice cancellation is usually worth its price there. The pattern we apply is to settle three questions in order — where in the pipeline, client or server, build or buy — before anyone touches a model, because a suppressor in the wrong slot is not a quality problem you can tune away, it is an architecture problem you have to rebuild.

References

IETF. RFC 7874 — WebRTC Audio Codec and Processing Requirements, May 2016, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc7874. Primary standards source for the mandatory WebRTC audio codecs (Opus and G.711) and the audio processing context. Establishes that audio in a WebRTC call is compressed with Opus before transmission — the fact that determines why noise suppression must run on raw audio before the encoder.
W3C. Media Capture and Streams (Recommendation), accessed 2026-06-02. https://www.w3.org/TR/mediacapture-streams/. Primary standards source for getUserMedia and the noiseSuppression and echoCancellation constrainable audio properties — the built-in baseline suppressor and the switch used to disable it when a custom model is in the path.
W3C. Web Audio API (Recommendation, 17 June 2021), accessed 2026-06-02. https://www.w3.org/TR/webaudio/. Primary standards source for AudioWorklet and AudioWorkletProcessor, the off-main-thread audio-processing mechanism that hosts a custom suppressor, and for the fixed 128-sample render quantum that drives the buffering arithmetic.
W3C. MediaStreamTrack Insertable Media Processing using Streams (Working Draft), accessed 2026-06-02. https://www.w3.org/TR/mediacapture-transform/. Primary standards source for MediaStreamTrackProcessor, which exposes raw microphone audio as AudioData chunks: the second raw-audio insertion point for a custom suppressor.
W3C. WebRTC Encoded Transform (Working Draft), accessed 2026-06-02. https://www.w3.org/TR/webrtc-encoded-transform/. Primary standards source establishing that RTCRtpScriptTransform operates on encoded frames (RTCEncodedAudioFrame) between the encoder and packetizer — the technical basis for why noise suppression cannot run there (the samples are already compressed).
W3C. WebRTC: Real-Time Communication in Browsers (Recommendation, 13 March 2025), accessed 2026-06-02. https://www.w3.org/TR/webrtc/. Primary standards source for the RTCPeerConnection model and the live-call platform the cleaned audio is sent over.
ITU-T. Recommendation G.114 — One-way transmission time, accessed 2026-06-02. https://www.itu.int/rec/T-REC-G.114. Primary standards source for the mouth-to-ear delay budget (≤150 ms one-way for natural interactive voice) that every suppressor's added latency, including the AudioWorklet buffering delay, must fit inside.
ITU-T. Recommendation P.835 — Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm, approved 13 November 2003, accessed 2026-06-02. https://www.itu.int/rec/T-REC-P.835. Primary standards source for the three-axis (SIG / BAK / OVRL) human evaluation of noise suppressors — the honest way to judge whether a chosen integration actually improved quality rather than just removed noise.
Jitsi (Gavrilescu, A.). Enhanced noise suppression in Jitsi Meet, 28 July 2022, accessed 2026-06-02. https://jitsi.org/blog/enhanced-noise-suppression-in-jitsi-meet/. First-party deployer source for the canonical RNNoise-in-AudioWorklet integration: RNNoise compiled to WASM via Emscripten, the move from main-thread ScriptProcessorNode to an off-thread AudioWorklet, the 128-sample render quantum versus RNNoise's 480-sample frame, and the circular-buffer fix.
LiveKit. Noise & echo cancellation (documentation), accessed 2026-06-02. https://docs.livekit.io/transport/media/noise-cancellation/. Vendor source for managed Krisp and ai-coustics integration into WebRTC: the frontend setProcessor track-processor pattern, the do-not-double-process guidance (models trained on raw audio misbehave on pre-processed input), the Krisp NC vs BVC model split, the SIP krisp_enabled server flag, and the published word-error-rate comparison (original 117.6%, Krisp BVC 23.5%, ai-coustics Voice Focus 2.1 S 7.1%).
Twilio. Noise Cancellation and Introducing Audio Processor APIs (documentation and engineering blog), accessed 2026-06-02. https://www.twilio.com/docs/video/noise-cancellation. Vendor source for the Krisp integration into Twilio Programmable Voice — the Krisp SDK loaded alongside the Voice JS SDK and run as a pre-processing step between the microphone and the audio encoder, entirely on-device.
Krisp. Real-Time AI Voice SDK and Noise Cancellation (developer documentation), accessed 2026-06-02. https://krisp.ai/developers/. First-party source for Krisp's on-device processing (no audio uploaded), platform coverage (Windows, macOS, Linux, Android, iOS, browsers via WebAssembly), the NC and BVC models, ~25 ms (≈15 ms lite) added latency, and the Discord / RingCentral / Zoho deployment base.
getLatka. Krisp revenue and company profile, 2025/2026, accessed 2026-06-02. https://getlatka.com/companies/krisp.ai. Secondary source for Krisp's scale — ~$37.7M reported 2025 revenue, ~343-person team, Berkeley headquarters, founded 2017 — used only for company context, not technical claims.
NVIDIA. Maxine Audio Effects SDK — Background Noise Removal (now NVIDIA AI for Media), accessed 2026-06-02. https://developer.nvidia.com/maxine. Vendor source for GPU-accelerated server-side noise removal (BNR / BNR 2.0) and room-echo removal — the heavy-engine example of the server-side insertion point, including its use as an audio-cleaning stage ahead of ASR.