Why this matters

If you run a video conferencing product, a contact centre, a telemedicine platform, or an online classroom, "there's too much background noise on the call" is one of the top three complaints you will ever read, and "the noise filter is eating my voice" is the complaint right behind it. Both are noise-suppression problems, and they pull in opposite directions. This article is for the product manager, founder, or operations lead who needs to understand the trade-off well enough to choose a strategy, set a budget, and ask an engineer a sharp question — not to write the filter themselves. A senior engineer will also find every claim traced to the original paper, the WebRTC source, the vendor's own documentation, or the relevant ITU-T Recommendation. By the end you will know why the free filter in the browser cannot remove a barking dog, what a model like Krisp or RNNoise actually does differently, what it costs in CPU and latency, and where each one belongs.


The problem: the listener hears the room, not just the voice

Start with what the listener experiences, because the whole solution follows from it. A microphone is honest to a fault: it captures everything that reaches it, not just the voice you care about. So along with the talker it picks up the laptop fan, the air conditioner, the keyboard, the dog, the children in the next room, the espresso machine, and the traffic through an open window. The listener then has to do the separating in their own head — and on a long call, that work is exhausting. The job of noise suppression is to do that separating for them, before the audio is ever sent, so the voice arrives clean.

We need one number to talk about how hard that job is, and that number is the signal-to-noise ratio, written SNR. It is the gap, in decibels, between how loud the voice is and how loud the noise is. A decibel, written dB, is a ratio on a logarithmic scale: every 6 dB roughly doubles or halves the amplitude. A high SNR — say 30 dB — means the voice towers over the noise and the call sounds fine with no help at all. A low SNR — say 5 dB — means the noise is nearly as loud as the voice, and without suppression the listener struggles. Most real video calls sit closer to 20 dB than to 0 dB, which is the comfortable middle where suppression has room to help without much risk. The hard cases — a call from a moving car, a nurse at a busy station, an agent in an open-plan room — are the low-SNR ones, and they are exactly where the old and new methods part ways.
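The decibel arithmetic above can be checked in a few lines. This is a minimal sketch, not part of any library: the function names are invented for this article, and it assumes the levels are RMS amplitudes, which is why it uses 20·log10 rather than the 10·log10 used for power.

```python
import math


def amplitude_ratio_to_db(ratio: float) -> float:
    """Convert an amplitude (not power) ratio to decibels."""
    return 20.0 * math.log10(ratio)


def db_to_amplitude_ratio(db: float) -> float:
    """Inverse of amplitude_ratio_to_db."""
    return 10.0 ** (db / 20.0)


def snr_db(voice_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in dB from the RMS amplitudes of voice and noise."""
    return amplitude_ratio_to_db(voice_rms / noise_rms)


doubling_db = amplitude_ratio_to_db(2.0)    # about 6.02 dB: "every 6 dB doubles"
high_snr_ratio = db_to_amplitude_ratio(30)  # voice about 31.6x the noise amplitude
low_snr_ratio = db_to_amplitude_ratio(5)    # voice only about 1.8x the noise amplitude
```

Read this way, the 30 dB call has a voice roughly thirty times the amplitude of the noise, while the 5 dB call has a voice less than twice as large, which is why the second one is tiring to listen to.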

One more idea before the methods, because it governs every design choice that follows. In a real-time call you cannot wait. Each chunk of audio (typically 10 or 20 milliseconds of sound, called a frame) has to be cleaned and sent almost immediately, because a human conversation falls apart once the round-trip delay climbs past roughly 300 to 400 milliseconds. So a real-time noise suppressor may "look ahead" past the current frame by a few tens of milliseconds at most, and often by only about one frame. That single constraint (clean it now, with almost no peek at the future) is why noise suppression for calls is harder than noise removal for a podcast you can process overnight, and it is the constraint every method on this page is measured against.

Where noise suppression sits in the chain

Noise suppression does not work alone. In the WebRTC capture chain — the open-source audio cleanup that runs in Chrome, Edge, and most native voice apps — it sits in a fixed order with its neighbours, and the order is deliberate. First a high-pass filter removes the sub-audible rumble. Then acoustic echo cancellation removes the far end's voice leaking back through your speaker, covered in Acoustic Echo Cancellation (AEC): How It Really Works. Then noise suppression — this article — strips the background. Then automatic gain control sets the final level, covered in Automatic Gain Control (AGC): Keeping Voice Level Consistent. The whole sequence is mapped end to end in The WebRTC Audio Pipeline End-to-End.

The reason noise suppression runs before gain control matters for the complaints in your inbox. If the gain stage boosted the signal first, it would also boost the noise, and the suppressor downstream would have a harder, louder problem to solve. By cleaning first and amplifying second, the chain makes sure the loud thing the listener hears is the voice, not an amplified café.

Diagram of the WebRTC capture chain as left-to-right boxes: microphone, then high-pass filter, then acoustic echo cancellation, then noise suppression highlighted as the focus of this article, then automatic gain control, then the encoder that sends the audio. An arrow runs through all of them. A note beneath noise suppression explains it must clean the current frame with almost no look-ahead because a real-time call cannot tolerate added delay, and a note explains noise suppression runs before gain control so the gain stage amplifies an already-cleaned voice rather than the noise.

Figure 1. Where noise suppression sits. It runs after echo cancellation and before gain control, so the level stage amplifies a voice that has already had its background removed.

The classical method: estimate the noise, then subtract it

The old family of noise suppressors — the one still inside every browser — rests on one clever observation. Between words and sentences, when nobody is talking, the microphone is capturing pure noise. If the system listens during those gaps, it can build a picture of what the noise looks like: how much energy it carries at each frequency. Then, when the talker speaks again, it can subtract that noise picture from the mixture and keep what is left, which should be mostly voice.

To do this, the suppressor first breaks each frame of sound into its frequencies (bass, mid, treble, and everything between) using a mathematical transform, typically a fast Fourier transform. Picture a graphic equalizer with dozens of sliders, one per frequency band. During the silent gaps, the system measures how much noise sits in each band; this is the noise spectral estimate. When the voice returns, it turns each band's slider down by the amount of noise estimated for that band, leaving the voice mostly intact. The earliest published version of this, spectral subtraction, was described by Steven Boll in 1979. A closely related approach, Wiener filtering applied to noisy speech, was set out by Jae Lim and Alan Oppenheim the same year, and a more refined statistical version, the MMSE short-time spectral amplitude estimator, came from Yariv Ephraim and David Malah in 1984. WebRTC's noise suppressor is a direct descendant: it estimates the noise spectrum, computes a gain for each frequency band, and applies it, with a tunable aggressiveness from mild to very strong.
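The estimate-then-subtract idea can be sketched on per-band power values. This is plain power spectral subtraction written for illustration, not WebRTC's actual algorithm. It assumes the transform has already produced one power value per band, and the function names, the smoothing factor, and the gain floor are all assumptions introduced here.

```python
def update_noise_estimate(noise_est, frame_power, smoothing=0.9):
    """During a silent gap, blend the new frame into the running noise estimate."""
    return [smoothing * n + (1.0 - smoothing) * p
            for n, p in zip(noise_est, frame_power)]


def subtraction_gains(frame_power, noise_est, min_gain=0.1):
    """Per-band amplitude gain from power subtraction, clamped to [min_gain, 1]."""
    gains = []
    for p, n in zip(frame_power, noise_est):
        clean = max(p - n, 0.0)                   # estimated voice power in this band
        g = (clean / p) ** 0.5 if p > 0.0 else 0.0
        gains.append(min(1.0, max(min_gain, g)))  # the floor limits over-suppression
    return gains


# Three bands: pure noise, some voice, mostly voice.
noise = [1.0, 1.0, 1.0]
frame = [1.0, 5.0, 101.0]
gains = subtraction_gains(frame, noise)  # low gain where noise dominates, near 1 where voice does
```

The band that contains only noise is pulled down to the floor, while the band dominated by voice is left almost untouched. Everything depends on `noise` still describing the noise when the voice arrives, which is the assumption the next section takes apart.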

The appeal is obvious. It is tiny — a few kilobytes of code — and fast enough to run on a decade-old phone without noticing. It needs no training data and no model file. And for the case it was built for — steady, droning noise like a fan or an air conditioner that barely changes from one second to the next — it works well, because a steady noise measured during a gap is still the same noise a moment later when the talker speaks.

Why the classical method has a ceiling

Two failures are baked into the idea, and naming them tells you exactly when you have outgrown it.

The first failure is non-stationary noise — noise that changes fast. The whole method assumes the noise measured during the last silent gap still describes the noise now. That holds for a fan. It fails completely for a dog bark, a door slam, a keyboard clack, a baby's cry, or a second person talking, because those sounds arrive during the voice and look nothing like the quiet hum measured a moment earlier. The classical suppressor has no estimate for them, so it lets them straight through. This is why the free browser filter removes your air conditioner but does nothing about the dog — and why your users are confused about what "noise suppression" even means.

The second failure is the artifact it leaves behind, called musical noise. When the subtraction is imperfect — and it always is — isolated frequency bands survive here and there while their neighbours are cut, leaving little tones that flutter in and out. Listeners describe it as a "watery", "warbling", or "underwater" sound. Turn the aggressiveness up to chase more noise and you get more musical noise; turn it down to avoid the artifact and more noise survives. That tension — cut harder and damage the voice, or cut gently and leave the noise — is the permanent ceiling of the classical approach, and it is the gap the neural methods were built to close.

The pitfall, stated plainly. When a customer says "the call still has my dog barking even with noise suppression on," they have not found a bug — they have found the edge of the classical method. The browser's built-in suppressor is a stationary-noise tool. Barks, clatters, and other voices are non-stationary and need a trained neural model. The fix is not a setting; it is a different class of suppressor. Telling a customer to "turn up noise suppression" here makes it worse, because more aggressive subtraction means more musical-noise damage to their own voice while the bark still gets through.

The neural method: learn what voice looks like, keep only that

The modern family flips the question. Instead of measuring the noise and subtracting it, a neural suppressor is trained ahead of time on thousands of hours of examples — clean voice mixed with every kind of noise — until it learns, deep in its weights, what human speech looks like and what it does not. At run time it does not need a quiet gap to estimate anything. It looks at each frame and decides, frequency band by frequency band, how much of what it sees is voice worth keeping and how much is noise worth removing. Because it has learned the shape of a bark, a clatter, and a keyboard, it can remove them even though they arrive in the middle of a sentence — the exact case the classical method cannot touch.

The cost of this power is real and worth stating up front. A neural model is bigger than a few kilobytes of subtraction code; it has to run a network of arithmetic on every frame, which costs CPU or a GPU; and it has to be trained, which is a serious undertaking. The art of the field over the past decade has been getting that cost low enough to run in real time on an ordinary laptop or phone — and that is exactly the story of RNNoise, Krisp, and NVIDIA's noise removal.

Two-panel comparison diagram. Left panel, labelled classical, shows the noise being measured during a silent gap then subtracted from the voice, with a note that it only works on steady noise and leaves a watery musical-noise artifact. Right panel, labelled neural, shows a pre-trained model that has learned the shape of voice deciding per frequency band how much to keep, with a note that it removes sudden non-stationary noise like a dog bark or a door slam but costs more CPU and a model file. A caption summarises that classical subtracts a measured noise estimate while neural keeps a learned voice estimate.

Figure 2. The two families. Classical suppression measures the noise and subtracts it, and only works on steady noise. Neural suppression has learned what voice looks like and keeps only that, so it can remove sudden noises the classical method lets through.

RNNoise: the hybrid that made neural suppression small

The breakthrough that made neural noise suppression practical for ordinary devices is RNNoise, published by Jean-Marc Valin — one of the engineers behind the Opus codec — at Mozilla and Xiph.Org in 2017–2018. Its key idea is not to throw a giant neural network at the whole problem, but to keep the cheap signal processing that already works and use a small neural network only for the hard part: deciding how much to suppress in each band. Valin calls this a hybrid DSP/deep-learning approach, and it is the reason the model is tiny.

Here is how it works in plain terms. Rather than asking the network to look at every one of the hundreds of frequency values in a frame — which would need a huge, slow model — RNNoise groups frequencies into 22 bands spaced the way the human ear hears, wider at high frequencies where the ear has poorer resolution. The network's only job is to output one gain per band — a number between 0 and 1 that says "keep all of this band" (1.0) or "cut this band entirely" (0.0) or something in between. Think of it as a 22-slider equalizer whose sliders the network adjusts many times a second to let the voice through and pull the noise down. Because the gains are bounded between 0 and 1, the model can never add sound that was not there, which rules out a whole class of errors.
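The "equalizer with sliders" picture translates almost directly into code. The sketch below is an illustration under stated assumptions, not RNNoise's implementation: the band edges are invented, the bands have hard boundaries (a production implementation would likely smooth the gains across neighbouring bands), and the gains are supplied by hand rather than by a network.

```python
def clamp01(x: float) -> float:
    """Bound a gain to [0, 1] so the suppressor can only attenuate, never add."""
    return min(1.0, max(0.0, x))


def apply_band_gains(bin_magnitudes, band_edges, band_gains):
    """Scale each frequency bin by the gain of the band it falls in.

    Band i covers bin indices band_edges[i] up to, but not including,
    band_edges[i + 1].
    """
    out = list(bin_magnitudes)
    for i, gain in enumerate(band_gains):
        g = clamp01(gain)
        for k in range(band_edges[i], band_edges[i + 1]):
            out[k] = bin_magnitudes[k] * g
    return out


# Hypothetical layout: a narrow low band (2 bins) and a wider high band (4 bins).
bins = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
edges = [0, 2, 6]
cleaned = apply_band_gains(bins, edges, [0.5, 1.7])  # 1.7 is clamped to 1.0
```

Because every gain passes through `clamp01`, an out-of-range network output can at worst leave a band unchanged; it cannot make any bin louder than it was.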

The network at the centre is a recurrent neural network, abbreviated RNN — a network with a short memory of recent frames, which matters because telling voice from noise needs context over time, not just one instant. Specifically it uses gated recurrent units (GRUs), a memory cell that can hold a pattern across many frames. Three small GRU layers do most of the work, fed by 42 carefully chosen input features including the band energies, the pitch of the voice, and how strongly voiced the sound is. A separate trick called pitch filtering — a comb filter tuned to the talker's vocal pitch — cleans the noise that hides between the harmonics of a voice, something the 22 coarse bands are too wide to catch on their own.
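To make "a memory cell that can hold a pattern across many frames" concrete, here is one time step of a generic GRU in plain Python. It is a textbook sketch, not RNNoise's code: the parameter names and the tiny sizes are assumptions, and libraries differ on which side of the blend the update gate sits.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h, p):
    """One GRU step: x is this frame's features, h the state from earlier frames."""
    z = [sigmoid(a + b + c)  # update gate: how much of the state to overwrite
         for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c)  # reset gate: how much old state feeds the candidate
         for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b + c)
            for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh), p["bh"])]
    # Convention used here: z near 0 keeps the old state, z near 1 replaces it.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def zero_params(n_in: int, n_hidden: int):
    """All-zero weights, only useful for demonstrating the gates."""
    w = lambda cols: [[0.0] * cols for _ in range(n_hidden)]
    return {"Wz": w(n_in), "Uz": w(n_hidden), "bz": [0.0] * n_hidden,
            "Wr": w(n_in), "Ur": w(n_hidden), "br": [0.0] * n_hidden,
            "Wh": w(n_in), "Uh": w(n_hidden), "bh": [0.0] * n_hidden}


params = zero_params(n_in=3, n_hidden=2)
params["bz"] = [-100.0, -100.0]  # force the update gate shut
held = gru_step([0.2, 0.1, 0.7], [1.0, -2.0], params)  # state passes through unchanged
```

With the update gate forced shut, the state survives the frame untouched no matter what the input is. A trained network learns when to open and close that gate, which is how it carries context such as "a voice has been present for the last half second".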

The numbers are what make RNNoise remarkable. The trained model fits in about 85 kilobytes — small enough to ship inside an app without anyone noticing — because the weights are stored as 8-bit values rather than 32-bit floats. It runs roughly 60 times faster than real time on a desktop x86 CPU and about 7 times faster than real time on a Raspberry Pi 3, with no GPU required. And it obeys the real-time rule: it looks ahead only about 10 milliseconds. The code is open source under a permissive BSD licence. RNNoise is the suppressor behind the noise-removal feature in OBS Studio and many open-source voice tools, and its design — small hybrid model, per-band gains, GRU memory — became the template the commercial products refined.
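A "times faster than real time" figure is easier to budget once it is converted to time per frame and to a share of one CPU core. The sketch below does only that conversion; the 10 ms frame length is an assumption taken from the typical frame sizes mentioned earlier, and the function names are invented for this article.

```python
def per_frame_compute_ms(frame_ms: float, realtime_factor: float) -> float:
    """Compute time spent on one frame, given a speed-up over real time."""
    return frame_ms / realtime_factor


def core_share(realtime_factor: float) -> float:
    """Fraction of one CPU core kept busy by the suppressor."""
    return 1.0 / realtime_factor


desktop_ms = per_frame_compute_ms(10.0, 60.0)  # roughly 0.17 ms per 10 ms frame
pi3_ms = per_frame_compute_ms(10.0, 7.0)       # roughly 1.4 ms per 10 ms frame
desktop_share = core_share(60.0)               # under 2% of one core
pi3_share = core_share(7.0)                    # about 14% of one core
```

The same conversion works for any vendor figure quoted as a real-time factor, which makes it a useful sanity check when comparing a small model against a large one on the weakest device you support.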

Signal-flow diagram of the RNNoise hybrid architecture, left to right. Noisy audio enters and is split into 22 frequency bands spaced on the Bark scale that matches human hearing. Forty-two features including band energies and pitch feed three gated recurrent unit layers, the small neural network. The network outputs one gain between zero and one per band. Those gains are applied to the bands, and a pitch comb filter cleans the noise hiding between voice harmonics, producing clean audio. Annotations note the model is about 85 kilobytes, runs about 60 times faster than real time on a desktop CPU and 7 times on a Raspberry Pi 3, and looks ahead only 10 milliseconds.

Figure 3. The RNNoise hybrid. Cheap signal processing splits the audio into 22 perceptual bands; a small GRU network outputs one gain per band; a pitch filter cleans between the harmonics. The whole model is about 85 kB and runs many times faster than real time on a plain CPU.

Krisp: the commercial on-device model

Krisp is the best-known commercial neural noise suppressor, sold both as a desktop app and, more relevant for product teams, as an SDK you embed in your own application. It pushes the neural approach further than RNNoise on quality, at the cost of a larger model and more CPU, and it adds capabilities the open-source baseline does not have.

Three facts about Krisp are worth a product team's attention. First, it runs on the device, not in the cloud: the audio is processed locally and, per Krisp's own documentation, voice data is not uploaded to a server. For telemedicine and finance, where sending raw patient or customer audio to a third party is a compliance problem, on-device processing is often the deciding factor. Second, Krisp ships its models in two sizes — a Small model that is the SDK default and is built to run on lower-end devices roughly seven times faster than the Big model, and a Big model that delivers the highest quality at a higher CPU cost. That choice — quality versus CPU — is one you make per device class, and we will return to it. Third, Krisp suppresses noise in both directions: outbound (your microphone, cleaned before you send it) and inbound (the other person's audio, cleaned as it arrives, which helps when they have no good filter).

On latency, Krisp's documentation gives a concrete number worth quoting because it sets expectations: for a 10-millisecond frame at a 16 kHz sample rate, the algorithmic latency is about 25 milliseconds. That is the delay the suppressor itself adds, and it is small enough for real-time calls but not zero — it is part of the mouth-to-ear budget you manage across the whole pipeline. Krisp's models are tuned for near-field capture, meaning a microphone within about half a metre of the mouth; a talker across the room is a harder case and the result depends on distance, echo, and SNR. The SDK runs on desktop, mobile, and — built on WebAssembly — directly in the browser.
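The frame and latency figures above can be turned into a quick budget check. This is a rough sketch under two assumptions that are not from Krisp's documentation: that each talker's outbound path contains one suppressor, so a round trip passes through two, and that the round-trip budget is the 300 to 400 millisecond range mentioned earlier. The function names are invented for this article.

```python
def samples_per_frame(frame_ms: int, sample_rate_hz: int) -> int:
    """Number of audio samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def suppressor_share_of_round_trip(algorithmic_latency_ms: float,
                                   round_trip_budget_ms: float,
                                   suppressors_in_loop: int = 2) -> float:
    """Fraction of the round-trip budget consumed by noise suppression alone."""
    return suppressors_in_loop * algorithmic_latency_ms / round_trip_budget_ms


frame_samples = samples_per_frame(10, 16_000)                # 160 samples
share_tight = suppressor_share_of_round_trip(25.0, 300.0)    # about 17% of the budget
share_relaxed = suppressor_share_of_round_trip(25.0, 400.0)  # 12.5% of the budget
```

A share in this range is affordable, but it is spent before the network, the codec, and the jitter buffer have taken theirs, which is why the suppressor's latency belongs in the same spreadsheet as every other stage.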

NVIDIA's noise removal: the GPU-accelerated path

The third name product teams meet is NVIDIA's noise removal, which has a small history worth knowing because the names changed. It launched in 2020 as RTX Voice, a beta that used the AI-acceleration hardware (the Tensor Cores) on NVIDIA's RTX graphics cards to run a denoising network. NVIDIA later extended it to older GTX cards and folded it into the NVIDIA Broadcast app as the "Noise Removal" feature. For developers, the same technology is available through the Maxine Audio Effects SDK (Maxine is now branded "NVIDIA AI for Media"), which offers noise removal, room-echo removal, and related effects as a library you can build into an application.

The distinguishing trait is where it runs. RNNoise and Krisp are built to run on the CPU so they work on any machine; NVIDIA's path runs the heavier network on the GPU, which lets it use a larger, higher-quality model without taxing the processor that is also running the meeting. The trade is portability: it needs an NVIDIA GPU. That makes it a natural fit for a streamer, a broadcaster, or a content-creation workstation that already has a powerful GPU, and a poor fit for a phone or a thin laptop. For a product team, the rule of thumb is simple: if your users are on GPU-equipped desktops and you want the highest quality, NVIDIA's path is attractive; if your users are on everything, a CPU model like Krisp or RNNoise is the safer default.

A worked example: how aggressively should you suppress?

Every noise suppressor has a dial — call it the aggressiveness or the suppression strength — and the single most important thing to understand about it is that turning it up is not free. More suppression removes more noise and removes more voice. The art is finding the setting where the listener gains more from the quieter background than they lose from the slightly thinner voice. Numbers make this concrete.

Suppose a talker arrives at a signal-to-noise ratio of 10 dB — the voice is 10 dB louder than the noise, a fairly noisy call. A neural suppressor can comfortably pull the noise down by, say, 18 dB while leaving the voice nearly untouched. Walk the arithmetic:

new noise level = old noise level − suppression applied to noise
                = (−10 dB relative to voice) − 18 dB
                = −28 dB relative to the voice

new effective SNR = 28 dB  (voice now towers 28 dB over the residual noise)

The call went from a strained 10 dB SNR to a comfortable 28 dB SNR, and the listener stops working to hear. Now push the dial too hard and the model starts removing voice along with noise. If the most aggressive setting cuts an extra 6 dB of noise but in doing so attenuates the consonants (the t, k, and s sounds that carry intelligibility), the residual noise is lower but the voice is now muffled and harder to understand. The listener traded a cleaner background for a muffled voice, which is usually a bad trade on a call where being understood is the whole point. This is why good defaults sit in the middle, and why the dial belongs to an engineer, not to an end user twisting it in the moment.
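The worked arithmetic above fits in one function, and writing it out shows why SNR alone cannot judge an aggressive setting. In this sketch the function name is invented, and the 3 dB of voice loss in the second case is a hypothetical figure chosen only to illustrate the point.

```python
def snr_after_suppression(snr_db: float, noise_cut_db: float,
                          voice_cut_db: float = 0.0) -> float:
    """Effective SNR once the suppressor has attenuated noise and voice."""
    return snr_db + noise_cut_db - voice_cut_db


# The case from the text: 10 dB in, 18 dB of noise removed, voice untouched.
moderate = snr_after_suppression(10.0, 18.0)

# Over-aggressive: 6 dB more noise removed, at the cost of a
# hypothetical 3 dB taken out of the voice.
aggressive = snr_after_suppression(10.0, 24.0, voice_cut_db=3.0)
```

The aggressive setting reports the higher SNR of the two, yet it is the one that damaged the consonants. A single number rewards exactly the behaviour listeners dislike, so the comparison needs a measure that scores voice and background separately.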

The same trade-off has a formal way to measure it, and it is worth naming because it appears in every serious comparison. The international method for judging a noise suppressor, ITU-T Recommendation P.835, asks listeners to rate three things separately: the speech signal on its own (called SIG — is the voice damaged?), the background on its own (called BAK — how intrusive is the leftover noise?), and the overall quality (called OVRL). A suppressor that scores well on BAK but poorly on SIG is one that kills noise by hurting the voice — the over-aggressive case above. The best models score well on all three, and the whole point of the three-way split is to stop a vendor from hiding voice damage behind an impressive noise number. A companion crowdsourced version, ITU-T P.808, runs the same idea at scale, and a machine-learning model called DNSMOS P.835 predicts those scores automatically so teams can test thousands of clips without a listening panel.
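A listening test of this kind reduces to three averages per suppressor, and the over-aggressive case shows up as a gap between them. The sketch below assumes ratings on a five-point scale collected as (SIG, BAK, OVRL) tuples; the function names and the two thresholds are hypothetical choices for illustration, not values from the Recommendation.

```python
from statistics import mean


def p835_scores(ratings):
    """Average listener ratings into SIG, BAK and OVRL mean opinion scores."""
    return {
        "SIG": mean(r[0] for r in ratings),
        "BAK": mean(r[1] for r in ratings),
        "OVRL": mean(r[2] for r in ratings),
    }


def hides_voice_damage(scores, bak_good=4.0, sig_poor=3.0):
    """Flag a suppressor whose quiet background was bought by hurting the voice."""
    return scores["BAK"] >= bak_good and scores["SIG"] < sig_poor


over_aggressive = p835_scores([(2, 5, 3), (3, 4, 3), (2, 5, 2)])
balanced = p835_scores([(4, 4, 4), (5, 4, 4)])

flagged = hides_voice_damage(over_aggressive)  # True: great BAK, poor SIG
not_flagged = hides_voice_damage(balanced)     # False
```

The same check can be run on predicted scores from an automatic metric during development, with a listening panel reserved for the final candidates.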

How the field measures progress: the DNS Challenge

If you want one place to see the state of the art, it is the Deep Noise Suppression (DNS) Challenge, run by Microsoft Research as a recurring competition at the speech conferences Interspeech and ICASSP from 2020 onward. Each round publishes a large open dataset of clean speech and noise, a standard test set, and a scoring method based on the ITU-T P.835 scales above, then ranks the entries. The challenge did two useful things for everyone, not just the entrants: it created shared, public training data so any team could build a competitive model, and it set an honest, standardised yardstick so "our model is better" became a claim you could check. The objective metric the challenge popularised, DNSMOS P.835, is now the everyday tool teams use to compare suppressors during development. For a product team the takeaway is not the leaderboard but the trend: neural suppression has improved year over year on a public benchmark, and the gap between a tuned commercial model and the classical browser filter is now large and well documented.

Choosing a noise suppressor: a practical guide

This is the most actionable part of the article, so here it is as a table you can act on. The right choice depends on three things: how much CPU or GPU you have, how hard the noise is, and whether you can send audio off the device.

| Use case | Recommended approach | Why |
| --- | --- | --- |
| General video meeting, mixed devices | WebRTC built-in NS, or RNNoise for a step up | Free, tiny, handles the common steady noise; RNNoise adds non-stationary noise removal for ~85 kB |
| Contact centre / agents in open-plan rooms | Commercial neural SDK (e.g. Krisp), CPU model | Babble and other voices are non-stationary; on-device keeps customer audio private |
| Telemedicine / finance, privacy-critical | On-device neural model only | Sending raw patient or customer audio to a cloud filter is a compliance risk |
| Podcast / recorded content, no time pressure | Largest neural model you can afford, offline | No real-time constraint, so use a big look-ahead model for maximum quality |
| Streamer / creator on a GPU workstation | NVIDIA noise removal (Maxine / Broadcast) | A powerful GPU is already present; run a bigger, higher-quality model off the CPU |
| Embedded / low-power device | Classical NS, or RNNoise if CPU allows | A few kilobytes and minimal CPU; RNNoise still fits where a big model will not |
| Music or instrument capture | Turn noise suppression OFF | A suppressor treats sustained instrument tones as noise and damages the recording |

Two rules sit underneath the table. First, match the tool to the noise: steady noise is solved by the classical method for free, and you only pay for a neural model when you need to remove sudden, non-stationary sounds. Second, match the model size to the device: a Big model on a thin laptop or a phone will burn battery and may stutter, so a Small model or RNNoise is often the right default with the Big model reserved for capable machines. The last row is the one teams forget: for music, the right amount of noise suppression is none, because a suppressor trained on speech will hear a held violin note as noise and pull it down. Expose an off switch for creative use, the same way you would for Automatic Gain Control.
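The two rules, plus the music exception, can be written as a small decision function. This is a sketch of the reasoning in this section rather than a recommendation engine: the function name, the flags, and the returned labels are all hypothetical, and a real product would add privacy and licensing checks.

```python
def choose_suppressor(*, capturing_music: bool = False,
                      noise_is_non_stationary: bool = False,
                      has_nvidia_gpu: bool = False,
                      weak_device: bool = False) -> str:
    """Pick a suppressor class from the situation, following the two rules."""
    if capturing_music:
        return "off"              # a speech model would damage sustained tones
    if not noise_is_non_stationary:
        return "classical"        # steady noise is handled for free
    if has_nvidia_gpu:
        return "neural-gpu"       # bigger model, kept off the CPU
    return "neural-small" if weak_device else "neural-big"


meeting_with_fan = choose_suppressor()
open_plan_on_phone = choose_suppressor(noise_is_non_stationary=True, weak_device=True)
violin_lesson = choose_suppressor(capturing_music=True, noise_is_non_stationary=True)
```

The music check comes first on purpose: it overrides everything else, which mirrors the point that the off switch is the setting teams most often forget to ship.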

Settings to expose, and settings to hide

The same discipline that applies to gain control applies here: expose the intent, hide the mechanism. A non-engineer can tell you whether they are in a meeting or playing music; they cannot tune a suppression curve by ear in the moment without making it worse.

| Setting | Expose to users? | Why |
| --- | --- | --- |
| Noise suppression on/off | Yes | Users need it off for music, instruments, or pro-audio capture |
| A simple "low / medium / high" preset | Sometimes | Acceptable as three safe presets; never as a raw numeric dial |
| Model size (Small vs Big) | No (auto-select by device) | The wrong choice stutters on weak hardware or wastes quality on strong hardware |
| Raw suppression strength (dB) | No | Too high muffles the voice; only an engineer should set the default |
| Inbound (speaker-side) suppression | Sometimes | Useful when the other side has no filter; keep it a simple toggle |
| Which algorithm (classical / RNNoise / Krisp / NVIDIA) | No | An implementation choice the product makes per device, not the user |

Where Fora Soft fits in

We have built real-time audio into video conferencing, telemedicine, e-learning, contact-centre, and live-shopping products since 2005, and "too much background noise" versus "the filter is eating my voice" is a tension every one of those products has to resolve. In telemedicine the constraint is usually privacy first — a patient's audio cannot leave the device for a cloud filter — so an on-device neural model is the only acceptable answer, and we tune its aggressiveness so a quiet clinic stays clean without thinning the clinician's voice. In contact centres the noise is other agents talking, which is non-stationary babble that defeats the classical filter, so a trained model earns its CPU. Our work is mostly in choosing the right suppressor per device class, deciding when the free WebRTC filter is enough and when a commercial SDK is worth the integration, wiring the suppressor correctly into the WebRTC chain so it runs before gain control, and testing deliberately for voice damage with the SIG/BAK/OVRL split before a customer ever hears it. We do not train noise-suppression models from scratch; we make the right one behave on the messy mix of rooms and devices your users actually have.

References

  1. J.-M. Valin, A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement, IEEE International Workshop on Multimedia Signal Processing (MMSP), 2018 (arXiv:1709.08243). The RNNoise paper: 22 Bark-scale bands, three GRU layers, 42 input features, per-band gains bounded 0–1, pitch (comb) filtering, ~85 kB 8-bit model, ~60× real time on x86 and ~7× on a Raspberry Pi 3, ~10 ms look-ahead. https://jmvalin.ca/papers/rnnoise_mmsp2018.pdf
  2. RNNoise project, RNNoise: Learning Noise Suppression (demo and technical write-up), Xiph.Org / Mozilla, 2017. First-party description of the hybrid DSP/NN design, the GRU choice over LSTM, the musical-noise avoidance from per-band gains, and the BSD-licensed C implementation. https://jmvalin.ca/demo/rnnoise/
  3. ITU-T Recommendation P.835, Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm (11/2003). The SIG / BAK / OVRL three-scale method that separates voice damage from residual-noise intrusiveness — the standard every serious suppressor comparison uses. https://www.itu.int/rec/T-REC-P.835
  4. ITU-T Recommendation P.808, Subjective evaluation of speech quality with a crowdsourcing approach (06/2021). The crowdsourced companion to P.800/P.835 used to scale listening tests for noise-suppression evaluation. https://www.itu.int/rec/T-REC-P.808
  5. WebRTC source, modules/audio_processing/ns/ — the classical noise-suppressor module: per-frame spectral analysis, noise spectral estimate, per-band gain (Wiener-style) application, and tunable suppression aggressiveness. Accessed 2026-06-06. https://webrtc.googlesource.com/src/+/refs/heads/main/modules/audio_processing/ns/
  6. S. F. Boll, Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 1979. The original spectral-subtraction method and the source of the musical-noise artifact discussed here. https://ieeexplore.ieee.org/document/1163209
  7. Y. Ephraim and D. Malah, Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1984. The refined statistical estimator that reduced musical noise relative to plain spectral subtraction. https://ieeexplore.ieee.org/document/1164453
  8. C. K. A. Reddy et al., ICASSP 2022 Deep Noise Suppression Challenge, Microsoft Research, 2022 (arXiv:2202.13288). The DNS Challenge series: open datasets, P.835-based scoring, and the year-over-year benchmark for neural suppressors. https://arxiv.org/abs/2202.13288
  9. C. K. A. Reddy, V. Gopal, R. Cutler, DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors, Microsoft Research, 2021 (arXiv:2110.01763). The learned predictor of P.835 SIG/BAK/OVRL used as the everyday development metric. https://arxiv.org/abs/2110.01763
  10. Krisp, Noise Cancellation (NC) — SDK documentation. On-device processing (no audio upload), Small (default) vs Big models (Small ~7× faster), outbound and inbound NC, native 8/16/32 kHz support, near-field (<50 cm) tuning, and ~25 ms algorithmic latency at a 10 ms frame / 16 kHz. Accessed 2026-06-06. https://sdk-docs.krisp.ai/docs/noisecancellation
  11. NVIDIA, Maxine Audio Effects SDK Programming Guide (NVIDIA AI for Media). GPU-accelerated noise removal, room-echo removal, and related effects; the developer path behind RTX Voice and the NVIDIA Broadcast "Noise Removal" feature. Accessed 2026-06-06. https://docs.nvidia.com/deeplearning/maxine/audio-effects-sdk/index.html