How we replaced Apple's voice processing with a neural network because of Spatial Audio.

A macOS dictation app, an M1 Max with Spatial Audio enabled, three to six seconds of every recording silently gone. Two wrong hypotheses, one half-right one, then a CoreAudio log told us we'd been engineering around a feature we didn't need.

~2,800 words · 11 min read
Tech covered: macOS, voice AI, STT, CoreAudio
Stack: Swift + RNNoise + Whisper-style backend

The setup

We've been working on a macOS dictation app called Swifly. The product loop is short: hold a global hotkey, speak, release the hotkey, get clean text back from a Whisper-style speech-to-text backend, inserted at the cursor.

When you build this kind of app, microphone audio quality matters more than you'd expect. Real users record in the wild — keyboard clatter, fans humming, café espresso machines, kids in the background, a 70 dB HVAC unit two metres away. Speech recognition is robust, but it's not magic. Noisy input means dropped words, hallucinated phrases, weird capitalisation. Cleaning the audio before it reaches the model is one of the highest-leverage things you can do for perceived accuracy of the whole pipeline.

So when our client asked us to ship “on-device noise suppression” as part of the next release, we said yes. The plan sounded uncontroversial: use Apple's official voice-processing pipeline, ship in a couple of days, move on.

It took two weeks. We threw away most of the work. We ended up with a much better solution.

The official path

Apple ships a single public API for voice cleanup on macOS, called VoiceProcessingIO — VPIO for short. You enable it on an AVAudioEngine input node with one line:

swift · VPIO enable

try engine.inputNode.setVoiceProcessingEnabled(true)

Behind that flag, three things happen at once: AEC (Acoustic Echo Cancellation, removes speaker bleed into the mic), NS (Noise Suppression, removes background noise), AGC (Automatic Gain Control, levels out volume). This is the same pipeline Apple uses for FaceTime. It's the canonical answer on every Apple Developer thread and WWDC session about microphone capture.

We built it. Wired up the engine, set the device, configured AGC, disabled the speaker output volume to keep audio from leaking back, added a tap on the input node, started the engine. Standard stuff.
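The setup described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the app's actual code: `startVoiceProcessedCapture`, its parameters, and the 1024-frame tap size are hypothetical, and input-device selection is left out entirely. It needs a real input device to run.

```swift
import AVFoundation

/// Hypothetical helper: builds an engine with Apple's voice processing enabled
/// and hands every captured buffer to `onBuffer`.
func startVoiceProcessedCapture(
    agcEnabled: Bool,
    onBuffer: @escaping (AVAudioPCMBuffer) -> Void
) throws -> AVAudioEngine {
    let engine = AVAudioEngine()
    let input = engine.inputNode

    // Must be called while the engine is stopped; enables the AEC + NS + AGC bundle.
    try input.setVoiceProcessingEnabled(true)
    input.isVoiceProcessingAGCEnabled = agcEnabled

    // Keep the output path silent so nothing leaks back into the mic.
    engine.mainMixerNode.outputVolume = 0

    // Read the format after enabling voice processing, since enabling it
    // can change what the input node reports.
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        onBuffer(buffer)
    }

    engine.prepare()
    try engine.start()
    return engine
}
```

The caller keeps the returned engine alive for the duration of the recording and calls `stop()` on it when the hotkey is released.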

It worked perfectly on three of our four test machines.

The symptom

On the fourth machine, the client's own M1 Max MacBook Pro, the first three to six seconds of every recording were silently dropped. Same code, same Xcode build, same app. Same hotkey behaviour. The tail of each recording reached disk intact; the opening seconds were just gone.

The investigation

When you can't reproduce on your own machine, you start grasping at hypotheses. We had three.

Hypothesis 01

Hotkey latency.

Maybe the UI lag between keydown and engine-running was too long. We reordered the pipeline so the recording state flips on the keydown event before any async work, removed a redundant DispatchQueue.global().sync wrapper that was blocking the main thread for 50–200 ms per press, and pre-warmed CoreAudio at app launch.

Partial win, didn't fix it
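The reordering in Hypothesis 01 can be sketched like this. It is an illustration only: `HotkeyRecorder`, `startEngine`, and the queue label are hypothetical names. The point is that the visible state changes synchronously on keydown, while the slow setup is dispatched with `async` so the calling thread never waits on it.

```swift
import Foundation

/// Hypothetical recorder: flips state on keydown, then starts the engine off-thread.
final class HotkeyRecorder {
    private(set) var isRecording = false
    private let workQueue = DispatchQueue(label: "capture.setup") // hypothetical label

    func keyDown(startEngine: @escaping () -> Void) {
        // 1. Flip state first so the UI reflects the keypress immediately.
        isRecording = true
        // 2. Dispatch the slow setup asynchronously; a sync call here would
        //    block the caller until the engine is up.
        workQueue.async { startEngine() }
    }

    func keyUp() {
        isRecording = false
    }
}
```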

Hypothesis 02

AGC ramp-up.

Apple's automatic gain control ramps from near-zero on every engine start. If speech starts before AGC converges, first words land near silence and the transcriber drops them. We disabled AGC, kept AEC and NS enabled.

Partial win, didn't fix it

Hypothesis 03

AEC convergence.

AEC's adaptive filters need 500 ms–2 s to learn the room. We refactored the lifecycle to keep VPIO running between hotkey presses — on release, just remove the tap; on the next press, install a new tap on the still-running engine. AEC state stays converged across presses.

Did nothing on the broken machine

The first two helped on machines where VPIO already worked. The third didn't help on the broken machine at all. The first sample was still empty. Something deeper was wrong.

So we did what we should have done sooner: added clear NSLog traces around the VPIO setup, reproduced the bug, and read the logs.

The clue: reading the CoreAudio logs

Here's what we saw on the client's machine, line by line:

CoreAudio · AUVoiceProcessor init

AUVoiceProcessor instantiated
AUHAL stream format error.
  auhalInBusInScopeASBD channel count: 6
  auhalOutBusOutScopeASBD channel count: 2
  mic channel count: 0
  ref channel count: 6
client-side input and output formats do not match (err=-10875)
AUVoiceProcessor initialize failure. err=-10875

VPIO wasn't slow on this machine. It wasn't running at all. It was failing to initialise on every single capture attempt, with err=-10875 — “client-side input and output formats do not match.” Each failure took roughly two seconds of wall time during which the audio path was blocked and nothing was captured. The first words were vanishing into the gap between “I tried to start VPIO” and “VPIO told me no.”

The first error in the chain isn't -10875, though. It's -10879 (kAudioUnitErr_InvalidProperty), emitted earlier, and it points at the real mismatch: VPIO expects, in its internal accounting, a 6-channel input stream and a 2-channel output stream, paired into a single aggregate device. The reference channel count it logged was 6.
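When a bare OSStatus turns up in a log like this, a small lookup helps put a name to it. The sketch below covers a handful of AudioUnit result codes as declared in AUComponent.h; `describeAudioUnitStatus` is a hypothetical helper name and the table is deliberately incomplete.

```swift
import Foundation

/// A few AudioUnit result codes from AUComponent.h, keyed by raw OSStatus value.
let audioUnitErrorNames: [Int32: String] = [
    -10879: "kAudioUnitErr_InvalidProperty",
    -10878: "kAudioUnitErr_InvalidParameter",
    -10877: "kAudioUnitErr_InvalidElement",
    -10876: "kAudioUnitErr_NoConnection",
    -10875: "kAudioUnitErr_FailedInitialization",
    -10868: "kAudioUnitErr_FormatNotSupported",
    -10851: "kAudioUnitErr_InvalidPropertyValue",
]

/// Hypothetical helper: turns a raw status into something greppable.
func describeAudioUnitStatus(_ status: Int32) -> String {
    if status == 0 { return "noErr" }
    return audioUnitErrorNames[status] ?? "unknown OSStatus \(status)"
}
```

Logging `describeAudioUnitStatus(status)` next to the raw number makes it easier to spot which failure came first in a chain.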

Six channels. On a microphone that's physically mono.

Spatial Audio happened

The MacBook Pro 14" and 16" with M1 Pro/Max have built-in speakers that, since macOS Monterey, expose themselves as 5.1 surround when Spatial Audio is enabled. From the user's perspective this is a quality-of-life feature for movies and music. From CoreAudio's perspective, it changes the speaker device's reported channel count from 2 to 6, and that change ripples down into every API that touches the output device.

VPIO touches the output device. Specifically, it builds an internal aggregate device that pairs the input (your mic) with the output (your speakers) so its echo cancellation can subtract speaker output from mic input. The architecture assumes the output is stereo; the AEC expects a 2-channel reference signal. On a 6-channel output device, the aggregate-device construction can't produce a format pairing that satisfies its own internal invariants. The init bails with -10875.

It's a real bug in Apple's pipeline. It also doesn't matter, because we shouldn't have been using it.

The architectural realisation

Here's the thing we should have noticed three weeks earlier: for a dictation app, AEC has nothing to cancel.

AEC removes the sound of your own speakers from the mic signal. It's what stops your voice coming back through your friend's mic on a Zoom call. It's essential for two-way audio.

Dictation is not two-way audio. There is no playback. The speakers are silent during recording. There is no echo to cancel. AEC is a free pass-through doing nothing.

But AEC is also the only reason VPIO's pipeline depends on the output device at all. NS is a pure mic-side operation. AGC is a pure mic-side operation. AEC is what makes VPIO build an aggregate device, demand a stereo reference, and fall over on Spatial Audio Macs, Bluetooth headsets, and aggregate output configurations.

The right thing wasn't to make AEC work on Spatial Audio. The right thing was to throw AEC away.

Apple's API doesn't let you. setVoiceProcessingEnabled(true) enables the bundle. There is no setNoiseSuppressionEnabled(true) independent of it. So we needed a noise suppressor that was not Apple's.

VPIO vs RNNoise — the decision

Property	Apple VPIO	RNNoise
Includes AEC	Yes — cannot disable	No — pure NS
Output-device dependency	Builds aggregate device, needs 2-ch reference	None — mic only
Spatial Audio behaviour	Init fails with err=-10875	Unaffected
Bluetooth / aggregate output	Intermittent failures	Unaffected
CPU cost (Apple Silicon)	Hardware-accelerated, near-zero	~1% of one core, single channel
Per-device code paths	Detection, fallback, probing required	None
Licence / cost	Apple SDK, free	BSD, free, vendorable

The fix — RNNoise

RNNoise is an open-source neural noise suppression library written by Jean-Marc Valin, the author of Speex and one of the principal authors of Opus, and hosted at xiph.org. It is BSD-licensed, written in pure C, ships a pre-trained model, and runs entirely locally with no network calls.

It does exactly one thing: take a 10 ms frame of 48 kHz mono audio, return a 10 ms frame with the noise gone. It does not touch the output device. It does not need a reference signal. It does not have an aggregate-device construction step. It does not fail differently on different machines.

The library is small enough to vendor directly into the app, which we did:

tree · vendored RNNoise

Swifly/Utilities/RNNoise/
├── rnnoise.h            (public API: 6 functions, ~3 KB)
├── denoise.c            (main processing)
├── rnn.c                (GRU inference)
├── kiss_fft.c           (FFT primitives)
├── pitch.c              (pitch estimation)
├── celt_lpc.c           (linear predictive coding)
├── nnet.c               (neural net inference)
├── rnnoise_data.c       (pre-trained model weights, quantised)
├── vec_neon.h           (ARM NEON SIMD path)
└── …

~25 C files. The 8-bit-quantised model weights add ~30 MB to the binary. We pay that cost happily.

Integration into a Swift project means a bridging header:

c · Swifly-Bridging-Header.h

#include "Utilities/RNNoise/rnnoise.h"

Then update HEADER_SEARCH_PATHS so the C sources can find each other's headers, define RNNOISE_BUILD=1, and Xcode treats the C source as part of the target.

The Swift wrapper handles three details the raw C API leaves to the caller:

Frame slicing. RNNoise processes exactly 480 samples per call (10 ms at 48 kHz). We accumulate input and emit output in whole frames, carrying a residual to the next call; on capture stop we flush the final partial frame with zero padding.
Sample range. RNNoise's float * inputs are notionally ±32768, not the ±1.0 AVFoundation uses; the wrapper scales both ways.
Lifecycle. RAII via a Swift class with a deinit.

About 175 lines of Swift, including comments. The C library does the work.
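The frame slicing and range scaling can be sketched without the C library at all. The class below is a simplified stand-in, not the 175-line wrapper: `FrameSlicingDenoiser` and its `processFrame` closure are hypothetical, and in a real build the closure would forward each frame to `rnnoise_process_frame`. Trimming the zero padding off the flushed frame is a choice made for this sketch. It also allocates arrays freely, which a real render-thread implementation would avoid by using preallocated buffers.

```swift
import Foundation

/// Hypothetical wrapper: slices arbitrary-length input into 480-sample frames
/// and scales between AVFoundation's ±1.0 range and RNNoise's ±32768 range.
final class FrameSlicingDenoiser {
    static let frameSize = 480          // 10 ms at 48 kHz
    static let scale: Float = 32768

    private var residual: [Float] = []
    private let processFrame: ([Float]) -> [Float]

    /// `processFrame` receives and returns exactly `frameSize` samples in ±32768.
    init(processFrame: @escaping ([Float]) -> [Float]) {
        self.processFrame = processFrame
    }

    /// Emits only whole frames; leftover samples wait for the next call.
    func process(_ samples: [Float]) -> [Float] {
        residual.append(contentsOf: samples)
        var output: [Float] = []
        var offset = 0
        while residual.count - offset >= Self.frameSize {
            let frame = residual[offset..<offset + Self.frameSize].map { $0 * Self.scale }
            output.append(contentsOf: processFrame(frame).map { $0 / Self.scale })
            offset += Self.frameSize
        }
        residual.removeFirst(offset)
        return output
    }

    /// Zero-pads the final partial frame, processes it, and returns the real samples.
    func flush() -> [Float] {
        guard !residual.isEmpty else { return [] }
        let valid = residual.count
        let padded = residual + [Float](repeating: 0, count: Self.frameSize - valid)
        residual.removeAll()
        let processed = processFrame(padded.map { $0 * Self.scale })
        return Array(processed.map { $0 / Self.scale }.prefix(valid))
    }
}
```

Feeding 1,000 samples through `process` yields 960 samples (two frames) and holds 40 back; `flush` then returns those last 40.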

The new pipeline

Everything runs per-buffer on the CoreAudio render thread. No conditional branches, no per-device heuristics, no fallbacks, no caching, no probing, no warm-engine preservation, no transport-type detection. The same code runs identically on built-in mics, USB mics, Bluetooth headsets, Thunderbolt audio interfaces, AirPods, surround output, mono output, no output.

Audio capture pipeline

HAL Audio Unit

device-native PCM (any rate, any channel count)

↓

AVAudioConverter

48 kHz mono Float32 (RNNoise's required input)

↓

RNNoiseDenoiser

denoised 48 kHz mono Float32 · ~1% of one CPU core

↓

AVAudioConverter

16 kHz mono Float32 (transcriber's preferred rate)

↓

AVAudioFile — AAC encoder

MPEG-4 AAC, 16 kHz mono, 32 kbps CBR → audio.mp4
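Each AVAudioConverter stage in the diagram is a format change on a PCM buffer. A one-shot version might look like the sketch below; `convert(_:to:)` and `ResampleError` are hypothetical names, and this is not the app's code. A streaming pipeline would keep one converter alive across buffers so the resampler's filter state carries over, rather than creating a converter per call and signalling end-of-stream as this does.

```swift
import AVFoundation

enum ResampleError: Error { case cannotCreateConverter, cannotAllocateBuffer }

/// Hypothetical one-shot conversion of a whole buffer to another PCM format.
func convert(_ input: AVAudioPCMBuffer, to outputFormat: AVAudioFormat) throws -> AVAudioPCMBuffer {
    guard let converter = AVAudioConverter(from: input.format, to: outputFormat) else {
        throw ResampleError.cannotCreateConverter
    }
    // Size the output for the rate ratio, with a little headroom.
    let ratio = outputFormat.sampleRate / input.format.sampleRate
    let capacity = AVAudioFrameCount((Double(input.frameLength) * ratio).rounded(.up)) + 32
    guard let output = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: capacity) else {
        throw ResampleError.cannotAllocateBuffer
    }

    var delivered = false
    var conversionError: NSError?
    let status = converter.convert(to: output, error: &conversionError) { _, inputStatus in
        // Hand over the input once, then report end of stream so the converter flushes.
        if delivered {
            inputStatus.pointee = .endOfStream
            return nil
        }
        delivered = true
        inputStatus.pointee = .haveData
        return input
    }
    if status == .error, let error = conversionError { throw error }
    return output
}
```

The same function covers both hops in the diagram: device-native to 48 kHz mono Float32 before the denoiser, and 48 kHz to 16 kHz after it.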

The numbers

1,420 → 798 lines of capture code. Most deletions were fallback scaffolding.
~1% CPU on the RNNoise hot path. One core, Apple Silicon, single channel.
4× smaller wire payload. 16 kHz mono AAC @ 32 kbps CBR.
36 h total engineering effort, ~12 h of which was disproving VPIO.

What shipped

A runtime toggle in the Settings UI lets anyone (us, the client, end users) A/B the same phrase with RNNoise on and off. The same UserDefaults flag is the escape hatch: if RNNoise ever misbehaves in production for some specific setup, flip the toggle and the rest of the pipeline (HAL capture, format conversion, encoding) carries on unchanged. Graceful degradation comes for free with the architecture.
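A toggle like this can be very small. In the sketch below the defaults key and the function name are hypothetical (the real key is not given here), and `denoise` stands in for whatever wraps RNNoise. When the flag is off the samples pass through untouched, so the stages on either side never need to know.

```swift
import Foundation

/// Hypothetical defaults key; the real one is not named in this write-up.
let noiseSuppressionKey = "noiseSuppressionEnabled"

/// Runs `denoise` only when the flag is on; otherwise passes samples through.
func applyNoiseSuppression(
    to samples: [Float],
    defaults: UserDefaults = .standard,
    denoise: ([Float]) -> [Float]
) -> [Float] {
    // bool(forKey:) returns false for a missing key, so an app that wants the
    // denoiser on by default would register that default at launch.
    return defaults.bool(forKey: noiseSuppressionKey) ? denoise(samples) : samples
}
```

Because the bypass returns the input unchanged, an A/B comparison only differs in the denoiser stage.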

Smaller fixes we shipped alongside

Recording-start timestamp now comes from the keydown event itself, not from when the audio engine finishes initialising. Fixed a "recording too short for transcription" error that users were seeing when they spoke quickly during the old VPIO setup delay.
Status icon in the menu bar flips to "recording" on keydown instead of after the engine starts. UI feels instantaneous because it is.
On device changes during recording (plugging or unplugging AirPods, docking or undocking a USB-C dock), the HAL unit is rebuilt on the new device while the audio file stays open across the restart. The recording continues, transparently.
Settings dialogs no longer break under macOS Light appearance — buttons, dividers, icons keep their styling regardless of system theme.

Could you be over-engineering AEC?

Quick diagnostic for any macOS or iOS voice-capture project. Tick each that applies:

Self-audit

If you tick 3+, you're probably paying AEC's cost for nothing.

My app only ever captures audio; it doesn't simultaneously play audio back to the same user during the same session.
My output device is muted or unused while the mic is active (dictation, voice notes, push-to-talk, STT input for an agent).
I'm using setVoiceProcessingEnabled(true) or an equivalent bundle that includes AEC.
I've debugged at least one weird CoreAudio init failure tied to channel counts, aggregate devices, or output-device configuration.
My capture code has per-device branches, fallbacks, or warm-engine preservation hacks I can't fully justify.

Takeaways

Default APIs aren't always the right answer for your use case.

Apple's VPIO is the canonical answer for “I want clean mic audio on macOS.” It's also designed for two-way audio. If you're not doing two-way audio, you're paying for a feature you don't use — and that feature comes with constraints (output-device dependency, stereo reference, aggregate device) you also don't want.

Read your platform's logs.

macOS's CoreAudio is verbose. The auvp and AVFAudio subsystems log every step of VPIO setup. Two of the three weeks we lost would have been one week if we had read those logs on day one instead of reasoning from the outside.

Open-source audio infrastructure is excellent and underused.

xiph.org has been quietly producing world-class audio code for two decades. Opus, Vorbis, RNNoise, FLAC, Theora — all run inside products you use every day. When you find yourself fighting a vendor API, check whether the OSS ecosystem has already solved your specific problem cleanly. In our case it had.

