A macOS dictation app, an M1 Max with Spatial Audio enabled, three to six seconds of every recording silently gone. Two wrong hypotheses, one half-right one, then a CoreAudio log told us we'd been engineering around a feature we didn't need.
We've been working on a macOS dictation app called Swifly. The product loop is short: hold a global hotkey, speak, release the hotkey, get clean text back from a Whisper-style speech-to-text backend, inserted at the cursor.
When you build this kind of app, microphone audio quality matters more than you'd expect. Real users record in the wild — keyboard clatter, fans humming, café espresso machines, kids in the background, a 70 dB HVAC unit two metres away. Speech recognition is robust, but it's not magic. Noisy input means dropped words, hallucinated phrases, weird capitalisation. Cleaning the audio before it reaches the model is one of the highest-leverage things you can do for perceived accuracy of the whole pipeline.
So when our client asked us to ship “on-device noise suppression” as part of the next release, we said yes. The plan sounded uncontroversial: use Apple's official voice-processing pipeline, ship in a couple of days, move on.
It took two weeks. We threw away most of the work. We ended up with a much better solution.
Apple ships a single public API for voice cleanup on macOS, called VoiceProcessingIO — VPIO for short. You enable it on an AVAudioEngine input node with one line:
try engine.inputNode.setVoiceProcessingEnabled(true)
Behind that flag, three things happen at once: AEC (Acoustic Echo Cancellation, removes speaker bleed into the mic), NS (Noise Suppression, removes background noise), AGC (Automatic Gain Control, levels out volume). This is the same pipeline Apple uses for FaceTime. It's the canonical answer on every Apple Developer thread and WWDC session about microphone capture.
We built it. Wired up the engine, set the device, configured AGC, disabled the speaker output volume to keep audio from leaking back, added a tap on the input node, started the engine. Standard stuff.
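A minimal sketch of that setup, assuming macOS 10.15 or later and the system default input device. The function name `startVoiceProcessedCapture`, the `onBuffer` callback, and the 1024-frame tap size are illustrative choices rather than Swifly's actual code, and explicit device selection is left out.

```swift
import AVFoundation

/// Sketch: configure an AVAudioEngine for voice-processed capture.
/// `onBuffer` is a hypothetical callback that receives each captured buffer.
func startVoiceProcessedCapture(
    onBuffer: @escaping (AVAudioPCMBuffer) -> Void
) throws -> AVAudioEngine {
    let engine = AVAudioEngine()
    let input = engine.inputNode

    // Enables the AEC + NS + AGC bundle. Do this before starting the engine.
    try input.setVoiceProcessingEnabled(true)
    input.isVoiceProcessingAGCEnabled = true

    // Keep anything the engine renders from reaching the speakers.
    engine.mainMixerNode.outputVolume = 0

    // Tap the processed mic signal in the input node's own output format.
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        onBuffer(buffer)
    }

    engine.prepare()
    try engine.start()
    return engine
}
```

The caller has to keep the returned engine alive for as long as capture should continue; dropping the reference stops the audio.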
It worked perfectly on three of our four test machines.
On the fourth machine — the client's own M1 Max MacBook Pro — the first three to six seconds of every recording were silently dropped. Same code, same Xcode build, same app. Same hotkey behaviour. Same waveform written to disk for the last N seconds of speech. The first N seconds were just gone.
When you can't reproduce on your own machine, you start grasping at hypotheses. We had three.
Maybe the UI lag between keydown and engine-running was too long. We reordered the pipeline so the recording state flips on the keydown event before any async work, removed a redundant DispatchQueue.global().sync wrapper that was blocking the main thread 50–200 ms per press, pre-warmed CoreAudio at app launch.
Result: partial win, didn't fix it.
Apple's automatic gain control ramps from near-zero on every engine start. If speech starts before AGC converges, first words land near silence and the transcriber drops them. We disabled AGC, kept AEC and NS enabled.
Result: partial win, didn't fix it.
AEC's adaptive filters need 500 ms–2 s to learn the room. We refactored the lifecycle to keep VPIO running between hotkey presses: on release, just remove the tap; on the next press, install a new tap on the still-running engine. AEC state stays converged across presses.
Result: did nothing on the broken machine.
The first two helped on machines where VPIO already worked. The third didn't help on the broken machine at all. The first sample was still empty. Something deeper was wrong.
So we did what we should have done sooner: added clear NSLog traces around the VPIO setup, reproduced the bug, and read the logs.
Here's what we saw on the client's machine, line by line:
AUVoiceProcessor instantiated
AUHAL stream format error.
auhalInBusInScopeASBD channel count: 6
auhalOutBusOutScopeASBD channel count: 2
mic channel count: 0
ref channel count: 6
client-side input and output formats do not match (err=-10875)
AUVoiceProcessor initialize failure. err=-10875
VPIO wasn't slow on this machine. It wasn't running at all. It was failing to initialise on every single capture attempt, with err=-10875 — “client-side input and output formats do not match.” Each failure took roughly two seconds of wall time during which the audio path was blocked and nothing was captured. The first words were vanishing into the gap between “I tried to start VPIO” and “VPIO told me no.”
The first error in the chain isn't actually -10875. It's -10879 (kAudioUnitErr_InvalidProperty), emitted earlier. It points at the underlying mismatch: VPIO expects, in its internal accounting, a 6-channel input stream and a 2-channel output stream, paired into a single aggregate device. The reference channel count it logged was 6.
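When reading logs like these it helps to turn raw status numbers into their symbolic names. The sketch below is one way to do that; the numeric values are the ones declared in AudioToolbox's AUComponent.h, while the helper name `describeAudioUnitStatus` and the choice of which codes to list are illustrative.

```swift
// Symbolic names for a few AudioUnit OSStatus codes (AUComponent.h).
// Only a handful of codes are listed; extend the table as needed.
let audioUnitErrorNames: [Int32: String] = [
    -10879: "kAudioUnitErr_InvalidProperty",
    -10878: "kAudioUnitErr_InvalidParameter",
    -10875: "kAudioUnitErr_FailedInitialization",
    -10868: "kAudioUnitErr_FormatNotSupported",
    -10851: "kAudioUnitErr_InvalidPropertyValue",
]

/// Hypothetical helper: render a raw status as "name (code)" for log output.
func describeAudioUnitStatus(_ status: Int32) -> String {
    if status == 0 { return "noErr" }
    let name = audioUnitErrorNames[status] ?? "unknown AudioUnit error"
    return "\(name) (\(status))"
}

print(describeAudioUnitStatus(-10875))  // kAudioUnitErr_FailedInitialization (-10875)
```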
Six channels. On a microphone that's physically mono.
The MacBook Pro 14" and 16" with M1 Pro/Max have built-in speakers that, since macOS Monterey, expose themselves as 5.1 surround when Spatial Audio is enabled. From the user's perspective this is a quality-of-life feature for movies and music. From CoreAudio's perspective, it changes the speaker device's reported channel count from 2 to 6, and that change ripples down into every API that touches the output device.
VPIO touches the output device. Specifically, it builds an internal aggregate device that pairs the input (your mic) with the output (your speakers) so its echo cancellation can subtract speaker output from mic input. The architecture assumes the output is stereo; the AEC expects a 2-channel reference signal. On a 6-channel output device, the aggregate-device construction can't produce a format pairing that satisfies its own internal invariants. The init bails with -10875.
It's a real bug in Apple's pipeline. It also doesn't matter, because we shouldn't have been using it.
Here's the thing we should have noticed three weeks earlier: for a dictation app, AEC has nothing to cancel.
AEC removes the sound of your own speakers from the mic signal. It's what stops your voice coming back through your friend's mic on a Zoom call. It's essential for two-way audio.
Dictation is not two-way audio. There is no playback. The speakers are silent during recording. There is no echo to cancel. AEC is a free pass-through doing nothing.
But AEC is also the only reason VPIO's pipeline depends on the output device at all. NS is a pure mic-side operation. AGC is a pure mic-side operation. AEC is what makes VPIO build an aggregate device, demand a stereo reference, and fall over on Spatial Audio Macs, Bluetooth headsets, and aggregate output configurations.
The right thing wasn't to make AEC work on Spatial Audio. The right thing was to throw AEC away.
Apple's API doesn't let you. setVoiceProcessingEnabled(true) enables the bundle. There is no setNoiseSuppressionEnabled(true) independent of it. So we needed a noise suppressor that was not Apple's.
RNNoise is an open-source neural noise suppression library from xiph.org, written by Jean-Marc Valin, the principal author of Opus and the creator of Speex. It is BSD-licensed, pure C, ships a pre-trained model, and runs entirely locally with no network calls.
It does exactly one thing: take a 10 ms frame of 48 kHz mono audio, return a 10 ms frame with the noise gone. It does not touch the output device. It does not need a reference signal. It does not have an aggregate-device construction step. It does not fail differently on different machines.
The library is small enough to vendor directly into the app, which we did:
Swifly/Utilities/RNNoise/
├── rnnoise.h        (public API: 6 functions, ~3 KB)
├── denoise.c        (main processing)
├── rnn.c            (GRU inference)
├── kiss_fft.c       (FFT primitives)
├── pitch.c          (pitch estimation)
├── celt_lpc.c       (linear predictive coding)
├── nnet.c           (neural net inference)
├── rnnoise_data.c   (pre-trained model weights, quantised)
├── vec_neon.h       (ARM NEON SIMD path)
└── …
~25 C files. The 8-bit-quantised model weights add ~30 MB to the binary. We pay that cost happily.
Integration into a Swift project means a bridging header:
#include "Utilities/RNNoise/rnnoise.h"
Then add the RNNoise directory to HEADER_SEARCH_PATHS so the C files can resolve each other's includes, define RNNOISE_BUILD=1, and Xcode compiles the C sources as part of the target.
The Swift wrapper handles three details the raw C API leaves to the caller: frame slicing (RNNoise processes exactly 480 samples per call — 10 ms at 48 kHz — and we accumulate input and emit output in whole frames, carrying a residual to the next call; on capture stop we flush the final partial frame with zero padding), sample range (RNNoise's float * inputs are notionally ±32768, not the ±1.0 AVFoundation uses; the wrapper scales both ways), and lifecycle (RAII via a Swift class with a deinit). About 175 lines of Swift, including comments. The C library does the work.
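The frame slicing and sample scaling described above can be sketched as follows. This is not Swifly's wrapper: the class name `FrameDenoiser` is hypothetical, and the `denoiseFrame` closure is a stand-in for the C call so the example runs without the library linked (a real wrapper would forward each frame to `rnnoise_process_frame`). Trimming the flushed frame back to the number of real samples is also an assumption; the article only says the last partial frame is zero-padded.

```swift
/// Accumulates arbitrary-sized buffers and feeds a denoiser whole 480-sample frames.
final class FrameDenoiser {
    static let frameSize = 480           // 10 ms at 48 kHz
    static let pcmScale: Float = 32768   // AVFoundation ±1.0 <-> RNNoise ±32768

    private var residual: [Float] = []   // samples not yet forming a full frame
    private let denoiseFrame: ([Float]) -> [Float]

    init(denoiseFrame: @escaping ([Float]) -> [Float]) {
        self.denoiseFrame = denoiseFrame
    }

    /// Takes any number of ±1.0 samples; returns only whole processed frames.
    func process(_ samples: [Float]) -> [Float] {
        residual.append(contentsOf: samples)
        var output: [Float] = []
        var start = 0
        while residual.count - start >= FrameDenoiser.frameSize {
            let end = start + FrameDenoiser.frameSize
            let frame = residual[start..<end].map { $0 * FrameDenoiser.pcmScale }
            output.append(contentsOf: denoiseFrame(frame).map { $0 / FrameDenoiser.pcmScale })
            start = end
        }
        residual.removeFirst(start)      // carry the remainder to the next call
        return output
    }

    /// On capture stop: zero-pad the last partial frame, process it,
    /// and return only as many samples as were real input (an assumption).
    func flush() -> [Float] {
        guard !residual.isEmpty else { return [] }
        let realCount = residual.count
        let padding = [Float](repeating: 0, count: FrameDenoiser.frameSize - realCount)
        let padded = (residual + padding).map { $0 * FrameDenoiser.pcmScale }
        residual.removeAll()
        return denoiseFrame(padded).prefix(realCount).map { $0 / FrameDenoiser.pcmScale }
    }
}

// Usage with an identity stand-in: 1000 samples in, two whole frames out, 40 carried over.
let denoiser = FrameDenoiser { frame in frame }
let first = denoiser.process([Float](repeating: 0.5, count: 1000))
let tail = denoiser.flush()
print(first.count, tail.count)  // 960 40
```

The sketch allocates arrays on every call for readability. Code that really runs on the render thread would more likely preallocate its buffers once and reuse them.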
Everything runs per-buffer on the CoreAudio render thread. No conditional branches, no per-device heuristics, no fallbacks, no caching, no probing, no warm-engine preservation, no transport-type detection. The same code runs identically on built-in mics, USB mics, Bluetooth headsets, Thunderbolt audio interfaces, AirPods, surround output, mono output, no output.
1,420 → 798 lines of capture code. Most deletions were fallback scaffolding.
~1% CPU on the RNNoise hot path. One core, Apple Silicon, single channel.
4× smaller wire payload. 16 kHz mono AAC @ 32 kbps CBR.
36 h total engineering effort, ~12 h of which was disproving VPIO.
A runtime toggle in the Settings UI lets anyone (us, the client, end users) A/B the same phrase with RNNoise on and off. Same UserDefaults flag is the escape hatch — if RNNoise ever misbehaves in production for some specific setup, flip the toggle and the rest of the pipeline (HAL capture, format conversion, encoding) carries on unchanged. Graceful degradation comes for free with the architecture.
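One way such a toggle could be wired up, as a sketch. The defaults key `noiseSuppressionEnabled` and both function names are hypothetical, since the article doesn't name the real ones. The flag is read once when capture starts and passed in as a plain Bool, so the per-buffer path never touches UserDefaults.

```swift
import Foundation

/// Hypothetical defaults key; the real key name is not given in the article.
let denoiseEnabledKey = "noiseSuppressionEnabled"

/// Reads the toggle, treating "never set" as enabled.
func isNoiseSuppressionEnabled(in defaults: UserDefaults = .standard) -> Bool {
    // Registered defaults apply only when the user has not stored a value.
    defaults.register(defaults: [denoiseEnabledKey: true])
    return defaults.bool(forKey: denoiseEnabledKey)
}

/// Runs the denoiser when enabled; otherwise passes the samples through
/// untouched so the rest of the pipeline carries on unchanged.
func applyNoiseSuppression(
    to samples: [Float],
    enabled: Bool,
    denoise: ([Float]) -> [Float]
) -> [Float] {
    return enabled ? denoise(samples) : samples
}
```

With the toggle off, `applyNoiseSuppression` returns its input unchanged, which is what makes the same recording easy to A/B.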
01. Default APIs aren't always the right answer for your use case.
Apple's VPIO is the canonical answer for “I want clean mic audio on macOS.” It's also designed for two-way audio. If you're not doing two-way audio, you're paying for a feature you don't use — and that feature comes with constraints (output-device dependency, stereo reference, aggregate device) you also don't want.
02. Read your platform's logs.
macOS's CoreAudio is verbose. The auvp and AVFAudio subsystems log every step of VPIO setup. Two of the three weeks we lost would have been one week if we had read those logs on day one instead of reasoning from the outside.
03. Open-source audio infrastructure is excellent and underused.
xiph.org has been quietly producing world-class audio code for two decades. Opus, Vorbis, RNNoise, FLAC, Theora — all run inside products you use every day. When you find yourself fighting a vendor API, check whether the OSS ecosystem has already solved your specific problem cleanly. In our case it had.