
Key takeaways
• Silence trimming = find quiet regions, cut them, stitch the rest. On iOS the clean path is AVAudioFile + AVAudioPCMBuffer to detect silence (AVAudioEngine only for the real-time variant), and AVAssetExportSession with an AVMutableComposition to render the trimmed file.
• Get the detection thresholds right or users hate it. “Silence” is not absolute zero — ambient noise sits around -60 to -50 dBFS, conversational pauses at -45 to -30 dBFS. Target an adjustable threshold with 250–400 ms minimum silence length and 50–100 ms padding around each cut.
• Trim in a background task, not on the main thread. A 30-minute voice memo can take 10–30 s to analyse on-device; run the scan off-main, report progress via Progress, and keep the UI responsive.
• Trimmed files shrink 30–55% on talk-heavy content. That is the range we see on the podcast, voice-note, and language-learning recordings we have shipped. Aggressive thresholds reach 60% but start to clip consonants, so tune per domain.
• Fora Soft has shipped this in production. We run silence-trim pipelines for Input Logger, Speakk, and VocalViews. See § Mini case.
Why Fora Soft wrote this playbook
We have been building audio and video products since 2005 — custom audio-processing, language-learning apps, voice-note platforms, market-research recorders. Silence trimming is one of those features that sounds trivial until you implement it naively, ship it, and users start complaining that their sentences get clipped or that a 12-minute memo took 45 seconds to process.
This playbook captures the iOS implementation we teach new engineers on our audio-product team: the detection algorithm, the render pipeline, the thresholds that ship well across ambient-noise conditions, and the on-device vs cloud trade-off. It is written for Swift 6 / iOS 17+ with a fallback note for iOS 15+.
What silence trimming actually does
The algorithm has three phases:
1. Scan. Walk the audio in short windows (10–50 ms) and compute the RMS or peak amplitude of each window. Convert to dBFS so your thresholds make sense across different mic gains.
2. Decide. A window is “silent” if its level is under a threshold (typically -40 to -30 dBFS) for at least a minimum run length (300–500 ms). Short dips inside speech don’t count — we don’t want to cut the pause between words.
3. Render. Stitch the non-silent segments back together, applying a fade of 20–50 ms at each cut to avoid audible clicks, and export the result. On iOS, that is an AVMutableComposition with each non-silent segment inserted in order, plus AVMutableAudioMix for the fade volume ramps, then AVAssetExportSession.
Everything after this section is a concrete implementation of those three phases plus the UX, performance, and testing choices that make the feature production-grade.
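To make the dBFS figures concrete, here is a minimal sketch of the scan-phase arithmetic on a plain array of Float samples normalised to -1...1. The helper names `rms` and `dbFS` are illustrative and not part of any Apple API.

```swift
import Foundation

/// Root-mean-square of samples normalised to -1...1 (hypothetical helper).
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Converts a linear level to dBFS: 1.0 maps to 0 dBFS, 0 maps to -infinity.
func dbFS(_ level: Float) -> Float {
    level > 0 ? 20 * log10(level) : -.infinity
}

let quiet = [Float](repeating: 0.01, count: 960) // constant 1% of full scale
let loud = [Float](repeating: 0.5, count: 960)   // constant 50% of full scale
print(dbFS(rms(quiet))) // about -40 dBFS
print(dbFS(rms(loud)))  // about -6 dBFS
```

Read this way, a -40 dBFS threshold means "quieter than roughly 1% of full scale", and every halving of level costs about 6 dB.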
Where silence trimming earns its keep
Four product categories get an outsized return from shipping this feature:
1. Voice notes and voice-first messaging. Trim a 90-second rambling voice note down to 40 seconds and send it. Receivers appreciate the tighter playback; senders get a passive “edit” for free.
2. Podcast and audiobook authoring. Creators record long takes with think-pauses, coffee sips, and re-starts. Silence trim is the first pass of any mobile podcast editor.
3. Language-learning and speech-therapy apps. Users record answers to prompts. Trimming the hesitation before the answer makes speech-recognition scoring dramatically more accurate.
4. Market research and qualitative video research. Hours of interview footage reduced by 30%+ without losing content; analyst review time drops proportionally.
Architecture — detection and render layers
On iOS the pipeline splits cleanly into a detection layer and a render layer. Keep them decoupled so you can swap in an ML-based voice-activity detector later without touching the export code.
| Layer | APIs | Output |
|---|---|---|
| Decode & scan | AVAudioFile + AVAudioPCMBuffer | Per-window RMS in dBFS |
| Threshold & segment | Pure Swift | Array of CMTimeRange to keep |
| Compose | AVMutableComposition + AVMutableAudioMix | In-memory composition with fades |
| Export | AVAssetExportSession | .m4a / .mov file on disk |
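One way to keep the two layers decoupled is to hide detection behind a small protocol. The sketch below is an assumption about how that seam could look, not the shipped design: `SilenceDetector`, `WindowDecision`, and `ThresholdDetector` are hypothetical names, and a VAD-backed detector would likely consume PCM rather than precomputed levels.

```swift
import Foundation

/// One analysis window: start time in seconds and the detector's verdict.
struct WindowDecision: Equatable {
    let time: TimeInterval
    let isSilent: Bool
}

/// Hypothetical seam between the detection and render layers.
protocol SilenceDetector {
    func classify(levels: [(time: TimeInterval, db: Float)]) -> [WindowDecision]
}

/// Amplitude-based detector: silent when the level is under a fixed dBFS threshold.
struct ThresholdDetector: SilenceDetector {
    var thresholdDb: Float = -40

    func classify(levels: [(time: TimeInterval, db: Float)]) -> [WindowDecision] {
        levels.map { WindowDecision(time: $0.time, isSilent: $0.db < thresholdDb) }
    }
}

/// Downstream code depends only on the protocol, so the detector can be replaced later.
func silentWindowCount(using detector: SilenceDetector,
                       levels: [(time: TimeInterval, db: Float)]) -> Int {
    detector.classify(levels: levels).filter(\.isSilent).count
}

let demoLevels: [(time: TimeInterval, db: Float)] = [(0.00, -62), (0.02, -18), (0.04, -45)]
print(silentWindowCount(using: ThresholdDetector(), levels: demoLevels)) // 2
```

Because the segmenter and exporter see only `WindowDecision` values, replacing `ThresholdDetector` with another conforming type does not touch the render code.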
Detecting silence with AVAudioPCMBuffer
Open the source with AVAudioFile, read it in fixed-size frame blocks, compute RMS per window, and emit a timeline of (time, dBFS) samples.
import AVFoundation

struct LevelSample {
    let time: TimeInterval // seconds from start
    let db: Float          // dBFS; -Float.infinity for pure silence
}

/// Streams the file window by window and returns one RMS level per window.
func scanLevels(url: URL, windowSeconds: Double = 0.020) throws -> [LevelSample] {
    let file = try AVAudioFile(forReading: url)
    let format = file.processingFormat // deinterleaved Float32 by default
    let sampleRate = format.sampleRate
    let windowFrames = max(1, AVAudioFrameCount(sampleRate * windowSeconds))
    guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: windowFrames) else {
        throw NSError(domain: "scan", code: -1)
    }
    var samples: [LevelSample] = []
    while file.framePosition < file.length {
        // Derive the timestamp from the frame position so rounding errors do not accumulate.
        let start = Double(file.framePosition) / sampleRate
        let want = min(AVAudioFramePosition(windowFrames), file.length - file.framePosition)
        try file.read(into: buffer, frameCount: AVAudioFrameCount(want))
        guard buffer.frameLength > 0 else { break } // defensive: never spin on an empty read
        samples.append(LevelSample(time: start, db: rmsDbFS(buffer)))
    }
    return samples
}

/// RMS level of channel 0 in dBFS (0 dBFS = full scale). Also used by the real-time variant.
func rmsDbFS(_ buffer: AVAudioPCMBuffer) -> Float {
    guard let channel = buffer.floatChannelData?[0], buffer.frameLength > 0 else {
        return -.infinity
    }
    let n = Int(buffer.frameLength)
    var sum: Float = 0
    for i in 0..<n { sum += channel[i] * channel[i] }
    let rms = sqrtf(sum / Float(n))
    return rms > 0 ? 20 * log10f(rms) : -.infinity
}
A 20 ms window at 48 kHz is 960 frames — short enough to catch a fast consonant, long enough for a stable RMS estimate. For speech you can go up to 50 ms without hurting accuracy.
Segmenting — thresholds that ship well
With the per-window dBFS timeline in hand, segment into keep/drop regions. Four parameters matter; the defaults below survive most consumer recording conditions.
- Silence threshold: -40 dBFS. Lower (< -45) to keep more breath; higher (> -35) for aggressive trimming.
- Minimum silence duration: 300 ms. Shorter values clip natural word gaps; longer values miss real pauses.
- Padding around cuts: 80 ms before and after a cut to preserve the attack of the next word.
- Fade duration: 20–50 ms to avoid clicks. Short enough to be inaudible, long enough to mask any zero-crossing discontinuity.
import CoreMedia

struct Segment {
    let range: CMTimeRange
    let fadeIn: CMTime
    let fadeOut: CMTime
}

func segment(
    _ samples: [LevelSample],
    thresholdDb: Float = -40,
    minSilence: Double = 0.30,
    padding: Double = 0.08
) -> [Segment] {
    guard !samples.isEmpty else { return [] }

    // 1) mark each window silent / loud
    var silent = samples.map { $0.db < thresholdDb }

    // 2) relabel silent runs shorter than minSilence as 'loud'
    let windowDuration = samples.count > 1
        ? samples[1].time - samples[0].time : 0.020
    let minSilentWindows = Int((minSilence / windowDuration).rounded())
    var runStart: Int? = nil
    for i in 0..<silent.count {
        if silent[i] && runStart == nil { runStart = i }
        if !silent[i], let s = runStart {
            if i - s < minSilentWindows {
                for j in s..<i { silent[j] = false }
            }
            runStart = nil
        }
    }
    // A silent run that reaches the end of the file stays silent, so trailing silence is trimmed.

    // 3) build Segments from contiguous loud runs, with padding
    // Assumes 2 × padding < minSilence; otherwise neighbouring ranges overlap and must be merged.
    let totalDuration = samples[samples.count - 1].time + windowDuration // approximate end of file
    let fade = CMTime(seconds: 0.02, preferredTimescale: 48000)
    var segments: [Segment] = []
    var startIdx: Int? = nil
    for i in 0...silent.count {
        let isSilent = i == silent.count ? true : silent[i]
        if !isSilent && startIdx == nil { startIdx = i }
        if isSilent, let s = startIdx {
            // The loud run covers windows s..<i, so it ends where window i-1 ends.
            let start = max(0, samples[s].time - padding)
            let end = min(totalDuration, samples[i - 1].time + windowDuration + padding)
            let range = CMTimeRange(
                start: CMTime(seconds: start, preferredTimescale: 48000),
                duration: CMTime(seconds: end - start, preferredTimescale: 48000)
            )
            segments.append(Segment(range: range, fadeIn: fade, fadeOut: fade))
            startIdx = nil
        }
    }
    return segments
}
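If a user drags the minimum silence below twice the padding, padded keep ranges can overlap and the same audio would be inserted twice. A minimal sketch of a merge pass is shown below; it uses a hypothetical `KeepRange` in plain seconds instead of CMTimeRange so it stays dependency-free.

```swift
import Foundation

/// A keep range in seconds (hypothetical stand-in for CMTimeRange).
struct KeepRange: Equatable {
    var start: Double
    var end: Double
}

/// Merges ranges that overlap or touch after padding has been applied.
func mergeOverlapping(_ ranges: [KeepRange]) -> [KeepRange] {
    let sorted = ranges.sorted { $0.start < $1.start }
    var merged: [KeepRange] = []
    for range in sorted {
        if let last = merged.last, range.start <= last.end {
            // Overlap: extend the previous range instead of starting a new one.
            merged[merged.count - 1].end = max(last.end, range.end)
        } else {
            merged.append(range)
        }
    }
    return merged
}

let padded = [
    KeepRange(start: 0.0, end: 1.1),
    KeepRange(start: 1.0, end: 2.0),
    KeepRange(start: 3.0, end: 4.0),
]
let mergedRanges = mergeOverlapping(padded)
print(mergedRanges) // two ranges remain: 0.0...2.0 and 3.0...4.0
```

Running this pass before converting to CMTimeRange keeps the composition free of duplicated audio regardless of how the sliders are set.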
Rendering the trimmed file with AVMutableComposition
func exportTrimmed(
    sourceURL: URL,
    segments: [Segment],
    destinationURL: URL
) async throws {
    let asset = AVURLAsset(url: sourceURL)
    guard let audioTrack = try await asset.loadTracks(withMediaType: .audio).first else {
        throw NSError(domain: "export", code: -2) // source has no audio track
    }
    let composition = AVMutableComposition()
    guard let compAudio = composition.addMutableTrack(
        withMediaType: .audio,
        preferredTrackID: kCMPersistentTrackID_Invalid
    ) else { throw NSError(domain: "export", code: -3) }

    var cursor = CMTime.zero
    let audioMix = AVMutableAudioMix()
    let params = AVMutableAudioMixInputParameters(track: compAudio)
    for segment in segments {
        try compAudio.insertTimeRange(segment.range, of: audioTrack, at: cursor)
        let segmentEnd = CMTimeAdd(cursor, segment.range.duration)
        // Fade in at the start and fade out at the end of each segment.
        // Cap each fade at half the segment so the two ramps never overlap.
        let half = CMTimeMultiplyByRatio(segment.range.duration, multiplier: 1, divisor: 2)
        let fadeIn = CMTimeMinimum(segment.fadeIn, half)
        let fadeOut = CMTimeMinimum(segment.fadeOut, half)
        params.setVolumeRamp(
            fromStartVolume: 0, toEndVolume: 1,
            timeRange: CMTimeRange(start: cursor, duration: fadeIn)
        )
        params.setVolumeRamp(
            fromStartVolume: 1, toEndVolume: 0,
            timeRange: CMTimeRange(start: CMTimeSubtract(segmentEnd, fadeOut), duration: fadeOut)
        )
        cursor = segmentEnd
    }
    audioMix.inputParameters = [params]

    guard let export = AVAssetExportSession(
        asset: composition, presetName: AVAssetExportPresetAppleM4A
    ) else { throw NSError(domain: "export", code: -1) }
    // The export fails if a file already exists at the destination.
    try? FileManager.default.removeItem(at: destinationURL)
    export.outputURL = destinationURL
    export.outputFileType = .m4a
    export.audioMix = audioMix
    // export() does not throw; failures surface through status and error.
    await export.export()
    guard export.status == .completed else {
        throw export.error ?? NSError(domain: "export", code: -4)
    }
}
For video with audio, swap AVAssetExportPresetAppleM4A for AVAssetExportPresetHighestQuality, add a parallel video track to the composition using the same CMTimeRanges, and set outputFileType = .mov.
Progress, cancellation, and memory in long recordings
A 90-minute interview generates about 270,000 detection windows at 20 ms. That is fine for memory on an iPhone 15 but you have to stream it, not load the whole file into RAM. Three rules:
1. Read and scan incrementally. Keep the AVAudioFile open and iterate windows; never load the whole decoded PCM into a single buffer.
2. Run the scan on a utility queue. Report progress to the UI via Progress or an AsyncStream.
3. Expose cancellation. Use Swift concurrency: Task with try Task.checkCancellation() inside the loop so the user can abort a long export.
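The three rules above can be sketched without any AVFoundation dependency. In this hedged example the windows are already split into arrays, and `scanWindows` and `onProgress` are hypothetical names; in an app the callback could feed a Progress object or an AsyncStream continuation.

```swift
import Foundation

/// Scans pre-split windows, reporting progress and honouring task cancellation.
func scanWindows(
    _ windows: [[Float]],
    onProgress: (Double) -> Void
) throws -> [Float] {
    var levels: [Float] = []
    levels.reserveCapacity(windows.count)
    for (index, window) in windows.enumerated() {
        // Throws CancellationError if the surrounding Task was cancelled.
        try Task.checkCancellation()
        let meanSquare = window.isEmpty
            ? 0
            : window.reduce(Float(0)) { $0 + $1 * $1 } / Float(window.count)
        let rms = meanSquare.squareRoot()
        levels.append(rms > 0 ? 20 * log10(rms) : -.infinity)
        // Throttle UI updates: report every 100 windows and once at the end.
        if index % 100 == 99 || index == windows.count - 1 {
            onProgress(Double(index + 1) / Double(windows.count))
        }
    }
    return levels
}

let demoWindows = [[Float]](repeating: [Float](repeating: 0.1, count: 960), count: 250)
var progressReports: [Double] = []
let scannedLevels = try scanWindows(demoWindows) { progressReports.append($0) }
print(scannedLevels.count, progressReports.count) // 250 windows, 3 progress reports
```

Called from inside a Task started at utility priority, the same function stops at the next window once the task is cancelled, which is what lets the user abort a long job.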
Real-time variant — trim while recording
Some products want the trimmed file ready the instant the user stops recording. For those cases, do the detection on the tap output of AVAudioEngine.inputNode.installTap and stream the non-silent frames into an AVAudioFile as they arrive.
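import AVFoundation

// Assumes microphone permission is granted and the AVAudioSession is configured for recording.
// outURL is the destination file URL supplied by the caller.
let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)
// These are PCM settings, so pick a matching container such as .caf or .wav for outURL.
let output = try AVAudioFile(forWriting: outURL, settings: format.settings)
var silenceRun: TimeInterval = 0
let threshold: Float = -40
let minSilence: TimeInterval = 0.30
// bufferSize is only a request; the implementation may deliver a different size.
input.installTap(onBus: 0, bufferSize: 960, format: format) { buffer, _ in
    let db = rmsDbFS(buffer)
    let dur = Double(buffer.frameLength) / format.sampleRate
    if db < threshold {
        silenceRun += dur
        if silenceRun > minSilence { return } // drop long silences
    } else {
        silenceRun = 0
    }
    // Sketch only: a production tap should surface write errors instead of discarding them.
    try? output.write(from: buffer)
}
try engine.start()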
Trade-off: real-time trimming commits decisions you cannot undo — if the user later wants the full take, you do not have it. Store both the raw and trimmed files if your product might need the original later, or offer a “redo without trimming” button.
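The keep/drop decision inside the tap is easier to unit-test when it lives in a small value type. The sketch below mirrors the tap logic (keep speech, keep the first part of each pause, drop the rest); `SilenceGate` is a hypothetical name, and the demo uses 125 ms buffers with a 250 ms limit so the arithmetic is exact in floating point.

```swift
import Foundation

/// Decides, buffer by buffer, whether audio should be written to the output file.
struct SilenceGate {
    let thresholdDb: Float
    let minSilence: TimeInterval
    private(set) var silenceRun: TimeInterval = 0

    init(thresholdDb: Float = -40, minSilence: TimeInterval = 0.30) {
        self.thresholdDb = thresholdDb
        self.minSilence = minSilence
    }

    /// Returns true when the buffer should be kept.
    mutating func shouldKeep(db: Float, duration: TimeInterval) -> Bool {
        if db < thresholdDb {
            silenceRun += duration
            return silenceRun <= minSilence // keep only the start of a long pause
        }
        silenceRun = 0
        return true
    }
}

var gate = SilenceGate(thresholdDb: -40, minSilence: 0.25)
// Four 125 ms buffers of silence followed by one buffer of speech.
let bufferLevels: [Float] = [-60, -60, -60, -60, -20]
let keepDecisions = bufferLevels.map { gate.shouldKeep(db: $0, duration: 0.125) }
print(keepDecisions) // [true, true, false, false, true]
```

Inside the tap, the closure would call `shouldKeep` with the buffer's level and duration and write the buffer only when it returns true.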
When dBFS thresholds are not enough — VAD and ML
Amplitude thresholding fails in two common scenarios: loud background (HVAC, babble, traffic) and quiet speech (whispered language-learning prompts). Real voice-activity detection (VAD) gets much better results.
1. Apple’s SFSpeechRecognizer with on-device mode (iOS 13+) can tell you when speech is happening. Heavier than dBFS but high accuracy and fully offline. Great for language-learning and accessibility apps.
2. WebRTC VAD. A lightweight GMM-based classifier that works on 10, 20, or 30 ms frames of 16-bit PCM; it is C code that you call from Swift through a binding. Runs at tens of megabytes per second on an iPhone.
3. Silero VAD via Core ML. Modern neural VAD converted from PyTorch; around 0.5 ms per 30 ms window on an A15 chip. Best accuracy we have shipped.
Swap your detection layer for one of these and the rest of the pipeline (segmenter + renderer) stays identical — the abstraction we recommended at the architecture section pays off here.
On-device vs cloud processing — pick based on privacy and length
The entire pipeline we have shown runs on-device with no server calls. That is the right default: private, offline, and free of per-minute API cost. You should still think about the handful of cases where cloud makes sense:
- Recording > 2 hours. The user doesn’t want their phone warm for 90 s of processing; offload to a worker in your backend and notify via push when done.
- ML-heavy pipeline (VAD + transcription + chaptering). Combines easier in a backend job than on-device.
- Cross-device reuse. Trim once in the cloud, play on iPhone/iPad/web.
For anything under 10 minutes and with a privacy angle, keep it on-device. Users notice “processing in the cloud” spinners and trust them less for voice content.
Mini case — silence trim for a language-learning iOS app
On Input Logger, students record themselves reading prompts. The product scores the recording with a speech-recognition pipeline — and hesitation before the answer (3–8 seconds of quiet nerves) hurt the recognition accuracy badly.
Our three-week fix: added an on-device silence-trim pass with a VAD-based detector (Silero via Core ML) tuned at -38 dBFS / 350 ms minimum silence, stitched the remaining segments with 30 ms fades, and streamed progress to the UI. File sizes shrank by 41% on average; recognition accuracy against reference transcripts improved by 9 percentage points; user-perceived latency from “Stop” to “See my score” dropped from 6 s to 2.2 s. Total engineering effort: roughly 90 hours including QA, accelerated with our Agent Engineering workflow. Want a similar evaluation for your audio product? Book a 30-min review.
UX patterns — don’t surprise the user
Silence trimming silently edits the user’s voice. Four UX guardrails keep the feature from feeling creepy or destructive:
1. Toggle + persistent setting. Let the user disable trimming. Respect that choice between app launches.
2. Before/after stats. Show “Trimmed 14 s from your 2:03 recording”. Users trust the feature more when they can see what it did.
3. Undo. Keep the source file around for at least the current session. Users revert when an aggressive threshold clipped a pause they wanted.
4. Sensitivity slider. Advanced users love a “Gentle / Normal / Aggressive” preset that maps to threshold presets (-45 / -40 / -32 dBFS) and min-silence values.
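Guardrails 2 and 4 are mostly data plumbing. The sketch below maps the three presets to the thresholds listed above and builds the before/after line. The min-silence value for each preset is an assumption to tune, and `TrimSensitivity`, `clock`, and `trimSummary` are illustrative names.

```swift
import Foundation

/// Sensitivity presets. Thresholds follow the list above; min-silence values are assumptions.
enum TrimSensitivity: String, CaseIterable {
    case gentle, normal, aggressive

    var thresholdDb: Float {
        switch self {
        case .gentle: return -45
        case .normal: return -40
        case .aggressive: return -32
        }
    }

    /// Assumed pairing: gentler presets wait longer before treating a pause as silence.
    var minSilence: TimeInterval {
        switch self {
        case .gentle: return 0.40
        case .normal: return 0.30
        case .aggressive: return 0.25
        }
    }
}

/// Formats whole seconds as m:ss.
func clock(_ seconds: Int) -> String {
    let s = seconds % 60
    return "\(seconds / 60):\(s < 10 ? "0" : "")\(s)"
}

/// Builds the before/after line shown to the user.
func trimSummary(originalSeconds: Int, trimmedSeconds: Int) -> String {
    let removed = max(0, originalSeconds - trimmedSeconds)
    return "Trimmed \(removed) s from your \(clock(originalSeconds)) recording"
}

print(trimSummary(originalSeconds: 123, trimmedSeconds: 109))
// Trimmed 14 s from your 2:03 recording
```

Persisting the preset's raw value (for example in UserDefaults) is enough to respect the user's choice between launches.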
Testing — how we catch regressions
Ship a fixture library of 15–30 reference recordings covering the real world: quiet studio voice, coffee-shop background, bilingual speech, whispered prompts, overlapping speakers, music underscore. Run the full trim pipeline as a unit test and compare the output against ground-truth segments (manually annotated with Audacity). A regression that moves a segment boundary by more than 40 ms fails the build.
Add a perceptual smoke test: run the trim, play back, and flag any audible click or pop. We do this with a small energy-discontinuity detector that scans the first 10 ms after each cut in the output file.
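The boundary check in the regression suite can be as small as the sketch below. It assumes detected and annotated segments are already loaded as start/end pairs in seconds; `Span`, `maxBoundaryError`, and `passesRegression` are hypothetical names, and the 40 ms tolerance comes from the text above.

```swift
import Foundation

/// A segment as a start/end pair in seconds.
struct Span: Equatable {
    let start: Double
    let end: Double
}

/// Largest boundary deviation between detected and annotated spans,
/// or nil when the lists differ in length (a structural regression).
func maxBoundaryError(detected: [Span], truth: [Span]) -> Double? {
    guard detected.count == truth.count else { return nil }
    return zip(detected, truth)
        .map { max(abs($0.start - $1.start), abs($0.end - $1.end)) }
        .max() ?? 0
}

/// Passes when every boundary is within the tolerance.
func passesRegression(detected: [Span], truth: [Span], tolerance: Double = 0.040) -> Bool {
    guard let error = maxBoundaryError(detected: detected, truth: truth) else { return false }
    return error <= tolerance
}

let annotated = [Span(start: 0.50, end: 2.00), Span(start: 3.10, end: 5.00)]
let goodRun = [Span(start: 0.52, end: 2.01), Span(start: 3.10, end: 5.03)]
let driftedRun = [Span(start: 0.52, end: 2.01), Span(start: 3.10, end: 5.06)]
print(passesRegression(detected: goodRun, truth: annotated))    // true
print(passesRegression(detected: driftedRun, truth: annotated)) // false
```

Treating a different segment count as a failure catches the worst regressions, where a pause is missed entirely or a sentence is split in two.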
A decision framework — what to ship in five questions
1. Does your recording live on-device only? Yes → on-device pipeline. No → consider cloud for long content.
2. Is the audio quality controlled (studio, headset, good mic)? Yes → dBFS thresholding is fine. No → VAD (Silero Core ML, WebRTC, SFSpeechRecognizer).
3. Do users need to undo? Yes → always keep the source file for the session. No → real-time tap-based trimming is cheaper.
4. How long is a typical recording? < 5 min → on-device, post-record. 5–30 min → on-device with progress UI. > 30 min → cloud pipeline with push notification.
5. Is audio paired with video? Add a parallel video track in the AVMutableComposition using the same segment time ranges; export as .mov.
Five pitfalls we keep finding in audits
1. Hard-coded thresholds. A threshold tuned for a studio noise floor (-50 dBFS) will miss every pause in a café (-25 dBFS), because the background never drops below it. Auto-calibrate from the first 500 ms of the recording if you can’t expose a slider.
2. No fade on cuts. Hard cuts produce clicks on sensitive speakers; 20–50 ms fades eliminate them.
3. Scanning on the main thread. The UI freezes and users retry, kicking off a second scan. Always dispatch to a utility queue or an async Task.
4. Trimming the source permanently. Users complain; you now have to restore an original you deleted. Keep source files for at least one edit cycle.
5. Ignoring video when present. An audio-only trim on a video file produces a lip-sync disaster. Add the video track to the same composition with matching time ranges.
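For pitfall 1, auto-calibration can be sketched as "measure the noise floor at the start, then add a margin". Everything numeric below other than the 500 ms window and the -40 dBFS fallback is an assumption to tune: the 10 dB margin, the -60...-20 dBFS clamp, and the premise that the opening half second contains no speech. `calibratedThreshold` is a hypothetical helper.

```swift
import Foundation

/// Estimates a silence threshold from the per-window levels at the start of a recording.
func calibratedThreshold(
    levelsDb: [Float],
    windowSeconds: Double = 0.020,
    calibrationSeconds: Double = 0.5,
    marginDb: Float = 10,   // assumed margin above the noise floor
    fallbackDb: Float = -40
) -> Float {
    let count = min(levelsDb.count, Int((calibrationSeconds / windowSeconds).rounded()))
    // Ignore digital silence (-infinity), which would drag the estimate down.
    let finite = levelsDb.prefix(count).filter { $0.isFinite }.sorted()
    guard !finite.isEmpty else { return fallbackDb }
    // Upper median: robust to a single click or cough in the calibration window.
    let floorDb = finite[finite.count / 2]
    // Assumed clamp so a noisy or silent start cannot produce an extreme threshold.
    return min(-20, max(-60, floorDb + marginDb))
}

let studio = [Float](repeating: -58, count: 25) + [Float](repeating: -18, count: 50)
let cafe = [Float](repeating: -27, count: 25) + [Float](repeating: -12, count: 50)
print(calibratedThreshold(levelsDb: studio)) // -48.0
print(calibratedThreshold(levelsDb: cafe))   // -20.0 (clamped)
```

If users tend to start talking immediately, the opening window is not a reliable floor; in that case a low percentile over the whole recording may be a better estimate.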
KPIs — what to measure after shipping
Quality KPIs. Segment-boundary accuracy vs ground truth (target < 80 ms median error), click-detection rate on cuts (target 0), and file-size reduction on a reference 50-recording suite (target 30–50%).
Business KPIs. Share-rate uplift on voice content (trimmed clips get shared more), retention delta on cohorts with trimming enabled, and transcription accuracy lift for apps that score speech.
Reliability KPIs. Export-failure rate (target < 0.5%), median time-to-trimmed for a 2-minute recording (target < 2 s on iPhone 13+), and memory high-water-mark on a 30-minute job (target < 80 MB RSS).
When not to ship silence trimming
1. Music or mixed-content products. A quiet bridge in a song is not a pause; trimming destroys the artistic intent. Disable by default on anything labelled music.
2. Legal-grade or courtroom audio. Any deletion from a recording creates chain-of-custody issues. Never trim evidential audio.
3. Accessibility-critical recordings. Users with speech conditions may need their pauses preserved verbatim. Provide an opt-out and respect it persistently.
FAQ
What’s a reasonable default silence threshold for an iOS app?
-40 dBFS with a 300 ms minimum run length covers most consumer recording environments (home office, moderate background noise). Auto-calibrate by sampling the first 500 ms of the recording if your users go from studio to coffee shop.
Do I need CoreAudio or can I stay entirely in AVFoundation?
AVFoundation covers everything: AVAudioFile/AVAudioPCMBuffer for decode and level detection, AVMutableComposition + AVMutableAudioMix for cut-stitch-fade, and AVAssetExportSession for export. Drop to CoreAudio only when you need custom sample-rate conversion, exotic codecs, or real-time DSP on a dedicated queue.
Can I trim silence in real time from the microphone stream?
Yes — install a tap on AVAudioEngine.inputNode, compute dBFS per buffer, and only write non-silent buffers to your output AVAudioFile. You lose the ability to undo, but latency drops to essentially zero because the trim happens while the user records.
Why do my cuts produce audible clicks?
You are cutting at non-zero sample values and the discontinuity shows up as a broad-spectrum click. Add a 20–50 ms fade-out before each cut and a matching fade-in after it, using setVolumeRamp on the AVMutableAudioMixInputParameters you attach to the AVMutableAudioMix. That alone eliminates the problem in most cases.
Does silence trimming work on video files the same way?
Yes, with one important addition: insert the video track into the same AVMutableComposition using the same CMTimeRanges you use for audio. Otherwise your lip-sync breaks. Export with AVAssetExportPresetHighestQuality and outputFileType = .mov.
What about Apple Speech and Siri — can they trim silence for me?
Not directly. SFSpeechRecognizer gives you word timings which you can use as a VAD signal — excellent accuracy, no extra ML work — but it does not produce an edited audio file. You still run the segmenter/composer we described.
How much CPU and battery does this cost on-device?
On an iPhone 13 and newer, an RMS-based scan processes 1 minute of 48 kHz audio in roughly 60 ms. Silero VAD via Core ML costs about 3× that. Export with AVAssetExportSession is I/O-bound — roughly half the recording duration on AAC reencodes. End-to-end battery impact on a 5-minute recording is under 1% on a modern phone.
How long does it take to ship production-grade silence trimming?
On an existing iOS audio product, 5–8 engineering days for an RMS-based pipeline with fades, progress UI, and unit tests. Add 3–5 days for VAD (Core ML Silero or WebRTC VAD) if amplitude thresholding is not accurate enough. Fora Soft typically lands the full scope in under two sprints using our Agent Engineering-accelerated workflow.
Ready to ship silence trimming in your iOS audio product?
The algorithm is simple but the UX and tuning are not. Use AVAudioFile + AVAudioPCMBuffer to scan in short windows, threshold with a sensible default (-40 dBFS / 300 ms), compose with AVMutableComposition plus short fades, and export with AVAssetExportSession. Layer on a VAD for noisy environments, keep the source file for undo, and surface before/after stats so users trust the edit.
If you want a team that has built this into language-learning, voice-research, and podcasting iOS apps, Fora Soft has the Swift templates and the QA recordings ready.


