Why this matters

If you run a telemedicine platform, an online classroom, a contact centre, or any conferencing product, the recording and transcription features are the ones your customers ask for first and the ones your engineers under-budget worst. The live call is tuned for the lowest possible delay; recording and transcription are a completely different job with a completely different cost curve, and teams routinely bolt them on without realising they have just doubled their server bill or quietly created a folder of unencrypted patient audio. This article is for the product manager, founder, or operations lead who needs to understand where the audio goes once it leaves the live call, so you can read a recording architecture, ask why it mixes or doesn't, and know which branch creates a compliance problem before a regulator finds it. A senior engineer will also find every claim traced to the W3C MediaStream Recording specification, the relevant IETF RFCs, and the published behaviour of Whisper, WhisperX, LiveKit Egress, and the major real-time transcription APIs. By the end you will know the three recording branches, the one-way street that turns voice into a transcript, and the exact moment each choice stops being cheap.


The call and the copy are two different jobs

Hold one idea before anything else, because the whole article hangs on it. A live call has one job: get my voice to your ear as fast as possible. Everything in the real-time audio pipeline — the jitter buffer, the packet loss concealment, the discontinuous transmission that stops sending when you go quiet — exists to shave milliseconds and survive a bad network. None of it cares about keeping a perfect copy. It is built to be heard once and forgotten.

A recording has the opposite job: keep a faithful copy that someone will play back later, possibly in court, possibly years from now. A transcript has a third job again: turn the sound into words and throw the sound away. These are three different jobs with three different definitions of "good", and the audio that is perfect for one is wrong for the others. Live audio that dropped a quiet half-second to save bandwidth is fine to hear and a problem to transcribe. A recording optimised for storage size is fine to archive and a problem to feed an old phone line.

So the question this article answers is not "how do I record a call" — it is "where does the audio leave the live path, and what happens to it after it leaves." That fork is the single most important architectural decision in the recording feature, and most teams make it by accident.

[Image: A WebRTC call shown as a fast live path from microphone to speaker across the top, with three branch points peeling off downward. The first branch leaves at the speaker's device and is labelled client-side recording. The second and third branches leave at the server: one labelled server-side per-participant keeps four separate voice tracks, the other labelled server-side mixed combines four voices into one track. A fourth arrow continues from the server into a transcription box labelled ASR, which outputs text. Each branch is annotated with its main trade-off.]

Figure 1. Where audio leaves the call. The live path (top) is tuned for speed. Recording and transcription are separate pipelines that branch off it: at the device, or at the server, keeping voices separate or mixing them. The transcription branch goes one step further and converts sound to text.


Branch one: client-side recording (the audio never leaves the device)

The simplest place to capture a call is on the device that is already playing it. The browser hands you a tool for exactly this, the MediaRecorder — a built-in recorder, defined in the W3C MediaStream Recording specification, that takes a live audio or video stream and writes it to a file without any server involved. You point it at the same microphone stream the call is using, press start, and it hands you chunks of an encoded file as it goes.

Two details of that specification matter for anyone designing a feature around it. First, the recorder does not have to hold the whole file until the end. Left to its defaults it gathers everything into a single Blob that it delivers when recording stops, but you can ask it to deliver the recording in slices by passing a timeslice value: a number of milliseconds after which it fires a dataavailable event carrying the latest chunk of recorded media as a Blob. A timeslice of 1000 means "hand me a piece of the recording every second." This is how a client-side recorder uploads a long call as it happens instead of holding the entire file in memory until the end. The specification warns explicitly that too large a time slice forces the browser to buffer a large amount of data, which is the textbook way a long recording crashes a browser tab.
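To put rough numbers on that buffering, here is a minimal sketch of how much encoded audio sits in memory between hand-offs. It assumes a 32 kbit/s voice stream and ignores container overhead; the chunk_bytes helper is a hypothetical name used only for this illustration, not part of any browser API.

```python
def chunk_bytes(bitrate_bps, timeslice_ms):
    """Approximate payload of one recorded chunk, ignoring container overhead."""
    return bitrate_bps * timeslice_ms // 8 // 1000


# Assumed figures for illustration: 32 kbit/s voice, a one-hour call.
BITRATE_BPS = 32_000
ONE_HOUR_MS = 60 * 60 * 1000

per_second_slice = chunk_bytes(BITRATE_BPS, 1000)      # timeslice of 1000 ms
whole_call = chunk_bytes(BITRATE_BPS, ONE_HOUR_MS)     # nothing handed off until the end

print(f"buffered per 1 s slice: {per_second_slice:,} bytes")
print(f"buffered for a full hour: {whole_call:,} bytes")
```

The gap between the two figures grows with every stream the tab records, and video bitrates are far higher than this audio-only assumption.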

Second, you do not get to choose any audio format you like. The recorder writes whatever container and codec the browser supports, declared through its mimeType — typically a WebM container carrying Opus audio, the same codec the live call already uses. The W3C-defined list of codec identifiers the recorder can expose includes opus and pcm, but which ones are actually available is the browser's call, not yours.

What client-side recording costs

The appeal is obvious: the server does nothing, so the server costs nothing. The recording is also the highest-fidelity copy of what that one person heard or said, because it is captured before the audio ever crosses a network. For an audio-only telehealth note that the clinician keeps on their own machine, this is genuinely the cleanest option — the recording stays under the clinician's direct control with no third party in the path.

The costs are three, and they are the reason client-side recording rarely survives contact with a real product. First, it breaks the moment the tab closes. If the person navigates away, refreshes, or their laptop sleeps, the recording stops and whatever was in the buffer can be lost. Second, it only captures one person's view. A client recording on my device has my microphone and the mixed audio I received; it is not a clean separate copy of each participant. Third, it does not scale to many participants — every device recording its own copy means many partial files to collect, reconcile, and stitch, and client-side approaches are widely reported to fall apart above a couple of hundred participants. Client-side recording is the right answer for a single user capturing their own session, and the wrong answer for anything you have to guarantee.


Branch two: server-side per-participant recording (keep every voice separate)

The moment you put a server in the audio path — and above a handful of participants you always do, almost always a Selective Forwarding Unit (SFU) — you get a second, far more reliable place to capture audio. The SFU already receives every participant's voice stream. Server-side per-participant recording simply tells the server to write each of those streams to its own file.

The result is one audio file per speaker, each a clean, isolated copy of that person's microphone, untouched by anyone else's voice. This is the gold-standard input for almost everything you might do later. You can transcribe each track separately and know exactly who said each word without any guessing — the speaker identity is built into the file, not inferred. You can re-mix the call afterwards with different volume balances. You can redact one participant. You can apply noise suppression to a noisy speaker without touching the others. None of that is possible once voices are blended together.

There is an important subtlety about how the server captures these streams. An SFU forwards audio without decoding it — that is the whole point of an SFU, it never turns the compressed packets back into sound. So a per-participant recorder has two options. It can store the raw forwarded packets exactly as they arrive, which is cheap but produces a file with the gaps and quirks of the live stream baked in, including the silent stretches that DTX left behind. Or it can subscribe to each track as a participant would, decode it, and write a continuous clean file — which costs CPU but produces a recording you can actually play and transcribe reliably. Production systems that promise a usable recording do the second.
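The second option can be sketched in a few lines, assuming the frames are already decoded and each one carries a start time. The 20 ms frame length, the 16 kHz sample rate, and the fill_gaps name are assumptions made for this illustration, and a production recorder would more likely insert comfort noise than pure zeros.

```python
FRAME_MS = 20                                   # assumed frame duration
SAMPLE_RATE = 16_000                            # assumed decoded sample rate, mono
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000


def fill_gaps(frames):
    """Turn sparse decoded frames into one continuous track.

    frames: list of (start_ms, samples) sorted by start_ms.
    Wherever no frame arrived (for example during DTX silence),
    the gap is filled with zero-valued samples of the same duration.
    """
    out = []
    cursor_ms = frames[0][0] if frames else 0
    for start_ms, samples in frames:
        missing_frames = (start_ms - cursor_ms) // FRAME_MS
        out.extend([0] * (missing_frames * SAMPLES_PER_FRAME))
        out.extend(samples)
        cursor_ms = start_ms + FRAME_MS
    return out


# Three real frames with a 60 ms hole between the second and third.
voice = [1] * SAMPLES_PER_FRAME
track = fill_gaps([(0, voice), (20, voice), (100, voice)])
print(len(track) / SAMPLE_RATE, "seconds of continuous audio")
```

Storing the raw packets skips this step entirely, which is why that option is cheaper and why its output keeps the holes.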

The cost of keeping voices separate

Per-participant recording costs storage and orchestration rather than mixing CPU. Each speaker is a separate file, so storage grows linearly with the number of people who talk. In a tool like LiveKit's Egress service, a separate recording component joins the room as a hidden participant, subscribes only to the tracks it needs, and writes them out, which is why recording on an SFU is never free even when the live forwarding is cheap. You are paying for a second consumer of every stream.

The numbers matter when calls get long. A single Opus voice track running at a typical 32 thousand bits per second produces, per hour:

32,000 bits/second ÷ 8 bits/byte = 4,000 bytes/second
4,000 bytes/second × 3,600 seconds = 14,400,000 bytes ≈ 14.4 MB per hour

That is one speaker. A four-person call recorded per-participant is four of those files, about 57.6 MB per hour, plus whatever video you keep. Multiply by every call your platform records, every day, retained for however long your compliance rules demand, and the per-participant tax becomes a real line item. The deep version of this arithmetic — across multiple languages and codecs — lives in Storage and CDN Math for Audio.
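The same arithmetic can be wrapped in a small function so the retention multiplier is visible. This is a sketch only: the daily call volume and the one-year retention period below are made-up inputs, and the function names are hypothetical.

```python
def track_bytes_per_hour(bitrate_bps):
    """Bytes written per hour for one audio track at a constant bitrate."""
    return bitrate_bps // 8 * 3600


def archive_bytes(bitrate_bps, speakers, call_hours_per_day, retention_days):
    """Total audio retained when every speaker is stored as a separate track."""
    per_call_hour = track_bytes_per_hour(bitrate_bps) * speakers
    return per_call_hour * call_hours_per_day * retention_days


one_track = track_bytes_per_hour(32_000)
four_speakers = one_track * 4

# Hypothetical platform: 100 call-hours recorded per day, kept for 365 days.
retained = archive_bytes(32_000, speakers=4, call_hours_per_day=100, retention_days=365)

print(f"one track: {one_track / 1e6:.1f} MB per hour")
print(f"four speakers: {four_speakers / 1e6:.1f} MB per hour")
print(f"retained audio: {retained / 1e12:.2f} TB")
```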


Branch three: server-side mixed recording (combine every voice into one)

The third branch is the one most people picture when they say "record the call": one file with everyone's voice in it, ready to play in any media player. To produce it, the server must do the expensive thing an SFU normally refuses to do — decode every participant's audio back into raw sound, add the sounds together into a single mix, and re-encode that mix into one file. This is the same decode-mix-re-encode work a Multipoint Control Unit (MCU) does to run a call, applied here only to the recording.
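The "add the sounds together" step is literal. A minimal sketch, assuming every track has already been decoded to 16-bit PCM at the same sample rate and is aligned in time; the mix_tracks name is an illustration, and real mixers also handle gain, resampling, and timing drift.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix_tracks(tracks):
    """Sum decoded PCM tracks sample by sample and clamp to the int16 range.

    tracks: list of lists of integer samples; shorter tracks are treated
    as silent once they run out.
    """
    length = max(len(t) for t in tracks)
    mixed = []
    for i in range(length):
        total = sum(t[i] for t in tracks if i < len(t))
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


mixed = mix_tracks([
    [1000, 2000, 30000],
    [500, -2000, 30000],
    [10],
])
print(mixed)
```

Each output sample is a single sum, so the individual contributions are no longer stored anywhere in the file.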

The output is the most convenient possible artifact. One file plays anywhere. There is no stitching, no reconciling timestamps across separate tracks, no client to coordinate. For a webinar archive, a podcast-style recording, or any case where a human will simply press play, mixed recording is the natural fit. It is also the only format a "dumb" consumer can handle — if your recording has to be played by a system that can manage exactly one audio stream, mixing is mandatory.

Why mixed recording is the expensive branch

The convenience is paid for in CPU and in lost information, and both are permanent. The CPU cost is the decode-and-re-encode for every stream, continuously, for the whole call — the same workload that makes an MCU far more expensive per room than an SFU. LiveKit's room-composite recording, for example, literally launches a headless Chrome instance that joins the room, renders the layout, and encodes the result with GStreamer; that is a whole browser per recording, not a cheap byte copy.

The information cost is worse because it cannot be undone. Once four voices are added into one waveform, you can never cleanly separate them again. You cannot reliably transcribe "who said what" from a mix — you can only guess, using speaker diarization, which is itself an error-prone extra step (more on that below). You cannot turn one person down. You cannot redact a single participant without re-recording. Every downstream feature that needs to know who spoke is now fighting against a format that deliberately threw that information away.

This is the single most common recording mistake we see: a team picks mixed recording because it is the easy demo, ships it, and then a customer asks for per-speaker transcripts or the ability to remove one participant for privacy — and the answer is "we'd have to re-architect the recording." Decide whether you need per-speaker information before you mix, because mixing is a one-way door.

Pitfall: the silent-mic mix surprise. Mixed recording on a call that uses DTX can produce a file where a speaker's track has gaps — DTX stops transmitting during silence to save bandwidth, and a naive mixer records those gaps as dead air or a jarring jump. A correct recording mixer inserts comfort noise or holds the level across DTX gaps. If your recordings sound like people are cutting in and out when they pause, suspect the DTX-to-mix handoff, not the microphones.


A side-by-side on the three recording branches

| Criterion | Client-side | Server, per-participant | Server, mixed |
| --- | --- | --- | --- |
| Where capture happens | The user's device | The media server (SFU) | The media server (MCU-style) |
| Server CPU cost | None | Low (decode per track) | High (decode + mix + re-encode) |
| Storage shape | One file per device | One file per speaker | One combined file |
| Knows "who said what" | Only this user | Yes, built in | No, must guess |
| Survives tab close | No | Yes | Yes |
| Plays in any player | Yes | Not directly | Yes |
| Best for transcription | Weak | Strongest | Weakest |
| Re-mix / redact later | No | Yes | No |
| Typical use | Solo self-capture | Compliance, transcription | Webinar, podcast archive |

The pattern in that table is the whole decision. If a human will press play and nobody needs to know who spoke, mix it. If a machine will read it, or a regulator might, or you need per-speaker anything, keep the voices separate and pay the storage. If it is one person capturing their own session and reliability does not matter, do it on the device.
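Written as a rule of thumb, that decision might look like the sketch below. The three yes/no inputs and the choose_branch name are a simplification invented for this illustration; a real scoping exercise weighs cost, retention, and legal advice as well.

```python
def choose_branch(needs_per_speaker, must_be_reliable, single_user):
    """Map the three questions from the comparison table to a recording branch."""
    if needs_per_speaker:
        # A machine will read it, a regulator might, or someone needs redaction.
        return "server, per-participant"
    if single_user and not must_be_reliable:
        # One person capturing their own session, loss is tolerable.
        return "client-side"
    # A human will press play and nobody needs to know who spoke.
    return "server, mixed"


print(choose_branch(needs_per_speaker=True, must_be_reliable=True, single_user=False))
print(choose_branch(needs_per_speaker=False, must_be_reliable=False, single_user=True))
print(choose_branch(needs_per_speaker=False, must_be_reliable=True, single_user=False))
```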


The fork inside the fork: recording versus transcription

Up to now we have followed audio that stays as audio. Transcription is a different destination, and the audio that goes there takes a turn that surprises most people: the speech recognition engine does not want your nice recording. It wants something smaller, cruder, and shaped for a machine, and it produces that shape by deliberately destroying most of what makes the recording sound good.

This is worth stating plainly because it reframes the whole feature. A recording and a transcript are not the same file at different quality levels. They are two products of the same voice that diverge early and never meet again. The recording branch tries to preserve the sound. The transcription branch tries to extract the words and discards the sound on purpose.


How audio becomes text: the ASR audio path

Automatic speech recognition, called ASR — the technology that turns spoken audio into written words — runs the audio through a short pipeline before any "recognition" happens at all. Understanding that pipeline tells you why a transcript can be wrong even when the recording sounds perfect, and why the audio you feed an ASR engine has to be prepared differently from the audio you archive.

Step one: resample to 16 kHz

A live WebRTC call almost always carries Opus audio. Internally, as RFC 6716 specifies, Opus runs its main transform at 48 thousand samples per second — a high rate that captures the full range of human hearing, which is exactly what you want for music and for a faithful recording. Speech recognition does not want it. Human speech lives almost entirely below 8 thousand cycles per second, and a foundational result of digital audio (covered in What Is Digital Audio) says you only need to sample at twice the highest frequency you care about. So almost every modern ASR engine — Whisper included — first resamples the audio down to 16 thousand samples per second. Everything above 8 kHz is thrown away. For music that would be vandalism; for speech it is free, because the words were never up there.
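One way to picture the resampling step is a low-pass filter followed by keeping every third sample, since 48,000 / 3 = 16,000. The sketch below is illustrative only: the 63-tap Hamming-windowed filter and the 7 kHz cutoff are assumptions chosen for the demonstration, not what any particular ASR engine uses, and real pipelines normally hand this job to a tested resampling library.

```python
import math


def lowpass_taps(cutoff_hz, fs_hz, num_taps=63):
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def downsample_48k_to_16k(samples):
    """Filter out content above the cutoff, then keep every third sample."""
    taps = lowpass_taps(cutoff_hz=7_000, fs_hz=48_000)
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), 3):
        acc = 0.0
        for k, tap in enumerate(taps):
            i = center + k - half
            if 0 <= i < len(samples):      # treat audio outside the buffer as silence
                acc += tap * samples[i]
        out.append(acc)
    return out


def tone(freq_hz, num_samples=4800, fs_hz=48_000):
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(num_samples)]


def rms(samples):
    return math.sqrt(sum(v * v for v in samples) / len(samples))


speech_band = downsample_48k_to_16k(tone(1_000))    # inside the speech range
above_band = downsample_48k_to_16k(tone(12_000))    # above the new 8 kHz limit

print("1 kHz tone level after resampling:", round(rms(speech_band[50:-50]), 3))
print("12 kHz tone level after resampling:", round(rms(above_band[50:-50]), 5))
```

The filter matters: dropping samples without it would fold the content above 8 kHz back into the speech band as aliasing instead of removing it.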

Step two: cut the audio into tiny overlapping frames

The engine then chops the 16 kHz audio into very short, overlapping windows. Whisper, the widely used open speech model from OpenAI, uses a 25-millisecond window that slides forward 10 milliseconds at a time. Each window is short enough that the sound inside it is roughly steady — a single fragment of a vowel or consonant rather than a whole word. The 10-millisecond slide means the windows overlap, so nothing falls through the cracks between frames. This is the same frame-and-packet idea that underlies all digital audio, explained in Frames, Packets, Granules; ASR just uses its own frame size tuned to the rhythm of speech.
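In samples, those two figures become a 400-sample window and a 160-sample step at 16 kHz. A minimal sketch of the framing follows; frame_starts is a hypothetical helper, and it does no edge padding, whereas real feature extractors typically pad the signal, so their frame counts can differ by a few frames.

```python
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE * 25 // 1000    # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000       # 10 ms step   -> 160 samples


def frame_starts(num_samples):
    """Start index of every full window that fits, with no edge padding."""
    if num_samples < WINDOW:
        return []
    return list(range(0, num_samples - WINDOW + 1, HOP))


starts = frame_starts(SAMPLE_RATE)   # one second of audio
overlap = WINDOW - HOP               # samples shared by neighbouring frames

print("frames in one second:", len(starts))
print("samples shared between neighbours:", overlap)
```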

Step three: turn each frame into a log-mel spectrogram

Here is the step that throws the sound away. Each tiny frame is converted from a wave in time into a measure of how much energy sits in each band of frequency, then those bands are squeezed onto a perceptual scale and their loudness is compressed with a logarithm. The result is called a log-mel spectrogram, and it is the actual input to the recognition model — not the audio.

The "mel" part is a frequency scale built to match human hearing rather than physics. Our ears tell low pitches apart far better than high ones, so the mel scale packs many narrow bands down low and a few wide bands up high, spacing them the way we actually perceive pitch. Whisper uses 80 of these mel bands. The "log" part compresses the loudness range the same way our ears do, so a whisper and a shout end up closer together than their raw energy would suggest — the same logarithmic perception that underlies loudness measurement in LUFS.
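The uneven spacing is easy to see numerically. The sketch below uses one common textbook definition of the mel scale, mel = 2595 · log10(1 + f / 700); that formula, the 0 to 8,000 Hz range, and the 1e-10 floor in the log step are assumptions for illustration, and a given engine may use a different mel variant.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands=80, f_min=0.0, f_max=8000.0):
    """Edge frequencies in Hz for num_bands triangular filters.

    The points are evenly spaced on the mel scale, so they are
    unevenly spaced in Hz. num_bands filters need num_bands + 2 points.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_bands + 2)]


def log_compress(energy, floor=1e-10):
    """Logarithmic loudness compression with a floor to avoid log(0)."""
    return math.log10(max(energy, floor))


edges = mel_band_edges()
lowest_spacing = edges[1] - edges[0]
highest_spacing = edges[-1] - edges[-2]

print(f"spacing at the bottom: {lowest_spacing:.1f} Hz")
print(f"spacing at the top: {highest_spacing:.1f} Hz")
print("100x more energy adds", log_compress(100.0) - log_compress(1.0), "to the log value")
```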

Put the numbers together and you can see exactly how much is discarded. Whisper processes audio in 30-second chunks. At a 10-millisecond frame step, 30 seconds is 3,000 frames, and each frame is described by 80 mel numbers:

30 seconds ÷ 0.010 seconds/frame = 3,000 frames
3,000 frames × 80 mel bands = 240,000 numbers per 30-second chunk

A 30-second stretch of 16 kHz mono audio is 16,000 raw samples per second, so 480,000 samples in 30 seconds. The log-mel representation describes that same half-minute in 240,000 numbers. The engine has halved the count of numbers and, crucially, the reduction is one-way: the spectrogram keeps how much energy sat in each band and discards the phase and fine detail of the waveform, so you cannot recover the original audio from it. The transcript is built from a fingerprint of the voice, not the voice.

Why log-mel and not the older MFCC

If you read older speech-recognition material you will meet a close cousin, the MFCC — mel-frequency cepstral coefficients — which adds one more mathematical step (a discrete cosine transform) on top of the log-mel spectrogram to compress and de-correlate it further. For decades MFCCs were the undisputed standard. Modern deep-learning ASR engines, Whisper among them, dropped that last step and feed the log-mel spectrogram directly. The reason is simple: a large neural network is powerful enough to learn the patterns itself, so giving it the less-processed log-mel features lets it find correlations the fixed MFCC math would have thrown away. The trend across the field is toward feeding the model less hand-engineered audio, not more.

[Image: A left-to-right signal-flow diagram of the ASR audio path. It starts with an Opus stream at 48 kHz, flows into a resampler that outputs 16 kHz, then into a framing block labelled 25 ms window, 10 ms step, then into a mel filterbank with 80 bands, then a log compression block, producing a log-mel spectrogram of shape 80 by 3000 for a 30-second chunk. That feeds an ASR model box which outputs text. A red one-way arrow under the spectrogram is labelled cannot be reversed to audio.]

Figure 2. The ASR audio path. Opus at 48 kHz is resampled to 16 kHz, cut into 25 ms frames every 10 ms, turned into 80 log-mel bands, and stacked into an 80×3000 spectrogram per 30-second chunk. From there the model reads features, not sound, and the conversion is one-way.


Who said that? Diarization, the step mixing makes hard

A transcript that reads as one undifferentiated wall of text is far less useful than one that says "Dr. Lee:" and "Patient:" before each line. Working out who spoke each segment is a separate problem from working out what was said, and it has its own name: speaker diarization — partitioning the audio into stretches and labelling each with a speaker. Diarization tools such as pyannote and NeMo produce their answer in a standard format called RTTM, which simply lists, for each segment, a start time, a duration, and a speaker label. A pipeline like WhisperX then merges that timeline with Whisper's word timings so each word carries a speaker tag.
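A minimal sketch of that merge, assuming RTTM lines in the usual ten-field SPEAKER layout (onset in the fourth field, duration in the fifth, speaker name in the eighth) and word timings as (word, start, end) tuples. Picking the speaker with the largest time overlap is a simple heuristic chosen for this illustration; it is not presented as WhisperX's actual algorithm, and the sample data is invented.

```python
def parse_rttm(text):
    """Return (start_s, end_s, speaker) for every SPEAKER line."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        segments.append((start, start + duration, speaker))
    return segments


def label_words(words, segments):
    """Tag each (word, start_s, end_s) with the speaker it overlaps most."""
    labelled = []
    for word, w_start, w_end in words:
        best, best_overlap = None, 0.0
        for s_start, s_end, speaker in segments:
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((best, word))   # best stays None if nothing overlaps
    return labelled


RTTM = """\
SPEAKER call1 1 0.00 2.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER call1 1 2.50 3.00 <NA> <NA> SPEAKER_01 <NA> <NA>
"""
WORDS = [("hello", 0.2, 0.6), ("doctor", 2.4, 2.9), ("hi", 3.0, 3.2)]

segments = parse_rttm(RTTM)
tagged = label_words(WORDS, segments)
print(tagged)
```

The word "doctor" straddles the boundary between the two segments, which is exactly where a heuristic like this can assign the wrong speaker.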

Now the recording decision from earlier comes back to bite. If you recorded per-participant, you never need diarization at all — each track is one speaker, so "who said what" is already answered, perfectly, for free. If you recorded mixed, the speakers are blended into one waveform and diarization has to reconstruct the boundaries by ear, which is error-prone: it confuses similar voices, fumbles overlapping speech, and adds latency and cost. This is the concrete downstream price of mixing that the comparison table hinted at. Choosing per-participant recording is, among other things, choosing to never have a diarization problem.
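With per-participant tracks, building the labelled transcript reduces to a sort. A sketch under the assumption that each track was transcribed separately into (start time, text) segments on a shared clock; the speaker names, the sample lines, and the merge_tracks helper are illustrative.

```python
def merge_tracks(track_transcripts):
    """Interleave per-speaker segments into one transcript ordered by start time.

    track_transcripts: {speaker_label: [(start_s, text), ...]}
    """
    lines = [
        (start, speaker, text)
        for speaker, segs in track_transcripts.items()
        for start, text in segs
    ]
    lines.sort(key=lambda item: item[0])
    return [f"{speaker}: {text}" for _, speaker, text in lines]


transcript = merge_tracks({
    "Dr. Lee": [(0.0, "How are you feeling?"), (6.2, "Any fever?")],
    "Patient": [(3.1, "Better, thanks.")],
})
print("\n".join(transcript))
```

The speaker label comes from which file the words were found in, so no step in this code can confuse two voices.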


The other clock: real-time captions

Everything so far described audio leaving a call to be processed afterwards. Live captions are a harder variant of the same pipeline, because now the transcript has to appear while people are still talking, and a new constraint dominates: latency. The audio path is the same — resample, frame, log-mel, model — but it runs continuously on a sliding window of incoming audio, and it has to commit to words before the sentence is finished.

This forces a design every real-time transcription system shares: partial results and final results. A partial result is the engine's best guess so far, shown immediately and allowed to change as more audio arrives; a final result is the committed transcript for a finished stretch of speech. The caption you see updating itself a word at a time is the partials; the version that stops wobbling and stays put is the final. Industry real-time engines typically deliver partials in 200 to 500 milliseconds and finals a few hundred milliseconds after the speaker stops, with sub-300-millisecond partials and roughly 700-millisecond finals being a common benchmark on short utterances.
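On the receiving side, the partial/final split usually means keeping two pieces of state: the committed text and one replaceable tail. A minimal sketch follows; the CaptionBuffer class and the (text, is_final) event shape are assumptions for illustration, since each real-time API defines its own message format.

```python
class CaptionBuffer:
    """Holds committed caption lines plus one partial that may still change."""

    def __init__(self):
        self.final_lines = []
        self.partial = ""

    def on_result(self, text, is_final):
        if is_final:
            # Commit the finished stretch of speech and clear the tail.
            self.final_lines.append(text)
            self.partial = ""
        else:
            # A newer guess replaces the previous one wholesale.
            self.partial = text

    def render(self):
        tail = [self.partial] if self.partial else []
        return " ".join(self.final_lines + tail)


captions = CaptionBuffer()
captions.on_result("hel", is_final=False)
captions.on_result("hello there", is_final=False)
captions.on_result("Hello there.", is_final=True)
captions.on_result("how", is_final=False)
print(captions.render())
```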

There is a genuine trade-off baked into that timing, and it is worth understanding before you set expectations with users. Showing a partial faster means showing a less certain guess that may visibly correct itself. Waiting for a final means a steadier caption that lags further behind the speaker. Live accessibility captions usually favour speed and tolerate the occasional self-correction, because a caption that arrives after the moment has passed is worse than one that fixes itself. A legal transcript favours the final and accepts the lag. The same engine can do either; the product decides which clock it runs on.

Pitfall: feeding the captioner DTX-thinned audio. A real-time pipeline that taps the live, bandwidth-optimised stream inherits its compromises. If the live stream used aggressive DTX or heavy packet loss concealment, the captioner sees audio with invented or missing fragments and produces confident nonsense. Where caption accuracy matters, branch the ASR feed off a cleaner copy — the decoded per-participant track, not the raw forwarded packets.


Where the audio actually goes after it leaves: the compliance branch

There is one more leg of the journey, and it is the one that turns a feature into a liability if you skip it. The moment audio leaves the live call and becomes a stored file, it stops being ephemeral conversation and becomes retained personal data — and a thicket of law now applies that has nothing to do with codecs.

Two facts every team building recording should know. First, consent is not uniform. In the United States, recording-consent law splits by state: a majority allow one-party consent (one person on the call agreeing is enough), but roughly a dozen states require all parties to consent before a call may be recorded. A product that records by default can be lawful in one state and a crime in another. The safe design is to announce and capture consent for everyone, every time. Second, stored voice is regulated data. Under the EU's GDPR a recording is personal data: consent must be freely given and specific, individuals can request a copy or deletion, and the recording must be deleted when it is no longer needed. In healthcare, a recording that captures protected health information must be encrypted in transit and at rest, access-controlled, and retained under the applicable retention rules — six years under HIPAA, for example, against GDPR's "delete when no longer needed."

The architectural consequence is concrete: the recording pipeline is not done when the file is written. It needs encryption at rest, access controls, a retention-and-deletion policy that can honour a deletion request, and an audit trail. Mixed recordings make deletion-of-one-participant impossible by construction, which is a GDPR problem waiting to happen. Per-participant recordings make per-person deletion clean. The legal frame should influence the recording branch you choose, not just bolt on afterwards.
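With per-participant files, a deletion pass can be a plain filter over the recording index. The sketch below assumes a hypothetical index of Recording entries and a single retention period in days; the field names, the six-year figure used in the example, and the to_delete helper are illustrative, and the right retention rules depend on the regime that applies to you.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Recording:
    path: str
    participant_id: str
    recorded_on: date


def to_delete(recordings, today, retention_days, erasure_requests):
    """Paths that are past retention or belong to someone who asked for deletion."""
    cutoff = today - timedelta(days=retention_days)
    return [
        r.path
        for r in recordings
        if r.recorded_on < cutoff or r.participant_id in erasure_requests
    ]


index = [
    Recording("call-1/p1.ogg", "p1", date(2019, 1, 1)),
    Recording("call-9/p2.ogg", "p2", date(2025, 6, 1)),
    Recording("call-9/p1.ogg", "p1", date(2025, 6, 1)),
]
doomed = to_delete(
    index,
    today=date(2026, 1, 1),
    retention_days=6 * 365,          # assumed retention period for the example
    erasure_requests={"p2"},
)
print(doomed)
```

The second entry can be removed without touching the other participant's file from the same call, which is the operation a mixed recording cannot offer.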


Where Fora Soft fits in

We build the products where this pipeline is load-bearing: telemedicine platforms where every consultation may need a compliant, encrypted, per-speaker record; e-learning systems where lectures are recorded and auto-captioned for accessibility; video conferencing and contact-centre tools where transcription and search across recordings are core features; and video surveillance, where audio capture carries its own consent rules. Across those verticals the recurring lesson is the one in this article — decide where audio leaves the call, and whether it stays separate or gets mixed, before you write the recording feature, because the transcription quality, the storage bill, and the compliance posture all follow from that one fork. When we scope a recording or transcription feature, the architecture conversation starts there, not at the codec.


References

  1. W3C, MediaStream Recording, W3C Working Draft (current edition, 2025). The MediaRecorder interface, the timeslice parameter and dataavailable event, the mimeType property, and the codec identifier list including opus and pcm. https://www.w3.org/TR/mediastream-recording/ Primary source for the client-side recording branch. (W3C status: Working Draft, not yet a Recommendation.)
  2. IETF RFC 6716, Definition of the Opus Audio Codec (September 2012), §2 and §5.1. Opus runs its MDCT layer internally at 48 kHz and resamples on decode; basis for the 48 kHz live-call figure. Updated by RFC 8251 (2017). https://www.rfc-editor.org/rfc/rfc6716.html
  3. IETF RFC 8251, Updates to the Opus Audio Codec (October 2017). Cited alongside RFC 6716 as the current Opus normative pair. https://www.rfc-editor.org/rfc/rfc8251.html
  4. Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (Whisper), OpenAI, 2022. The 16 kHz resample, 25 ms / 10 ms framing, 80-channel log-mel spectrogram, 30-second chunking, and (80, 3000) feature shape. Primary source for the ASR audio path. https://cdn.openai.com/papers/whisper.pdf
  5. MDN Web Docs, MediaRecorder and dataavailable event, Mozilla, accessed 2026-06-06. Secondary, browser-behaviour reference for the W3C MediaRecorder spec. https://developer.mozilla.org/en-US/docs/Web/API/MediaRecorder — used only to confirm real-world browser behaviour; the W3C spec is the source of truth.
  6. LiveKit, Egress overview and livekit/egress repository, accessed 2026-06-06. Server-side recording on an SFU: a hidden EGRESS participant subscribes to tracks; room-composite launches headless Chrome and encodes with GStreamer; track-composite combines one audio + one video track. https://docs.livekit.io/transport/media/ingress-egress/egress/ and https://github.com/livekit/egress
  7. BlogGeek.me (Tsahi Levent-Levi), WebRTC recording challenges and solutions and Recording WebRTC Sessions: client side or server side?, accessed 2026-06-06. Vendor-neutral analysis of client vs server, mixed vs per-participant recording trade-offs. https://bloggeek.me/webrtc-recording/
  8. Daily.co, Why recording WebRTC is so hard, accessed 2026-06-06. Production constraints of mixed vs per-track server recording. https://www.daily.co/blog/why-recording-webrtc-is-so-hard-2/
  9. Bredin et al., pyannote.audio and the RTTM annotation format; MahmoudAshraf97/whisper-diarization and m-bain/whisperX repositories, accessed 2026-06-06. Diarization output in RTTM merged with Whisper word timings. https://github.com/m-bain/whisperX
  10. Gladia, How to measure latency in speech-to-text (TTFB, Partials, Finals, RTF) and AssemblyAI, Real-Time Speech to Text / Best API models for real-time speech recognition (2026), accessed 2026-06-06. Partial-vs-final latency benchmarks (200–500 ms partials; ~700 ms finals on short utterances). https://www.gladia.io/blog/measuring-latency-in-stt and https://www.assemblyai.com/blog/real-time-speech-to-text
  11. APXML, Filter Banks and Log-Mel Spectrograms / Comparing MFCCs and Spectrograms, accessed 2026-06-06. Why modern deep-learning ASR prefers log-mel over MFCC (the dropped DCT step). https://apxml.com/courses/applied-speech-recognition/chapter-2-feature-extraction-for-speech/filter-banks-log-mel-spectrograms
  12. Recapmycalls / CloudTalk, Call Recording HIPAA Compliance / Requirements (2026) and general GDPR call-recording guidance, accessed 2026-06-06. One-party vs all-party consent split, GDPR personal-data treatment of recordings, HIPAA encryption and six-year retention. https://recapmycalls.com/call-recording-hipaa-healthcare/ and https://www.cloudtalk.io/blog/hipaa-call-recording-requirements/
