Published: 2026-06-04 · Reading time: 14 min read · Author: Nikolay Sapunov, CEO at Fora Soft

Why this matters

If you build, buy, or manage a video product — a conferencing app, a streaming service, a telemedicine platform, an e-learning tool — audio is the part your users complain about first. A frozen video frame is forgivable; a robotic, clipped, or out-of-sync voice is not. To make good decisions about audio, or even to ask your engineers the right questions, you need one mental model: how sound becomes numbers, and what the two dials on that conversion actually control. We wrote this for a smart reader with zero audio background; a senior engineer should still find every fact correct. Every later article — on codecs, loudness, the WebRTC pipeline, lip-sync — assumes you hold this model already.


Start with the thing being recorded: a sound wave

Before anything is digital, there is a physical event. Someone speaks, a string vibrates, a door slams. That event pushes the air, creating a wave of tiny pressure changes that travel to your ear and to a microphone. A microphone is a device that turns those pressure changes into a changing electrical voltage. When the pressure is high, the voltage is high; when the pressure drops, the voltage drops. The result is a smooth, continuously wiggling voltage that mirrors the original sound. Engineers call anything that varies smoothly like this an analog signal — "analog" because the voltage is an analogy of the sound.

A smooth wave has two properties worth naming now, because every later article leans on them. Its amplitude is how tall the wave is at any moment — taller means louder. Its frequency is how fast it wiggles up and down — faster means higher-pitched. A low rumble wiggles maybe 50 times a second; a high whistle wiggles 15,000 times a second. The number of wiggles per second has a unit, the hertz (Hz). Human hearing tops out at roughly 20,000 Hz, written 20 kHz, where the "k" means thousand.

A computer cannot store a smooth, continuous wiggle. A computer stores numbers — discrete, separate values. So the central problem of digital audio is this: how do you turn a smooth, never-ending wiggle into a finite list of whole numbers, accurately enough that the ear cannot tell the difference?

The answer: measure the wave on a clock

The trick is to stop trying to capture the whole wave and instead capture its height at regular, closely spaced moments. Picture a clock that ticks tens of thousands of times a second. On every tick, a small chip called an analog-to-digital converter — ADC for short — looks at the microphone's voltage and asks one question: how tall is the wave right now? It writes down that single number and waits for the next tick.

Each of those measurements is called a sample. A sample is just one number: the height of the wave at one instant. String forty-eight thousand of those numbers together and you have one second of digital audio. This whole method — measure the wave at a steady clock rate and store each measurement as a number — is called pulse-code modulation, or PCM. PCM is the rawest, most direct form of digital audio. Every codec you will read about later (AAC, Opus, MP3) is ultimately a clever way to shrink PCM; the PCM samples are the thing being shrunk.

It helps to picture it as a flip-book. A movie is not continuous motion; it is a stack of still photographs shown fast enough that your eye fills in the gaps. Digital audio is the same idea applied to a sound wave: a stack of still "height readings" taken fast enough that your ear fills in the gaps. The faster you take readings, the smoother the illusion.

Figure 1. A smooth analog wave (left) is measured at every clock tick. Each measurement becomes one sample (a single number), and each number is stored as bits (right).
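To make the clock-tick idea concrete, here is a minimal sketch that "samples" a pure tone in software. It is an illustration rather than how an ADC chip works: the function name `sample_sine`, the 440 Hz tone, the 48 kHz rate, and the half-scale amplitude are all assumptions chosen for the example.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s, bit_depth=16, amplitude=0.5):
    """Return signed-integer PCM samples of a sine tone (hypothetical helper)."""
    max_level = 2 ** (bit_depth - 1) - 1          # 32767 for 16-bit audio
    count = int(sample_rate_hz * duration_s)      # number of clock ticks
    samples = []
    for n in range(count):
        t = n / sample_rate_hz                    # time of the n-th tick, in seconds
        height = amplitude * math.sin(2 * math.pi * freq_hz * t)  # wave height, -1.0..1.0
        samples.append(round(height * max_level)) # store the height as a whole number
    return samples


one_second = sample_sine(440, 48_000, 1.0)
print(len(one_second))   # 48000
print(one_second[:5])    # the first five height readings
```

The list `one_second` is PCM in its plainest form: one whole number per tick, nothing else.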

Dial one: sample rate — how often you measure

The number of times per second the clock ticks is the sample rate, measured in samples per second, which we also write in hertz. A sample rate of 48,000 Hz, or 48 kHz, means the ADC takes 48,000 height readings every second. This is the first of the two dials that decide audio quality.

Why does the rate matter? Because a wave that wiggles fast needs frequent measurements to be captured at all. If you measure a fast wiggle too rarely, your samples miss the peaks and valleys between measurements, and when the audio is played back the original high pitch is gone — or worse, it reappears as a wrong, lower pitch that was never in the room. That false pitch is called aliasing, and it is the signature defect of measuring too slowly.

There is an exact rule for how fast is fast enough. It is the Nyquist–Shannon sampling theorem, and stated in plain terms it says: to capture a sound accurately, you must sample at least twice as fast as the highest frequency in that sound. Walk the arithmetic out loud. Human hearing reaches about 20 kHz. Twice that is:

2 × 20,000 Hz = 40,000 Hz

So a sample rate of at least 40 kHz is needed to capture everything a person can hear. That is why the two rates you will meet most often, 44.1 kHz and 48 kHz, both sit just above 40 kHz rather than below it.
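The rule and its failure mode can be sketched in a few lines. The helpers below are hypothetical names of our own, and the sketch assumes a pure tone with no anti-aliasing filter in front of the sampler; real converters filter first. It shows that a 30 kHz tone sampled at 48 kHz produces exactly the same samples as an 18 kHz tone, which is the false pitch described above.

```python
import math


def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def alias_frequency(tone_hz, sample_rate_hz):
    """Frequency a pure tone appears at after sampling, folded into 0..fs/2."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sampled_cosine(tone_hz, sample_rate_hz, count):
    """Height readings of a cosine tone at each clock tick."""
    return [math.cos(2 * math.pi * tone_hz * n / sample_rate_hz) for n in range(count)]


print(nyquist_limit(48_000))             # 24000.0
print(alias_frequency(15_000, 48_000))   # 15000 (below the limit, captured as-is)
print(alias_frequency(30_000, 48_000))   # 18000 (above the limit, folds down)

# The two tones are indistinguishable once sampled.
too_high = sampled_cosine(30_000, 48_000, 64)
impostor = sampled_cosine(18_000, 48_000, 64)
print(all(abs(a - b) < 1e-9 for a, b in zip(too_high, impostor)))  # True
```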

So why two rates — 44.1 kHz and 48 kHz — and not one? The 44.1 kHz rate is a historical accident. When the compact disc was designed around 1980, audio was stored on video tape recorders, and 44,100 happened to fit the number of usable lines on those machines. CD inherited it, and music has carried 44.1 kHz ever since. Video, by contrast, settled on 48 kHz because it divides cleanly into video frame rates, which keeps audio and picture aligned. The Audio Engineering Society makes this official: its recommended practice AES5-2018 (reaffirmed 2023) names 48 kHz as the preferred sampling frequency for professional audio origination, processing, and interchange, while recognizing 44.1 kHz for consumer music, 32 kHz for some transmission, and 96 kHz for higher-bandwidth work.

For a video product the takeaway is simple: use 48 kHz unless a specific source forces otherwise. The extra samples above the 40 kHz minimum are not wasted — they give the anti-aliasing filter, the circuit that removes too-high frequencies before sampling, a little room to work without damaging the audible range.

Dial two: bit depth — how precisely you measure

The sample rate decides how often you measure. The second dial, bit depth, decides how precisely you write down each measurement. Remember the ADC reads the wave's height and records a number. Bit depth is how many digits — specifically, how many binary digits, or bits — that number is allowed to have.

This matters because the wave's true height is a smooth, infinitely precise value, but the recorded number must land on one of a fixed set of rungs, like a thermometer that can only read whole degrees. Forcing the smooth value onto the nearest rung is called quantization, and the small rounding error it introduces is quantization error — heard, if it is audible at all, as a faint hiss. More rungs means smaller rounding and less hiss.
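A small sketch of quantization follows, assuming heights are expressed on a -1.0 to 1.0 scale and stored as signed integers. The names `quantize` and `dequantize` and the test height are ours, and real converters often add dither, which is left out here. The point it illustrates is that the rounding error is never more than half a rung, and that rungs get smaller as bits are added.

```python
def quantize(height, bit_depth):
    """Snap a height in -1.0..1.0 to the nearest signed-integer rung."""
    max_level = 2 ** (bit_depth - 1) - 1
    min_level = -(2 ** (bit_depth - 1))
    level = round(height * max_level)
    return max(min_level, min(max_level, level))   # clamp anything out of range


def dequantize(level, bit_depth):
    """Turn a stored rung back into a height on the -1.0..1.0 scale."""
    return level / (2 ** (bit_depth - 1) - 1)


height = 0.123456789   # an arbitrary "true" wave height
for bits in (8, 16, 24):
    level = quantize(height, bits)
    error = abs(dequantize(level, bits) - height)
    print(f"{bits:2d} bits: rung {level:>8d}, rounding error {error:.2e}")
```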

How many rungs does a given bit depth provide? Each extra bit doubles the count. Walk it out:

16 bits → 2^16  = 65,536 rungs
24 bits → 2^24  = 16,777,216 rungs

So 16-bit audio places every sample on one of about 65 thousand levels; 24-bit audio uses nearly 17 million. The practical consequence is dynamic range — the gap between the quietest sound a system can capture and the loudest before it distorts. There is a clean formula linking bits to dynamic range in decibels (dB), the standard ratio unit for audio levels:

Dynamic range (dB) ≈ 6.02 × (bit depth) + 1.76
16-bit: 6.02 × 16 + 1.76 ≈ 98 dB   (often quoted as ~96 dB, from the 6 dB-per-bit term alone)
24-bit: 6.02 × 24 + 1.76 ≈ 146 dB  (often quoted as ~144 dB, by the same shortcut)

Each bit buys about 6 dB of range. Sixteen bits already exceed the roughly 90 dB range between a quiet room and a painfully loud one, which is why CDs sound clean. Twenty-four bits add headroom that matters while recording and editing — it lets an engineer set levels conservatively and still keep detail — even though the final delivered file rarely needs all of it.
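The rung counts and the dynamic-range formula are easy to check in code. This sketch uses the textbook form of the formula for an ideal quantizer fed a full-scale sine wave, where 6.02 is 20·log10(2) and 1.76 is 10·log10(1.5); the function names are ours, and real hardware falls short of these theoretical figures.

```python
import math


def levels(bit_depth):
    """Number of rungs available at a given bit depth."""
    return 2 ** bit_depth


def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of an ideal quantizer, in dB."""
    return 20 * math.log10(2 ** bit_depth) + 10 * math.log10(1.5)


for bits in (16, 24):
    print(f"{bits}-bit: {levels(bits):,} rungs, {dynamic_range_db(bits):.1f} dB")
# 16-bit: 65,536 rungs, 98.1 dB
# 24-bit: 16,777,216 rungs, 146.3 dB

# Each extra bit adds the same amount of range, about 6.02 dB.
print(round(dynamic_range_db(17) - dynamic_range_db(16), 2))   # 6.02
```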

A third option you will meet is 32-bit float. Instead of fixed rungs, it stores each sample in a floating-point format, the same kind of flexible number format spreadsheets use for decimals. Its benefit is not more audible detail but forgiveness: a 32-bit-float recording can be pushed far past the normal maximum and pulled back later with no clipping, which is why modern field recorders and editing software use it internally. It is a workflow convenience, not a listening upgrade.
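The "forgiveness" of float can be shown with one sample. The sketch below boosts a loud sample by 4× and then cuts it by the same amount, once on a 16-bit integer path and once on a 32-bit float path. The helper names and the chosen values are assumptions for illustration; `struct` is used only to round Python's 64-bit floats down to 32-bit precision.

```python
import struct

INT16_MAX = 32767
INT16_MIN = -32768


def to_float32(x):
    """Round a Python float to 32-bit float precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def gain_int16(sample, factor):
    """Apply gain on a 16-bit integer path; anything past the limits is clipped."""
    return max(INT16_MIN, min(INT16_MAX, round(sample * factor)))


def gain_float32(sample, factor):
    """Apply gain on a 32-bit float path; values above 1.0 are simply stored."""
    return to_float32(sample * factor)


loud_int = 30_000                       # close to the 16-bit maximum
loud_float = to_float32(30_000 / 32_768)

# Boost by 4x, then pull back by the same amount.
fixed_result = gain_int16(gain_int16(loud_int, 4), 0.25)
float_result = gain_float32(gain_float32(loud_float, 4), 0.25)

print(loud_int, "->", fixed_result)     # 30000 -> 8192 (the peak was clipped)
print(loud_float == float_result)       # True (the float path recovered the value)
```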

A common mistake: "higher numbers always sound better"

The most expensive misconception in audio is that bigger sample rates and bit depths automatically mean better sound. They do not. At 48 kHz and 16 bits you have already captured everything the human ear can detect; a 192 kHz, 24-bit file is six times larger than a 48 kHz, 16-bit one (four times the samples, each one and a half times as wide) and, in blind listening to finished playback, indistinguishable from it. The high numbers earn their place during capture and editing, where headroom protects against mistakes, not during delivery. Choosing 96 kHz for a conferencing app does not improve call quality; it wastes bandwidth and CPU. Match the setting to the job.

Putting both dials together: the size of raw audio

Sample rate and bit depth together, plus the number of channels, determine how many bits per second a raw PCM stream needs — its bitrate. The formula is a plain multiplication:

Bitrate = sample rate × bit depth × channels
CD audio: 44,100 × 16 × 2 = 1,411,200 bits/s ≈ 1,411 kbps

That is CD-quality stereo: about 1.4 million bits every second, or roughly 10 megabytes per minute. A phone call sits at the other extreme. Traditional telephone audio, defined by the ITU-T G.711 recommendation, samples at just 8 kHz with 8 bits per sample on one channel:

Telephone (G.711): 8,000 × 8 × 1 = 64,000 bits/s = 64 kbps

Sixty-four thousand bits a second versus 1.4 million — a 22× difference — and both are PCM. The gap is the whole reason codecs exist. Shipping raw 1,411 kbps stereo to every viewer of a live stream, or every participant in a video call, is wasteful, so codecs compress it down to a fraction of the size while keeping it sounding nearly identical. But the thing they compress is always these PCM samples. Understand the samples and the rest of the section follows.

| Use case | Sample rate | Bit depth | Channels | Raw bitrate |
|---|---|---|---|---|
| Telephone (G.711) | 8 kHz | 8-bit | 1 (mono) | 64 kbps |
| Wideband voice / conferencing | 16 kHz | 16-bit | 1 (mono) | 256 kbps |
| Video production standard | 48 kHz | 24-bit | 2 (stereo) | 2,304 kbps |
| CD music | 44.1 kHz | 16-bit | 2 (stereo) | 1,411 kbps |
| High-resolution master | 96 kHz | 24-bit | 2 (stereo) | 4,608 kbps |

Table 1. Raw PCM bitrates across common audio settings. All figures are sample rate × bit depth × channels.
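The bitrate multiplication can be reproduced directly. This is a minimal sketch with function names of our own; it counts raw sample bits only and ignores any container or network overhead, and it uses decimal megabytes (1 MB = 1,000,000 bytes).

```python
def pcm_bitrate_bps(sample_rate_hz, bit_depth, channels):
    """Raw PCM bitrate in bits per second."""
    return sample_rate_hz * bit_depth * channels


def megabytes_per_minute(bitrate_bps):
    """Storage needed for one minute of audio, in decimal megabytes."""
    return bitrate_bps / 8 * 60 / 1_000_000


SETTINGS = [
    ("Telephone (G.711)", 8_000, 8, 1),
    ("Wideband voice / conferencing", 16_000, 16, 1),
    ("Video production standard", 48_000, 24, 2),
    ("CD music", 44_100, 16, 2),
    ("High-resolution master", 96_000, 24, 2),
]

for name, rate, bits, channels in SETTINGS:
    bps = pcm_bitrate_bps(rate, bits, channels)
    print(f"{name:32s}{bps / 1000:>10,.1f} kbps{megabytes_per_minute(bps):>8.2f} MB/min")
```

Dividing any two results gives the size ratio between settings, for example CD audio against a G.711 phone call.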

Lossless, lossy, and where this article stops

One last pair of words clears up endless confusion. Lossless means the audio is stored or compressed without throwing any samples away, so the original PCM can be reconstructed bit-for-bit. Raw PCM in a WAV file is lossless; so are compressed-but-perfect formats like FLAC. Lossy means the format permanently discards detail the ear is unlikely to notice, in exchange for much smaller files; this is what AAC, Opus, and MP3 do. Neither is "better"; they answer different questions. Archiving a master? Lossless. Streaming to a phone on a train? Lossy. We cover the lossless family in "PCM, WAV, AIFF, FLAC, ALAC: lossless formats explained" and the lossy codecs across Block 2.
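"Bit-for-bit" is something you can test. The sketch below writes a handful of 16-bit samples into an in-memory WAV file with Python's standard `wave` module and reads them back unchanged. For contrast, `crude_lossy` throws away the low 8 bits of each sample. That function is a deliberately crude stand-in of our own to show what "permanently discards" means; it is not how AAC, Opus, or MP3 decide what to drop.

```python
import io
import struct
import wave


def write_wav(samples, sample_rate_hz=48_000):
    """Pack 16-bit mono samples into WAV bytes, held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate_hz)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def read_wav(data):
    """Unpack 16-bit mono samples from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    return list(struct.unpack(f"<{len(frames) // 2}h", frames))


def crude_lossy(samples, keep_bits=8):
    """Zero the low bits of each sample; the discarded detail cannot be recovered."""
    drop = 16 - keep_bits
    return [(s >> drop) << drop for s in samples]


original = [0, 1234, -1234, 32767, -32768]
print(read_wav(write_wav(original)) == original)   # True: lossless round trip
print(crude_lossy(original) == original)           # False: detail is gone for good
```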

This article deliberately stops at the samples themselves. How fast to sample, how many bits, how many channels, how to chunk and carry the result — each gets its own article next.

Where Fora Soft fits in

Every product we build at Fora Soft turns sound into PCM samples at its core, whether the audio then travels live over WebRTC or is packaged for streaming. In video conferencing and telemedicine, we usually run wideband 16 kHz mono capture because voice clarity, not music fidelity, is the goal, and the lower rate leaves CPU and bandwidth for echo cancellation and noise suppression. In OTT and e-learning playback, we deliver 48 kHz stereo or surround to match the picture. Getting the sample rate and bit depth right at capture is the cheapest quality decision in the whole pipeline — and the most common one teams get wrong by reaching for numbers that are higher than the job needs.

References

  1. AES5-2018 (r2023). AES recommended practice for professional digital audio: Preferred sampling frequencies for applications employing pulse-code modulation. Audio Engineering Society. Tier 1 (standard). Names 48 kHz as the preferred professional sampling frequency; recognizes 44.1, 32, and 96 kHz. https://www.aes.org/publications/standards/search.cfm?docID=14 (accessed 2026-06-04).
  2. ITU-T Recommendation G.711 (1988, in force). Pulse code modulation (PCM) of voice frequencies. International Telecommunication Union. Tier 1 (standard). Defines 8 kHz / 8-bit telephony PCM at 64 kbit/s (μ-law and A-law). https://www.itu.int/rec/T-REC-G.711 (accessed 2026-06-04).
  3. Nyquist–Shannon sampling theorem — C. E. Shannon, "Communication in the Presence of Noise," Proceedings of the IRE, 1949. Tier 5 (foundational academic source). The minimum-sampling-rate rule. https://doi.org/10.1109/JRPROC.1949.232969 (accessed 2026-06-04).
  4. Audio bit depth — quantization, dynamic range, and the 6.02 dB-per-bit / +1.76 dB relationship. Cross-checked against Analog Devices technical article "Relationship of Data Word Size to Dynamic Range." Tier 4 (vendor engineering). https://www.analog.com/en/resources/technical-articles/relationship-data-word-size-dynamic-range.html (accessed 2026-06-04).
  5. ITU-R BS.1770-5 (2023). Algorithms to measure audio programme loudness and true-peak audio level. ITU-R. Tier 1 (standard). Cited here only for the dB / loudness vocabulary that follow-on articles build on. https://www.itu.int/rec/R-REC-BS.1770 (accessed 2026-06-04).
  6. Linear PCM (LPCM) — Library of Congress Sustainability of Digital Formats entry on WAVE / PCM. Tier 6 (reference). Confirms LPCM as the standard uncompressed digital-audio representation. https://www.loc.gov/preservation/digital/formats/fdd/fdd000002.shtml (accessed 2026-06-04).
  7. iZotope — "Digital audio basics: sample rate and bit depth." Tier 4 (vendor educational). Used as a competitor reference and for plain-language framing checks, not as a source of standards facts. https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth (accessed 2026-06-04).
  8. CD audio bitrate math — 44,100 × 16 × 2 = 1,411,200 bit/s. Verified arithmetically; corroborated by Omega Recording Studios "What's the Bit Rate?" Tier 6 (reference). https://omegastudios.com/bit-rate/ (accessed 2026-06-04).

Per §4.3.2 source hierarchy: where vendor blogs and the AES5 standard could appear to disagree on "the" sample rate, this article follows AES5 (48 kHz preferred for professional/video) and notes the 44.1 kHz CD heritage as a separate, consumer-music lineage rather than a competing professional standard.