Published: 2026-06-05 · Reading time: 14 min · Author: Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you build video conferencing, telemedicine, contact-center, or any product that can dial a real phone number, your modern audio engine — usually Opus over WebRTC — has to hand the call off to a much older world at the gateway, and that hand-off forces a transcode into one of these speech codecs. This article is for a product manager, founder, or operations lead with no audio background: by the end you will understand what a speech codec is, why these four still dominate telephony in 2026, how their sound quality and bitrate compare, and where each one quietly sets a quality ceiling on calls your users make. Every number here traces back to its controlling standard — the ITU-T Recommendations and 3GPP technical specifications that define each codec — not a secondhand summary.
What a "speech codec" is, and why it differs from a music codec
A codec is an agreed method for compressing sound on one end of a link and reconstructing it on the other. A speech codec is a codec built for one specific signal: the human voice. That narrow target is what makes it different from a general-purpose music codec like AAC or Opus, and the difference is worth understanding before the four codecs make sense.
A music codec has to handle anything (a cymbal crash, a bass drop, a full orchestra), so it keeps a wide range of frequencies and spends a lot of bits doing it. A speech codec assumes the input is a voice, and a voice is a far simpler, more predictable signal: the part that carries intelligibility sits mostly between 300 and 3,400 hertz (the range an old telephone carried), and it is produced by a physical system (vocal cords, throat, mouth) that can be modelled mathematically.
That model is the key trick. Most modern speech codecs use a method called linear predictive coding, often in a refined form named ACELP (Algebraic Code-Excited Linear Prediction). Instead of storing the sound wave itself, the codec builds a tiny mathematical model of the speaker's vocal tract for each short slice of audio and sends only the model's settings plus a thin "excitation" signal to drive it. The receiver rebuilds the same model from those settings, feeds the excitation through it, and reconstructs a voice that sounds close to the original. Because a voice model needs far less data than a raw wave, ACELP codecs carry intelligible speech at remarkably low bitrates, with figures like 4.75 kbit/s that no music codec could approach.
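To make the prediction idea concrete, here is a toy sketch in Python. It is not ACELP and not taken from any codec standard: it only shows that a very regular signal can be predicted from its own recent past, leaving almost nothing to transmit. The 200 Hz test tone, the function name, and the two-coefficient predictor are assumptions chosen for illustration. A pure sine wave obeys x[n] = 2·cos(ω)·x[n−1] − x[n−2], so two "model settings" describe the whole slice.

```python
import math

def predict_residual(samples, coeffs):
    """Predict each sample from the previous ones and return what is left over."""
    order = len(coeffs)
    residual = []
    for n in range(order, len(samples)):
        predicted = sum(c * samples[n - 1 - k] for k, c in enumerate(coeffs))
        residual.append(samples[n] - predicted)
    return residual

SAMPLE_RATE_HZ = 8000
TONE_HZ = 200.0  # hypothetical test tone, standing in for a voiced sound
omega = 2 * math.pi * TONE_HZ / SAMPLE_RATE_HZ

# One 20 ms slice at 8 kHz is 160 samples.
tone = [math.sin(omega * n) for n in range(160)]

# Two coefficients are the entire "model" for a pure sinusoid.
coeffs = [2 * math.cos(omega), -1.0]
residual = predict_residual(tone, coeffs)
max_residual = max(abs(r) for r in residual)
print(f"largest prediction error: {max_residual:.2e}")
```

Real speech is never this tidy, which is why a real codec uses more coefficients, re-estimates them for every slice, and still has to send an excitation signal to cover what the model cannot predict.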
This is why speech codecs and music codecs are separate worlds, explained more fully in how audio compression works. A speech codec trades away the ability to carry music for the ability to carry a voice cheaply and robustly. The four codecs below are the ones a video product will still meet, in roughly the order they arrived.
Figure 1. Four decades of speech coding: each codec pushed audio bandwidth higher while bitrate trended lower. EVS is the first to reach fullband 20 kHz quality on a phone call.
G.711: the codec under almost every phone call
G.711 is the oldest and simplest of the four, and the one you are most certain to meet. The ITU-T — the standards arm of the International Telecommunication Union, the body that governs global telephony — approved G.711 on 15 December 1972, and it has been the bedrock of digital telephone networks ever since (ITU-T G.711).
G.711 does almost no compression. It takes a voice, limits it to the classic telephone band of 300–3,400 Hz, measures it 8,000 times per second — a sample rate of 8 kHz — and stores each measurement in 8 bits, for a constant 64 kbit/s (ITU-T G.711). The sample rate is how many times per second the system measures the sound, explained in sample rate. Run the arithmetic out loud:
8,000 samples per second × 8 bits per sample = 64,000 bits per second
64,000 bits per second ÷ 1,000 = 64 kbit/s
The "compression" in G.711 is one clever step called companding: the encoder squeezes the signal through a logarithmic curve that gives quiet sounds fine steps and loud sounds coarse ones, and the decoder expands it back on the other side, so 8 bits sound like more. There are two flavours: μ-law (mu-law), used in North America and Japan, which encodes 14-bit linear samples into 8 bits; and A-law, used through most of the rest of the world, which encodes 13-bit samples (ITU-T G.711). The two are not interoperable on the wire, which is why a transatlantic call needs a conversion at the boundary. It is a small but real pitfall.
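The effect of companding can be sketched with the textbook continuous μ-law curve. This is an approximation for illustration only: G.711 itself specifies a piecewise-linear, table-style version of the curve plus its own bit layout, neither of which is reproduced here. The function names and the simple 256-level quantizer are assumptions of this sketch.

```python
import math

MU = 255  # the mu value of the textbook mu-law curve

def mu_compress(x):
    """Map a linear sample in [-1, 1] onto the logarithmic companding curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(v):
    """Quantize a value in [-1, 1] to one of 256 codes (0..255)."""
    return min(255, int((v + 1.0) / 2.0 * 256))

def dequantize_8bit(code):
    """Return the midpoint of the quantization step for a code."""
    return (code + 0.5) / 256 * 2.0 - 1.0

def mu_law_roundtrip(x):
    """Compress, store in 8 bits, then expand again."""
    return mu_expand(dequantize_8bit(quantize_8bit(mu_compress(x))))

def uniform_roundtrip(x):
    """Plain 8-bit quantization with no companding, for comparison."""
    return dequantize_8bit(quantize_8bit(x))

quiet = 0.01  # a quiet sample, 1% of full scale
print("mu-law error :", abs(mu_law_roundtrip(quiet) - quiet))
print("uniform error:", abs(uniform_roundtrip(quiet) - quiet))
```

For the quiet sample, the companded round trip lands much closer to the original than plain 8-bit quantization does. The price is coarser steps near full scale, where the ear is less sensitive to the error.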
Because G.711 is everywhere, it is the codec that two systems fall back to when they cannot agree on anything fancier. In a Session Initiation Protocol (SIP) call, the signalling that sets up most VoIP, G.711 carries the fixed RTP payload types 0 for μ-law (PCMU) and 8 for A-law (PCMA). Both are statically assigned in the RTP audio/video profile, alongside a handful of other legacy codecs such as G.722, so any endpoint can identify them by number alone (IETF RFC 3551). RTP is the Real-time Transport Protocol that actually carries the audio packets, covered in RTP timestamps and sender reports. For a product team, the lesson is blunt: if your gateway offers nothing else, the call will run on G.711, at the full 64 kbit/s, with telephone-narrow sound, universally understood.
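The fallback behaviour can be sketched as a small selection function. The payload type numbers 0, 8, and 9 are the static assignments from RFC 3551; the preference order, the function name, and the decision to ignore dynamically numbered codecs are assumptions of this sketch, not how any particular gateway behaves.

```python
# Static RTP payload types from RFC 3551 (number -> encoding name).
STATIC_PAYLOAD_TYPES = {0: "PCMU", 8: "PCMA", 9: "G722"}

# Hypothetical preference order: wideband first, G.711 as the fallback.
PREFERENCE = ["G722", "PCMU", "PCMA"]

def pick_codec(offered_payload_types, preference=PREFERENCE):
    """Return the most preferred codec present in the offer, or None.

    Dynamic payload types (96-127) need an explicit mapping from the
    session description, so this sketch simply skips them.
    """
    offered_names = {
        STATIC_PAYLOAD_TYPES[pt]
        for pt in offered_payload_types
        if pt in STATIC_PAYLOAD_TYPES
    }
    for name in preference:
        if name in offered_names:
            return name
    return None

print(pick_codec([0, 8, 9]))  # far end offers G.711 and G.722
print(pick_codec([0, 8]))     # far end offers G.711 only
```

With a G.711-only offer, the function returns PCMU: the call still connects, just at narrowband quality.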
G.722: the first "HD voice"
Sixteen years later, the ITU-T widened the sound. G.722, approved in November 1988, was the first standard to carry wideband speech — a frequency range of roughly 50–7,000 Hz, double the bandwidth of G.711 (ITU-T G.722). That extra high-frequency range is what makes a voice sound natural rather than "telephone-thin", and it is the reason G.722 became the original HD Voice codec.
G.722 samples the voice at 16 kHz (not 8) and uses a technique called sub-band ADPCM: it splits the signal into a lower and an upper frequency band, then encodes each band with adaptive differential coding — storing the change between samples rather than the samples themselves (ITU-T G.722). It runs in three modes that all fit inside a 64 kbit/s channel: 64, 56, or 48 kbit/s for the audio, with the spare 8 or 16 kbit/s available for a side data channel (ITU-T G.722).
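The "store the change, not the sample" idea can be shown with a toy differential coder. This is deliberately not G.722: it has no sub-band split, no adaptation, and no quantization, so it is lossless and saves nothing by itself. The 200 Hz test tone and the function names are assumptions for illustration; the 14-bit sample range and the 16 kHz sample rate follow the G.722 input format.

```python
import math

def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    previous = 0
    deltas = []
    for s in samples:
        deltas.append(s - previous)
        previous = s
    return deltas

def delta_decode(deltas):
    """Rebuild the samples by accumulating the differences."""
    current = 0
    samples = []
    for d in deltas:
        current += d
        samples.append(current)
    return samples

# Hypothetical test signal: a 200 Hz tone at 16 kHz, scaled to a 14-bit range.
samples = [round(8191 * math.sin(2 * math.pi * 200 * n / 16000)) for n in range(160)]
deltas = delta_encode(samples)

print("largest sample:", max(abs(s) for s in samples))
print("largest delta :", max(abs(d) for d in deltas))
```

The differences span a far smaller range than the samples, so they can be stored in fewer bits. An adaptive coder such as ADPCM goes further by predicting each sample and adjusting its step size as the signal changes.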
Where do you still meet G.722? It is the default wideband codec on most desk-phone (SIP) systems and many enterprise conferencing bridges, because it delivers a clear quality jump over G.711 for almost no extra complexity and no extra bandwidth — it fits in the same 64 kbit/s slot. If your product connects to a corporate phone system or a hardware conference room, the wideband leg of that call is very often G.722.
Common mistake: assuming "wideband" always means "better on this call"
A frequent planning error is to treat a wideband codec as automatically better for every call. It is not, for two reasons. First, both ends and every box in between must support wideband; a single narrowband gateway in the path drags the whole call back to 3.4 kHz, and the wideband codec gains you nothing. Second, wideband only helps if the audio was captured wide in the first place — upsampling a narrowband recording into G.722 adds no real detail. The practical rule: negotiate the best codec the whole path supports, but design your quality expectations around the weakest link in the chain, not the codec name in your own config.
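The weakest-link rule reduces to taking a minimum across the path. In the sketch below, the upper band edges come from the codec descriptions in this article; the dictionary keys (including the "EVS-FB" label for EVS in fullband mode) and the function name are assumptions for illustration.

```python
# Upper edge of the audio band each codec carries, in Hz.
CODEC_BAND_HZ = {
    "G.711": 3400,
    "AMR": 3400,
    "G.722": 7000,
    "AMR-WB": 7000,
    "EVS-FB": 20000,  # hypothetical label for EVS running in fullband mode
}

def effective_band_hz(path_codecs):
    """The audio band that survives a path is set by its narrowest hop."""
    return min(CODEC_BAND_HZ[codec] for codec in path_codecs)

# A wideband desk phone on each end, with a narrowband gateway in the middle.
print(effective_band_hz(["G.722", "G.711", "G.722"]))
```

The example path comes out at 3,400 Hz even though both endpoints negotiated a wideband codec.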
AMR and AMR-WB: the codecs built for bad radio
The phone network is one thing; a mobile network is harder. Radio fades, hands move, signal drops — and a codec for cellular voice has to survive all of it. That is the job AMR was built for.
AMR — Adaptive Multi-Rate — is the speech codec of 2G and 3G mobile networks, specified by the 3GPP (the body that standardizes mobile networks) in 3GPP TS 26.071. Its defining feature is in the name: it is adaptive, meaning it can change its bitrate on the fly, every 20 milliseconds, to match how good the radio link is at that instant (3GPP TS 26.071). When the signal is strong it spends more bits for better sound; when the signal weakens it drops to a lower bitrate so the call survives instead of breaking up.
AMR is narrowband (8 kHz sampling, the same 3.4 kHz voice band as G.711) and offers eight bitrate modes: 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, and 12.2 kbit/s (3GPP TS 26.071). It uses the ACELP voice-model coding described earlier, which is how it fits an intelligible call into as little as 4.75 kbit/s — roughly one-thirteenth of G.711's data.
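The mode list turns into concrete frame sizes with one multiplication, and the adaptation idea can be sketched as a lookup. The bitrates and the 20 ms frame come from the specification cited above. The `select_mode` function and its 0-to-1 link-quality score are hypothetical: real networks apply their own link-adaptation rules, which this sketch does not attempt to reproduce.

```python
AMR_MODES_KBPS = [4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2]
FRAME_MS = 20
G711_KBPS = 64

def bits_per_frame(kbps):
    """kbit/s multiplied by milliseconds gives bits per frame."""
    return round(kbps * FRAME_MS)

def select_mode(link_quality):
    """Map a hypothetical link-quality score (0.0 worst, 1.0 best) to a mode."""
    clamped = min(1.0, max(0.0, link_quality))
    index = round(clamped * (len(AMR_MODES_KBPS) - 1))
    return AMR_MODES_KBPS[index]

for kbps in AMR_MODES_KBPS:
    print(f"{kbps:>5} kbit/s -> {bits_per_frame(kbps)} bits per 20 ms frame")

print("G.711 vs lowest AMR mode:", round(G711_KBPS / AMR_MODES_KBPS[0], 1))
```

At the lowest mode, each 20 ms of speech fits in 95 bits, against 1,280 bits for the same 20 ms of G.711.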
AMR-WB is the wideband upgrade, and it carries a second name worth knowing: the ITU-T adopted the identical codec as G.722.2, so "AMR-WB" and "G.722.2" are the same thing (3GPP TS 26.190; ITU-T G.722.2). Developed by Nokia and VoiceAge, AMR-WB widens the sound back to 50–7,000 Hz like G.722 but does it far more efficiently, with nine modes from 6.60 to 23.85 kbit/s (3GPP TS 26.190). AMR-WB is the codec behind most "HD Voice" branding on 3G and 4G mobile calls. For a product team bridging into mobile, AMR-WB is the codec your wideband mobile leg most likely runs on whenever the network or the handset does not support EVS.
Figure 2. AMR and AMR-WB scale their bitrate to match the radio link. The codec steps up or down between these modes every 20 ms — more bits when the signal is good, fewer when it fades.
EVS: the modern VoLTE codec that finally caught up to music
The newest of the four is the one inside modern mobile calls. EVS — Enhanced Voice Services — was standardized by 3GPP in 2014 (3GPP TS 26.441) as the voice codec for VoLTE (Voice over LTE) and now VoNR (Voice over New Radio, on 5G). It is the codec your phone uses for an "HD+" call on a modern network.
EVS does three things its predecessors could not, all at once. First, it spans every bandwidth in one codec: narrowband, wideband, super-wideband (up to 14 kHz), and fullband (up to 20 kHz) — and 20 kHz is the full range of human hearing, the same as a music track (3GPP TS 26.441). A voice call on EVS can sound like the person is in the room, not on a phone. Second, it covers a wide bitrate range — its super-wideband modes alone run 9.6, 13.2, 16.4, 24.4, 32, 48, 64, 96, and 128 kbit/s — and at the low end it matches or beats AMR-WB's quality at the same rate (3GPP TS 26.441; Nokia). Third, it is built to survive packet loss with a channel-aware mode and strong packet-loss concealment, the techniques that hide missing audio when the network drops data, covered in packet loss concealment.
One detail makes EVS painless to deploy: it includes an AMR-WB interoperable mode, so an EVS phone can talk to an AMR-WB phone without a quality-wrecking transcode (3GPP TS 26.441). That backward compatibility is why carriers could roll EVS out gradually. For a product bridging into a 2026 mobile network, EVS is the best-sounding leg you can hope to reach — and the reason a modern mobile call can sound dramatically better than the same call a decade ago.
The four compared
| Codec | Year | Standard | Bandwidth | Sample rate | Bitrate(s) | Where you meet it |
|---|---|---|---|---|---|---|
| G.711 | 1972 | ITU-T G.711 | Narrowband (3.4 kHz) | 8 kHz | 64 kbit/s (fixed) | VoIP / SIP fallback, PSTN |
| G.722 | 1988 | ITU-T G.722 | Wideband (7 kHz) | 16 kHz | 48 / 56 / 64 kbit/s | Desk phones, conf. bridges |
| AMR | 1999 | 3GPP TS 26.071 | Narrowband (3.4 kHz) | 8 kHz | 4.75–12.2 kbit/s (8 modes) | 2G / 3G mobile voice |
| AMR-WB (G.722.2) | 2001 | 3GPP TS 26.190 | Wideband (7 kHz) | 16 kHz | 6.60–23.85 kbit/s (9 modes) | 3G / 4G HD Voice |
| EVS | 2014 | 3GPP TS 26.441 | NB → Fullband (20 kHz) | 8–48 kHz | 5.9–128 kbit/s | VoLTE / VoNR, modern mobile |
Table 1. The four speech-codec families a video product still meets in 2026. G.711 is the universal floor; G.722 the wideband desk-phone default; AMR / AMR-WB the cellular legacy; EVS the modern VoLTE codec. Sources: ITU-T G.711, G.722, G.722.2; 3GPP TS 26.071, TS 26.190, TS 26.441.
Why this still matters when your product runs Opus
Your product almost certainly does not encode voice in any of these codecs natively. A modern real-time stack uses Opus over WebRTC, covered in Opus, the open codec that ate WebRTC, because Opus generally sounds as good as or better than the older codecs here at a comparable bitrate. So why care?
Because the moment your call leaves the internet and touches the phone network — a dial-out to a mobile, a customer joining by phone, a bridge into a legacy PBX — a gateway transcodes Opus into whichever speech codec that network speaks. Transcoding means decoding the audio fully and re-encoding it in the other codec, and every transcode costs a little quality and adds a little delay. The codec on the far side becomes the ceiling: if the call lands on G.711, no amount of Opus quality on your side makes it sound better than narrowband telephone. Knowing which codec each path uses tells you exactly where your audio quality is being capped, and whether a wideband-capable gateway (G.722, AMR-WB, EVS) is worth pursuing for a given route.
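One way to audit a route is to list the codec on each leg and count the boundaries where it changes. The sketch below makes a simplifying assumption: two adjacent legs with the same codec name pass audio through untouched, and any change of codec means a full decode and re-encode. The function name and the example routes are hypothetical.

```python
def count_transcodes(leg_codecs):
    """Count the boundaries where adjacent legs use different codecs."""
    return sum(1 for a, b in zip(leg_codecs, leg_codecs[1:]) if a != b)

# Browser leg in Opus, then a SIP trunk that only speaks G.711.
direct_route = ["Opus", "G.711"]

# The same call routed through an extra wideband hop before the trunk.
long_route = ["Opus", "G.722", "G.711"]

print(count_transcodes(direct_route))
print(count_transcodes(long_route))
```

Both example routes end on G.711, so both are capped at narrowband. The longer one pays for an extra transcode without gaining anything, which is the kind of route worth simplifying.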
Where Fora Soft fits in
We build video conferencing, telemedicine, e-learning, contact-center, and OTT products, and the phone-network bridge shows up in most of them — a patient who joins a telemedicine consult by ordinary phone, a participant dialing into a conference, a contact-center agent on a SIP trunk. The practical engineering is in the hand-off: choosing gateways that negotiate the widest codec a route supports, minimizing the number of transcodes between Opus and the speech codec, measuring the real mouth-to-ear quality that survives the gateway, and keeping a clean G.711 fallback so a call never fails for lack of a shared codec. We have made those telephony-interop trade-offs across conferencing and telemedicine builds since 2005.
What to read next
- How Audio Compression Works: The Four Ideas Behind Every Modern Codec
- Opus: The Open Codec That Ate WebRTC
- Packet Loss Concealment (PLC): Hiding the Missing Frames
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your speech codec plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the speech codecs cheat sheet: one page covering G.711 vs G.722 vs AMR vs EVS, the bitrate-and-bandwidth numbers, where each codec shows up, and the transcoding-ceiling pitfall when bridging into the phone network.
References
- ITU-T Recommendation G.711, Pulse code modulation (PCM) of voice frequencies (November 1988 edition; codec originally approved 1972). The controlling specification for G.711: 8 kHz sampling, 8-bit samples, 64 kbit/s, the 300–3,400 Hz telephone band, and the μ-law / A-law companding laws (μ-law from 14-bit linear, A-law from 13-bit). Read from the ITU-T publication page. https://www.itu.int/rec/T-REC-G.711/
- ITU-T Recommendation G.722, 7 kHz audio-coding within 64 kbit/s (current edition 09/2012; originally approved November 1988). The controlling specification for G.722: 16 kHz sampling, 14-bit uniform PCM input, sub-band ADPCM, the 50–7,000 Hz wideband range, and the 64 / 56 / 48 kbit/s operating modes with auxiliary data channel. Read from the ITU-T summary page. https://www.itu.int/dms_pubrec/itu-t/rec/g/T-REC-G.722-201209-I!!SUM-HTM-E.htm
- 3GPP TS 26.071, Mandatory speech CODEC speech processing functions; AMR speech CODEC; General description. The controlling specification for narrowband AMR: eight source rates from 4.75 to 12.2 kbit/s, 8 kHz sampling, 20 ms frames, MR-ACELP coding, and bitrate switching every 20 ms. https://www.3gpp.org/dynareport/26071.htm
- 3GPP TS 26.190, Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions. The controlling specification for AMR-WB: nine modes from 6.60 to 23.85 kbit/s, 16 kHz sampling, 50–7,000 Hz wideband, ACELP. https://www.3gpp.org/dynareport/26190.htm
- ITU-T Recommendation G.722.2, Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB) (2003). Confirms that ITU-T G.722.2 and 3GPP AMR-WB are the identical codec; cited to establish the dual naming. Where the 3GPP and ITU-T texts describe the same codec, the 3GPP TS is treated as the primary owner for mobile context. https://www.itu.int/rec/T-REC-G.722.2/
- 3GPP TS 26.441, Codec for Enhanced Voice Services (EVS); General overview. The controlling specification for EVS: narrowband / wideband / super-wideband (14 kHz) / fullband (20 kHz) operation, the super-wideband bitrate set (9.6–128 kbit/s), channel-aware mode, and the AMR-WB interoperable mode. https://www.3gpp.org/dynareport/26441.htm
- IETF RFC 3551, RTP Profile for Audio and Video Conferences with Minimal Control (H. Schulzrinne, S. Casner, July 2003). Source for the statically assigned RTP payload types: PT 0 = PCMU (G.711 μ-law), PT 8 = PCMA (G.711 A-law); also the static type for G.722. Used for the SIP / RTP fallback discussion. https://www.rfc-editor.org/rfc/rfc3551.html
- Nokia, The 3GPP Enhanced Voice Services (EVS) codec (white paper). First-party deployer source for EVS quality positioning: at low bitrates EVS matches or beats AMR-WB; at higher bitrates it delivers near-transparent fullband audio; channel-aware coding and improved packet-loss concealment. Used only for quality framing; all numeric codec facts come from 3GPP TS 26.441, which overrides any vendor figure where they differ. https://www.nokia.com/
- 3GPP, EVS Codec — Enhanced Voice Services Codec for LTE (3GPP news / project page). Background on why EVS was developed for VoLTE, its mandatory status for super-wideband VoLTE, and the rollout context. Deployment framing only; numeric facts from TS 26.441. https://www.3gpp.org/news-events/3gpp-news/evs-news
- IETF RFC 4867, RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs (J. Sjoberg et al., April 2007). Source for how AMR and AMR-WB are packetized over RTP, used to support the gateway / transport discussion; the codec definitions themselves come from 3GPP TS 26.071 and TS 26.190. https://www.rfc-editor.org/rfc/rfc4867.html


