Why this matters
If you ship a product that carries sound — a video conferencing app, a streaming service, a telemedicine platform, an e-learning tool — you change the audio pipeline constantly: a new codec, a different bitrate, a noise-suppression model, a new jitter buffer setting. Every one of those changes can make audio better or worse, and "worse" is the thing your users notice and complain about first. The honest way to know is to ask human listeners, but a proper listening test costs weeks and money, so you cannot run one on every pull request. Objective metrics fill that gap: they let an automated test flag a regression before it reaches a customer. This article is for the product manager or founder who needs to understand what these numbers can and cannot promise, and for the engineer who has to choose one and defend the choice. By the end you will know which metric fits speech, which fits music, which is free, which needs a licence, and the three mistakes that turn a green dashboard into false confidence.
The thing every metric is trying to copy: a Mean Opinion Score
Before any software, there was a room full of people with headphones. The standard way to measure audio quality is still to play clips to a panel of listeners and ask each one to rate the clip on a five-point scale: 5 is excellent, 4 good, 3 fair, 2 poor, 1 bad. You average all the scores for a clip and the result is its Mean Opinion Score, almost always written MOS. The method is defined by the ITU — the International Telecommunication Union, the United Nations body that standardises telecommunications — in its recommendation ITU-T P.800, and the rating procedure is called Absolute Category Rating, or ACR, because each listener rates each clip on its own against the absolute scale rather than comparing two clips.
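The averaging step is simple enough to show directly. The following is a minimal sketch, assuming a hypothetical panel of eight listeners; the function name and the ratings are illustrative and not taken from P.800 itself.

```python
from statistics import mean

# ACR labels: 5 = excellent down to 1 = bad.
ACR_SCALE = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Average a panel's ACR ratings for one clip into a MOS."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        if r not in ACR_SCALE:
            raise ValueError(f"rating {r!r} is not on the 1-5 ACR scale")
    return mean(ratings)


# Hypothetical panel of eight listeners rating a single clip.
panel = [4, 5, 4, 3, 4, 4, 5, 3]
mos = mean_opinion_score(panel)
print(f"MOS = {mos:.2f}")
```

A real test would also report how many listeners contributed and how widely they disagreed, since a MOS from eight people is far less stable than one from eighty.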
A MOS is the ground truth. It is also slow and expensive: you need a quiet room, calibrated headphones, enough listeners that one person's bad day averages out, and hours of their time. A newer recommendation, ITU-T P.808, extends the method to crowdsourcing — paying many remote workers small amounts to rate clips on their own devices — which is faster but adds its own noise from uncontrolled rooms and hardware. Either way, a listening test is something you run a few times a year, not on every code change.
Every objective metric in this article exists to predict a MOS without the room full of people. The metric's whole job is correlation: when the metric says 4.2, the humans should have said about 4.2. So the right question about any metric is never "is the number high?" — it is "how well does this number track what real listeners would say, for the kind of audio and the kind of damage I care about?" Keep that question in mind; it is the thread through everything below.
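One common way to put a number on "how well does this track listeners" is the Pearson correlation between the metric's predictions and the human MOS for the same clips. The sketch below assumes you already have both lists; the six data points are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: human MOS and metric predictions for six clips.
human_mos = [4.4, 3.9, 3.1, 2.5, 4.0, 1.8]
metric_mos = [4.2, 3.7, 3.3, 2.2, 4.1, 2.0]
r = pearson(human_mos, metric_mos)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means the metric ranks and spaces clips much as listeners do; a value near 0 means the number on the dashboard tells you little about what people hear.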
Figure 1. Objective metrics exist to predict the Mean Opinion Score a human panel would have given, fast enough to run automatically.
Full-reference vs no-reference: the most important distinction
There are two fundamentally different ways to score a clip, and choosing the wrong one wastes weeks.
A full-reference metric — sometimes called intrusive — needs two files: the clean original (the reference) and the processed version you want to grade (the degraded signal). It lines them up in time and measures how far the degraded version drifted from the original. PESQ, POLQA, ViSQOL, and PEAQ are all full-reference. They are precise because they know exactly what the audio was supposed to sound like, but they only work when you have the clean original — which you do in a lab or a build pipeline, and do not in a live production call.
A no-reference metric — also called non-intrusive — judges a single clip with no original to compare against, the way a human can tell a phone call sounds bad without ever hearing the studio version. These are newer and almost all built from machine learning. They are the only option for monitoring live traffic, where there is no clean reference to be had. We come back to them at the end, because in 2026 they are where the field is moving.
The rule of thumb: use full-reference metrics in your build pipeline and lab, where you control the source; use no-reference metrics to watch live production. Most teams need both, for different jobs.
Figure 3. The first question is not which metric, but which kind: do you have the clean original to compare against?
PESQ — the telephone workhorse that refuses to die
PESQ stands for Perceptual Evaluation of Speech Quality. It was published by the ITU as recommendation ITU-T P.862 in 2001 and became, for two decades, the default way to grade speech codecs and telephone networks. If you have ever read a paper that scored a voice codec, it almost certainly used PESQ.
PESQ is full-reference. It takes the clean reference and the degraded speech, runs both through a model of the telephone handset and the human ear, lines them up in time, and measures the audible disturbance between them. The raw output is then mapped onto the familiar 1-to-5 MOS scale by a companion recommendation, P.862.1, producing a number called MOS-LQO — Mean Opinion Score, Listening Quality, Objective. The "objective" tag distinguishes a machine-predicted MOS-LQO from a human-rated MOS-LQS (subjective).
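The mapping from raw score to MOS-LQO is a logistic curve. The sketch below uses the narrowband coefficients as they are commonly quoted for P.862.1; treat them as an assumption and check them against the recommendation (or your implementation's source) before relying on the output.

```python
from math import exp


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw narrowband PESQ score to MOS-LQO with a logistic curve.

    Coefficients are the commonly quoted P.862.1 values (assumed here).
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + exp(-1.4945 * raw_score + 4.6607))


# Raw PESQ scores span roughly -0.5 to 4.5.
for raw in (-0.5, 2.0, 3.5, 4.5):
    print(f"raw {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.2f}")
```

With these coefficients the raw range of about −0.5 to 4.5 lands between roughly 1.02 and 4.55, so a mapped PESQ score never reaches a full 5 even for a perfect copy.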
PESQ was built for the narrowband telephone band, roughly 300 to 3,400 Hz, the thin sound of a traditional phone call. A later extension, P.862.2, stretched it to wideband (50 to 7,000 Hz) for HD-Voice calls, but only for 16 kHz files. There is a subtlety here that catches people: in 2018 the ITU issued Corrigendum 2, which admitted the wideband filter coefficients had been wrong and were systematically under-predicting quality by about 0.8 MOS, almost a full point on a five-point scale. Yet because no popular open-source implementation ever shipped the corrigendum, the audio community kept using the uncorrected version. The practical lesson is brutal and specific: two papers that both say "PESQ" can be running different math and are not comparable. Always record the exact version and implementation you used.
Here is the headline fact about PESQ in 2026, and it is a strange one. The ITU withdrew PESQ in 2018 in favour of its successor POLQA, and on 5 January 2024 it formally deleted the entire P.862 family from its catalogue. The standard is, officially, gone. And yet a Google Scholar search for "PESQ" returns over 4,600 papers for 2024 alone, against barely a hundred for POLQA. PESQ survives because it is free, has many open implementations, and is what everyone's old results used — so new work keeps using it to stay comparable. You will meet PESQ for years yet. Just know you are using a deprecated, deleted standard, and treat its numbers accordingly.
Pitfall — optimising directly for PESQ. Because PESQ is cheap and differentiable-ish, teams have trained noise-suppression and speech-enhancement models to maximise the PESQ score itself. This is a textbook case of Goodhart's law: when a measure becomes a target, it stops being a good measure. Researchers have built models that push PESQ up while real listeners rate the output worse — the model learns to please the metric, not the ear. One 2024 study (memorably titled "The PESQetarian") found a system optimised on PESQ correlated with human scores at only about 0.28, near useless. Use a metric to catch regressions, never as the thing your model is trained to win.
POLQA — PESQ's official replacement
POLQA — Perceptual Objective Listening Quality Analysis — is the ITU standard P.863 that was created specifically to replace PESQ. It is also full-reference, also outputs MOS-LQO, and is designed for the same job: grading transmitted speech. But it fixes PESQ's biggest weaknesses.
The first fix is bandwidth. PESQ tops out at wideband; POLQA covers the full range modern audio uses — narrowband (300–3,400 Hz), wideband (50–7,000 Hz), super-wideband (50–14,000 Hz), and even full-band (20–20,000 Hz). For super-wideband and full-band the difference is so small that POLQA scores them as equivalent, but it means POLQA can grade the crisp, wide audio of a modern VoLTE or VoIP call where PESQ simply cannot reach.
The second fix is time-warping, and it matters more than it sounds. Modern codecs and jitter buffers deliberately stretch or compress audio slightly to recover from network hiccups — a process we cover in our piece on the NetEQ jitter buffer. PESQ mistakes that harmless stretching for damage and hands back a pessimistically low score. POLQA tracks the warp and scores it the way a listener actually hears it: a 1-to-5% speed change is inaudible and POLQA correctly reports no quality loss, where PESQ would have penalised it.
The current edition is POLQA v3 (the 2018 revision of P.863), which added full-band analysis and better handling of delay jitter for VoLTE, 5G, and over-the-top calling apps. So if POLQA is better in every technical respect, why does PESQ still dominate? One word: licensing. POLQA is patented and sold under licence by OPTICOM and partners; there is no free reference implementation to drop into a hobby project or an open paper. That single fact explains the whole strange situation — the deprecated free metric out-uses its superior paid replacement by roughly forty to one. If you are a funded product team grading a real voice product, POLQA is usually worth the licence. If you are publishing research or building a side project, you will reach for something free.
ViSQOL — the open metric for speech and music
ViSQOL — Virtual Speech Quality Objective Listener — is Google's open-source full-reference metric, and for many teams it is the practical answer to "I need a free metric that is not a deleted standard." Its current release, ViSQOL v3 (2020), is shipped as a permissively licensed (Apache 2.0) C++ library with a Python binding, which is exactly why it shows up in build pipelines.
ViSQOL works differently from PESQ and POLQA, and the difference is worth understanding because it explains both its strengths and its limits. Instead of modelling a telephone handset, ViSQOL turns both the reference and the degraded clip into a spectrogram — a picture of the sound, with time across and frequency up — and then measures how similar the two pictures are, patch by patch. The similarity measure is called NSIM (Neurogram Similarity Index Measure), and it was adapted directly from SSIM, the structural-similarity index that the image world uses to compare two pictures. So ViSQOL grades audio by treating it as an image-comparison problem. A final fitting step translates the average similarity into a MOS-LQO on the 1-to-5 scale.
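To make the image-comparison idea concrete, here is a simplified SSIM-style similarity for two small spectrogram patches. It is a sketch, not ViSQOL's code: it uses whole-patch statistics rather than a sliding window, the stabilising constants follow the usual SSIM convention, and the patches and the `intensity_range` value are made up.

```python
from math import sqrt


def patch_similarity(ref, deg, intensity_range=1.0):
    """SSIM-style similarity of two equal-sized spectrogram patches."""
    r = [v for row in ref for v in row]
    d = [v for row in deg for v in row]
    if len(r) != len(d) or not r:
        raise ValueError("patches must be non-empty and the same size")
    n = len(r)
    # Small constants keep the ratios stable when values are near zero.
    c1 = 0.01 * intensity_range
    c2 = (0.03 * intensity_range) ** 2
    mu_r, mu_d = sum(r) / n, sum(d) / n
    var_r = sum((v - mu_r) ** 2 for v in r) / n
    var_d = sum((v - mu_d) ** 2 for v in d) / n
    cov = sum((a - mu_r) * (b - mu_d) for a, b in zip(r, d)) / n
    # Intensity term: do the patches have the same average energy?
    intensity = (2 * mu_r * mu_d + c1) / (mu_r ** 2 + mu_d ** 2 + c1)
    # Structure term: do the patches rise and fall in the same places?
    structure = (cov + c2) / (sqrt(var_r * var_d) + c2)
    return intensity * structure


# Hypothetical 3x3 patches (rows = frequency bands, columns = time frames).
reference = [[0.1, 0.5, 0.9], [0.2, 0.6, 0.8], [0.1, 0.4, 0.7]]
time_reversed = [row[::-1] for row in reference]
same = patch_similarity(reference, reference)
scrambled = patch_similarity(reference, time_reversed)
print(f"identical: {same:.2f}, time-reversed: {scrambled:.2f}")
```

The time-reversed patch has exactly the same average energy as the reference, so only the structure term separates the two; that is the sense in which the measure looks at the shape of the picture rather than its brightness.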
ViSQOL has two modes, and picking the wrong one gives nonsense:
- Speech mode expects 16 kHz wideband input, runs voice-activity detection so silence is not scored, and is scaled so a perfect match maps to 5.0.
- Audio mode expects 48 kHz input and is tuned for music and general audio, not just voice. This is the mode that makes ViSQOL special — it is one of the few accessible metrics that grades music sensibly. Google sometimes refers to this music capability as ViSQOLAudio; v3 folded both into one tool.
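Because a mode mismatch fails quietly, a cheap guard before scoring is worth having. The sketch below uses only Python's standard `wave` module to check a file's sample rate against the mode you intend to run; it does not call ViSQOL itself, and the function and file names are hypothetical.

```python
import wave

# Sample rates each ViSQOL mode expects, as listed above.
EXPECTED_RATE_HZ = {"speech": 16_000, "audio": 48_000}


def check_mode(wav_path, mode):
    """Raise if a WAV file's sample rate does not match the chosen mode."""
    if mode not in EXPECTED_RATE_HZ:
        raise ValueError(f"unknown mode {mode!r}")
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
    expected = EXPECTED_RATE_HZ[mode]
    if rate != expected:
        raise ValueError(
            f"{wav_path}: {rate} Hz, but {mode} mode expects {expected} Hz"
        )
    return rate


# Example call (hypothetical file name):
# check_mode("reference.wav", "speech")
```

Run the check on both the reference and the degraded file; resample deliberately if they differ, rather than letting the scorer see mismatched input.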
Because it is similarity-based rather than telephone-modelled, ViSQOL handles time-warping gracefully — Google's own benchmarking found it degrades smoothly under clock drift where PESQ collapses — and it travels beyond the phone-call world into codec evaluation, music streaming, and even as a rough proxy for generative-audio models. Its own documentation is admirably honest about the limits: single scores are noisy, so you aggregate over many clips; it was trained on degraded audio down to about 24 kbps and behaves poorly below that; and it can mislead on use cases far from its training data, like heavy denoising.
Figure 2. PESQ and POLQA model the telephone and the ear; ViSQOL turns sound into a picture and compares the pictures. All three output a predicted MOS on the 1-to-5 scale.
PEAQ and "AAC-Q" — grading music codecs, not phone calls
PESQ, POLQA, and ViSQOL grew out of telephony, where the content is speech. But if your product streams music or high-fidelity audio, you care about a different question — not "can I understand this voice?" but "can a trained listener hear the difference between this compressed file and the original master?" That is the territory of PEAQ.
PEAQ — Perceptual Evaluation of Audio Quality — is defined by ITU-R BS.1387, first published in 1998 and last revised in 2023 (note the ITU-R, the radiocommunication sector, rather than the ITU-T telecom sector that owns PESQ and POLQA). PEAQ was built to grade perceptual audio codecs like MP3 and AAC. It is full-reference, it simulates the masking behaviour of the human ear, and it outputs a number called the Objective Difference Grade, or ODG, on a scale from 0 (imperceptible difference from the original) down to −4 (very annoying difference). PEAQ comes in two flavours: a Basic version fast enough for real-time monitoring, and an Advanced version that is slower but more reliable. It is the closest thing the music world has to a standard codec-quality yardstick, though a 2022 academic review ("Can we still use PEAQ?") found its accuracy has aged against modern codecs and recommended caution.
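An ODG is easier to read with words attached. The helper below is an illustrative sketch: the text above names only the two endpoints, and the intermediate labels are the ones usually quoted for the ITU-R five-grade impairment scale, so treat them as an assumption. Snapping to the nearest integer anchor is a convenience for reporting, not part of PEAQ.

```python
# Impairment descriptions attached to the integer points of the ODG scale.
ODG_LABELS = {
    0: "imperceptible",
    -1: "perceptible but not annoying",
    -2: "slightly annoying",
    -3: "annoying",
    -4: "very annoying",
}


def describe_odg(odg):
    """Return the label of the nearest anchor point for an ODG value."""
    clamped = max(-4.0, min(0.0, odg))
    return ODG_LABELS[round(clamped)]


for value in (-0.3, -1.2, -3.8):
    print(f"ODG {value:+.1f}: {describe_odg(value)}")
```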
A word on the "AAC-Q" in this article's title, because it deserves honesty. There is no single ITU or ISO standard called "AAC-Q." In practice the phrase is used loosely to mean perceptual quality measurement applied to AAC and similar music codecs, which is the job that PEAQ, ViSQOL's audio mode, and proprietary vendor tools (such as Sevana's AQuA) actually do. If a colleague says "run AAC-Q on the new encoder," they mean "score the AAC output against the original with a music-capable perceptual metric." Treat "AAC-Q" as a category, not a product. The real tools are PEAQ/BS.1387, ViSQOLAudio, and licensed analysers; pick by whether you need a standard (PEAQ), something free (ViSQOL audio mode), or vendor support (AQuA and friends).
A worked example: reading the numbers
Numbers on a quality scale only mean something once you have worked through a concrete case. Suppose you are testing a new Opus encoder setting for a video conference product, and you run the same five-second speech clip through three configurations, scoring each with a full-reference metric (POLQA in super-wideband mode). You get:
- Reference (uncompressed): the metric is not run on the reference against itself; it defines the top.
- Config A — Opus at 32 kbps: MOS-LQO 4.3
- Config B — Opus at 16 kbps: MOS-LQO 3.8
- Config C — Opus at 16 kbps with 5% simulated packet loss: MOS-LQO 3.1
How do you read this? Start with the scale: a MOS difference of about 0.2 is roughly the smallest gap that is reliably perceptible to listeners, so the difference between 4.3 and 3.8 (a gap of 0.5) is real and audible — halving the bitrate cost you half a point. The drop from 3.8 to 3.1 under packet loss (a gap of 0.7) is larger still, which tells you your packet-loss concealment, not your bitrate, is the bigger quality lever here — worth reading our piece on packet loss concealment.
Now the arithmetic that keeps you honest. A single clip's score is noisy, so you never decide on one number. Say you run the test over 40 clips and Config B averages 3.80 with a standard deviation of 0.40 across clips. The standard error of the mean is the standard deviation divided by the square root of the sample count:
$$\text{standard error} = \frac{0.40}{\sqrt{40}} = \frac{0.40}{6.32} \approx 0.063$$
So Config B's average is 3.80 ± about 0.13 at roughly 95% confidence (two standard errors). Config A at 4.30 sits far outside that band, so you can trust that A genuinely beats B. If two configs were 3.80 and 3.85, that 0.05 gap would be inside the noise and you would have proven nothing. The metric gives you a number; the statistics tell you whether the number means anything. Aggregate, compute the error, and only believe differences that clear it.
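The same arithmetic in code, as a minimal sketch using the figures from this example (mean 3.80, standard deviation 0.40, 40 clips). The function names are illustrative, and the factor of two standard errors is the rough 95% rule used in the text, not an exact interval.

```python
from math import sqrt


def standard_error(std_dev, n_clips):
    """Standard error of the mean for n_clips independent clip scores."""
    return std_dev / sqrt(n_clips)


def confidence_band(mean_score, std_dev, n_clips, z=2.0):
    """Approximate 95% band: mean plus or minus z standard errors."""
    half_width = z * standard_error(std_dev, n_clips)
    return mean_score - half_width, mean_score + half_width


low, high = confidence_band(3.80, 0.40, 40)
print(f"Config B band: {low:.2f} to {high:.2f}")
print("Config A (4.30) outside band:", not (low <= 4.30 <= high))
print("A 3.85 result outside band:", not (low <= 3.85 <= high))
```

Doubling the number of clips shrinks the band by a factor of about 1.4, not 2, which is why test sets beyond a few dozen clips give diminishing returns.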
A comparison table
The table below is the one-screen summary. ViSQOL v3 is the metric we most often reach for first in a general video-product pipeline that carries both speech and music.
| Criterion | PESQ (P.862) | POLQA (P.863) | ViSQOL v3 | PEAQ (BS.1387) |
|---|---|---|---|---|
| Standards body | ITU-T | ITU-T | Google (open) | ITU-R |
| Status in 2026 | Withdrawn 2018, deleted Jan 2024 | Current | Actively maintained | Current (rev. 2023) |
| Reference type | Full-reference | Full-reference | Full-reference | Full-reference |
| Best for | Speech / telephony | Speech / telephony | Speech and music | Music codecs |
| Frequency range | NB to WB (≤7 kHz) | NB to full-band (≤20 kHz) | WB speech / 48 kHz audio | Full-band |
| Output | MOS-LQO 1–5 | MOS-LQO 1–5 | MOS-LQO 1–5 | ODG 0 to −4 |
| Time-warp robust | No | Yes | Yes | N/A |
| Cost | Free | Licensed (OPTICOM) | Free (Apache 2.0) | Standard; impls vary |
| Drop-in for CI | Yes (Python wrappers) | Needs licence | Yes (C++/Python) | Possible |
How to use a metric in CI without fooling yourself
The point of an objective metric is to catch a regression automatically. Here is the discipline that makes it trustworthy, in order.
First, pick the metric that matches your content. Speech-only product (a calling app, telemedicine): POLQA if you can licence it, ViSQOL speech mode if not. Music or mixed content (a streaming service): ViSQOL audio mode or PEAQ. Using a narrowband speech metric to grade music is a category error that produces confident nonsense — and if you are comparing codecs in the first place, our 2026 audio codec comparison table is the place to start.
Second, build a fixed test set and never let your model see it. Assemble 30–50 reference clips that represent your real content — different speakers, languages, music genres, background-noise levels. Run each candidate build through the pipeline, score every clip against its reference, and store the per-clip scores, not just the average.
Third, decide on a threshold from the statistics, not a guess. Compute the mean and the standard error as in the worked example above. Set your CI gate to fail only when the new build's mean drops below the old build's mean by more than the combined error — i.e., only on differences the metric can actually resolve. A gate that fires on a 0.03 wobble will be muted within a week and is worse than no gate.
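One way to turn that rule into a gate is sketched below. It assumes "combined error" means two times the root-sum-of-squares of the two builds' standard errors, which is one reasonable reading rather than the only one; the score lists are invented.

```python
from math import sqrt
from statistics import mean, stdev


def std_error(scores):
    """Standard error of the mean of per-clip scores."""
    return stdev(scores) / sqrt(len(scores))


def gate_passes(old_scores, new_scores, z=2.0):
    """Fail only when the new mean drops by more than the combined error."""
    drop = mean(old_scores) - mean(new_scores)
    combined = z * sqrt(std_error(old_scores) ** 2 + std_error(new_scores) ** 2)
    return drop <= combined


# Hypothetical per-clip MOS-LQO scores for the current build.
old_build = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
wobble = [s - 0.03 for s in old_build]     # within the noise
regression = [s - 0.50 for s in old_build]  # a real drop

print("0.03 wobble passes:", gate_passes(old_build, wobble))
print("0.50 drop passes:", gate_passes(old_build, regression))
```

Because both builds are scored on the same clips, a paired comparison of per-clip differences would be more sensitive than this unpaired version; the sketch keeps the simpler form to match the worked example.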
Fourth, never train on the metric. This is the Goodhart trap from the PESQ pitfall, and it applies to every metric here. The moment your loss function is the quality score, your model learns to game the score. Train on real objectives; measure with the metric.
Fifth, back the metric with periodic human listening. Objective metrics drift away from human opinion exactly when your pipeline does something novel — a new neural codec, an aggressive denoiser. Run a small P.808 crowdsourced test (covered in our companion piece on subjective testing — MUSHRA, MOS, A/B, ABX) a few times a year to confirm the metric still tracks reality for your content. If the metric and the humans disagree, the humans win.
The 2026 shift: no-reference, machine-learning metrics
Everything above is full-reference: it needs the clean original. That is fine in a lab and useless on a live call, where the original never existed in clean form at the listener's end. The fast-moving part of the field in 2026 is no-reference metrics built from deep learning, which judge a single clip with no original at all.
Two are worth knowing by name. DNSMOS P.835, from Microsoft, was built to grade noise suppressors and outputs three separate scores — SIG (how clean the speech itself sounds), BAK (how well background noise was removed), and OVRL (overall) — which is far more diagnostic than one number, because it tells you whether your denoiser hurt the voice while killing the noise. NISQA is an open deep-learning model that predicts overall speech quality plus four dimensions — noisiness, coloration, discontinuity, and loudness — so a low score comes with a hint about why. Both were trained on large crowdsourced datasets rated under ITU-T P.808, and both run on a single clip with no reference, which is what makes them usable for monitoring real production traffic.
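To show why separate scores are more diagnostic than one, here is a sketch that compares SIG and BAK before and after a noise suppressor. It does not call DNSMOS; the score dictionaries are invented, and the 0.2 threshold is borrowed from the "smallest reliably perceptible gap" rule of thumb earlier in the article.

```python
def diagnose_denoiser(before, after, min_delta=0.2):
    """Interpret SIG and BAK changes across a noise suppressor."""
    sig_change = after["SIG"] - before["SIG"]
    bak_change = after["BAK"] - before["BAK"]
    if bak_change >= min_delta and sig_change <= -min_delta:
        return "noise removed, but the voice was damaged"
    if bak_change >= min_delta:
        return "noise removed, voice preserved"
    if sig_change <= -min_delta:
        return "voice damaged with no noise benefit"
    return "no change the scores can resolve"


# Hypothetical scores for one clip before and after suppression.
before = {"SIG": 3.9, "BAK": 2.4, "OVRL": 2.9}
after = {"SIG": 3.3, "BAK": 4.1, "OVRL": 3.0}
print(diagnose_denoiser(before, after))
```

In this invented case OVRL barely moves, so a single overall score would have hidden the trade: the background got much cleaner while the voice got noticeably worse.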
The trajectory is clear: full-reference metrics like ViSQOL and POLQA stay in the build pipeline where you have the source, while no-reference learned metrics increasingly watch live systems. We go deeper into the learned-model wave in AI in audio for video and neural audio codecs.
Where Fora Soft fits in
We build video conferencing, telemedicine, e-learning, OTT streaming, and surveillance products, and in every one of them audio quality is the thing users judge first — a clinician needs to hear a patient clearly, a student needs every word of a lecture. We use objective metrics the way this article describes: full-reference metrics like ViSQOL in the build pipeline to catch codec and pipeline regressions before they ship, no-reference scores to watch quality on live traffic, and periodic human listening tests to keep the automated numbers honest. They sit alongside the rest of the real-time audio toolkit — see the WebRTC audio pipeline end-to-end and the production audio-problem runbook. The discipline matters more than the metric: a quality gate is only as trustworthy as the statistics and the human checks behind it.
What to read next
- The NetEQ jitter buffer — the brain of WebRTC audio
- Packet loss concealment: hiding the missing frames
- Subjective testing: MUSHRA, MOS, A/B, ABX
CTA
- Talk to a streaming engineer — about wiring objective audio-quality gates into your pipeline.
- See our case studies — real-time and streaming products we have shipped since 2005.
- Download the audio quality metrics cheat sheet — one-page PESQ / POLQA / ViSQOL / PEAQ decision sheet.
References
- ITU-T Recommendation P.800, Methods for subjective determination of transmission quality (1996) — defines MOS and the Absolute Category Rating listening test. ITU-T. https://www.itu.int/rec/T-REC-P.800
- ITU-T Recommendation P.808, Subjective evaluation of speech quality with a crowdsourcing approach (2021) — extends P.800 to crowdsourced ACR testing. ITU-T. https://www.itu.int/rec/T-REC-P.808
- ITU-T Recommendation P.862, Perceptual evaluation of speech quality (PESQ) (2001; Corrigendum 2, 2018) — narrowband speech metric; deleted from the ITU catalogue 5 January 2024. ITU-T. https://www.itu.int/rec/T-REC-P.862 (Standards-body primary source; the standard is withdrawn — cited as historical, with the deletion date verified against the ITU catalogue.)
- ITU-T Recommendation P.863, Perceptual objective listening quality prediction (POLQA) (2018, edition 3): current ITU speech-quality standard; NB/WB/SWB/FB modes. ITU-T. https://www.itu.int/rec/T-REC-P.863 (Standards-body primary source; supersedes P.862.)
- ITU-R Recommendation BS.1387, Method for objective measurements of perceived audio quality (PEAQ) (1998; rev. 2023) — Basic and Advanced versions, ODG output. ITU-R. https://www.itu.int/rec/R-REC-BS.1387 (Standards-body primary source for music-codec quality.)
- M. Chinen, F. Lim, J. Skoglund, N. Gureev, F. O'Gorman, A. Hines, ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric, QoMEX 2020. arXiv:2004.09584. https://arxiv.org/abs/2004.09584 (Primary source from the metric's authors; defines NSIM, audio/speech modes, MOS-LQO mapping.)
- A. Hines, J. Skoglund, A. Kokaram, N. Harte, Robustness of Speech Quality Metrics to Background Noise and Network Degradations: Comparing ViSQOL, PESQ and POLQA, ICASSP 2013. https://research.google.com/pubs/archive/41218.pdf (Benchmarks all three on noise and time-warp; source for the time-warp behaviour claims.)
- M. Torcoli, M. M. Halimeh, E. A. P. Habets, Navigating PESQ: Up-to-Date Versions and Open Implementations, 2025. arXiv:2505.19760. https://arxiv.org/abs/2505.19760 (Fraunhofer/FAU source for PESQ version history, Corrigendum 2 under-prediction (~0.8 MOS), and the 2024 Scholar-count comparison vs POLQA.)
- D. de Oliveira, S. Welker, J. Richter, T. Gerkmann, The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement, INTERSPEECH 2024. (Source for the metric-gaming pitfall and the ~0.28 correlation figure.)
- C. K. A. Reddy, V. Gopal, R. Cutler, DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors, ICASSP 2022. arXiv:2110.01763. https://arxiv.org/abs/2110.01763 (Source for the no-reference SIG/BAK/OVRL model.)
- G. Mittag, B. Naderi, A. Chehadi, S. Möller, NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets, INTERSPEECH 2021. arXiv:2104.09494. https://arxiv.org/abs/2104.09494 (Source for the four no-reference quality dimensions.)
- google/visqol README and source, Apache 2.0. https://github.com/google/visqol (Source for the 48 kHz audio-mode / 16 kHz speech-mode behaviour, score ranges, and usage guidance.)
A note on living standards: ITU-T and ITU-R recommendations are revised on their own schedules; the P.862 deletion date and the P.863/BS.1387 edition years above were verified against the ITU catalogue as of 2026-06-07. PESQ is cited as a withdrawn standard, not a current one.


