Published 2026-05-31 · 21 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Almost every useful thing you can do with a multi-speaker recording depends on knowing who spoke. A meeting summary that says "Maria committed to the deadline, not Tom" needs speaker labels. A telemedicine record that separates the doctor's voice from the patient's needs them. A call-centre quality system that measures how long the agent talked versus the customer needs them. Subtitles that name the speaker need them. Pyannote is the most widely used open-source way to produce these labels — its models are downloaded over ten million times a month — and it is the engine that tools like WhisperX call to add speakers to a transcript. This article is for the product manager, founder, or engineering lead deciding whether to self-host pyannote or buy a hosted service. By the end you will understand the pipeline's three stages, the specific ways each one fails, what the accuracy numbers mean for your product, and the cost and compliance trade-offs against a paid API.

What Diarization Is — And What It Is Not

Start with the word itself. Diarization means dividing a recording into segments by speaker and giving each segment an anonymous label like "Speaker A" or "Speaker B". The word comes from "diary" — the tool keeps a diary of who was talking when. It does not put real names to voices, and it does not turn speech into text. Those are two different jobs that people constantly confuse with diarization, so it is worth separating them clearly before going further.

The first job people confuse it with is transcription — turning speech into the words that were said. That is the job of automatic speech recognition, abbreviated ASR, which we covered in the streaming ASR lesson. A transcriber answers "what words were spoken?" A diarizer answers "how many people spoke, and when did each one talk?" The two are complementary: you run them both and join the results, which is exactly what the WhisperX lesson showed — WhisperX times each word, pyannote labels each speaker, and a final step stitches a speaker onto every word.

The second job people confuse it with is speaker recognition (or speaker identification) — matching a voice to a known person, the way a phone unlocks to your voice. Diarization does no such thing. It only knows that "the voice in these segments is the same voice, and it is different from that other voice". Whether "Speaker A" is Maria is something your application decides later, usually by asking a user or matching against an enrolled voice print. Keeping this distinction straight saves a great deal of confusion in product planning.

So diarization sits in the middle: more than voice activity detection (which only knows speech-versus-silence), less than speaker recognition (which knows names). It is the layer that makes a wall of transcript into a readable conversation.

[Image: diagram positioning diarization between voice activity detection and speaker recognition, showing the same audio clip processed three ways: VAD marks speech versus silence, diarization splits it into anonymous Speaker A and Speaker B segments, and speaker recognition attaches real names.]
Figure 1. Diarization answers "who spoke when" with anonymous labels: more than speech detection, less than naming a known person.

Why Pyannote, And Who Builds It

There are several ways to do diarization, but in the open-source world one toolkit dominates. Pyannote.audio is an open-source Python toolkit for speaker diarization, built by Hervé Bredin and collaborators at CNRS/IRIT in Toulouse, France, and released under the permissive MIT license (Bredin, 2023). Its pretrained models are downloaded more than ten million times a month from the Hugging Face model hub, and it has won or placed in the major diarization challenges — first place at Ego4D 2022 and Albayzin 2022 among them (Bredin, 2023). When a tool needs diarization and the team does not want to send audio to a cloud vendor, pyannote is almost always the answer.

A word on naming, because it confuses newcomers. "Pyannote" is the software library. Inside it, the recommended ready-to-use recipe is a pipeline with a version number; the long-standing production default is speaker-diarization-3.1. In late 2025 the project shipped pyannote.audio 4.0 with a new open model, community-1, and a separate company, pyannoteAI, now sells a faster, more accurate hosted model called precision-2 (pyannoteAI, 2025). We will treat 3.1 as the well-understood baseline most teams still run, and explain the newer options where they matter.

The Three-Stage Pipeline, In Plain Language

The cleanest way to understand pyannote is as three models run in sequence, plus a final assembly step. Picture a production line. The raw audio enters on the left, passes through three stations, and a timeline of "who spoke when" comes out on the right. Each station does one job, and — this is the important part — each one can fail in its own specific way, so it pays to understand all three.

The first station is segmentation: a model that listens to short slices of the audio and, for each slice, marks where speech is, where it stops, and crucially where two voices overlap. The second station is embedding: a model that takes each detected voice and turns it into a string of numbers — a kind of numerical fingerprint of how that voice sounds. The third station is clustering: an algorithm that compares all those fingerprints and groups the matching ones together, so every slice of the same voice ends up under the same label. A final assembly step turns those groups back into a clean timeline. Let us walk each one.

[Image: left-to-right pipeline diagram showing audio entering and passing through three pyannote stations, namely segmentation (sliding 10-second window, finds speech and overlap), embedding (turns each voice into a numerical fingerprint), and agglomerative clustering (groups matching fingerprints), then a final aggregation step producing a who-spoke-when timeline with Speaker A and Speaker B.]
Figure 2. Pyannote is three models in a row: segmentation finds the voices, embedding fingerprints them, clustering groups them, and then aggregation builds the timeline.

Stage One — Segmentation: Finding The Voices, Even When They Overlap

The pipeline does not look at the whole recording at once. Instead it slides a short window across the audio — in the 3.1 pipeline, a window of about ten seconds that steps forward in small hops — and applies a neural segmentation model to each window (Bredin, 2023; pyannote segmentation-3.0, 2026). Think of it like reading a long document through a small magnifying glass moved a little at a time, rather than trying to take in the whole page in one glance. Working on short windows keeps the problem small and easy: at most a handful of people talk in any ten-second slice, even if dozens appear across a two-hour recording.
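To make the sliding window concrete, here is a minimal sketch of how the window start times could be laid out. The ten-second window comes from the text above; the one-second hop and the function name are assumptions for illustration only, not values read from pyannote.

```python
def window_starts(duration_s, window_s=10.0, hop_s=1.0):
    """Start times (seconds) of sliding windows covering the whole recording.

    hop_s=1.0 is an assumed hop for illustration; the article only says "small hops".
    """
    if duration_s <= window_s:
        return [0.0]  # a short file fits in a single window
    starts = []
    step = 0
    while step * hop_s + window_s < duration_s:
        starts.append(step * hop_s)
        step += 1
    starts.append(duration_s - window_s)  # last window is aligned with the end
    return starts


starts = window_starts(duration_s=30.0)
print(len(starts), "windows; first at", starts[0], "s, last at", starts[-1], "s")
```

Because consecutive windows overlap heavily, every moment of audio is seen several times, which is what lets the later stages reconcile the per-window decisions.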

For each window, the model outputs, moment by moment, which speakers are active. The clever part is how it represents overlapping speech. Older diarization systems treated each speaker as an independent yes/no switch — speaker 1 on or off, speaker 2 on or off — which made overlapping voices a fragile special case. Pyannote's modern segmentation model, introduced in 2023, instead uses a scheme called powerset classification: it has a dedicated category not just for "speaker 1" and "speaker 2" but for "speakers 1 and 2 talking at once" (Plaquet & Bredin, 2023). Concretely, the segmentation-3.0 model reads ten seconds of audio and classifies every frame into one of seven categories: non-speech, speaker 1 only, speaker 2 only, speaker 3 only, speakers 1+2, speakers 1+3, or speakers 2+3 (pyannote segmentation-3.0, 2026). It can see up to three speakers in a window and up to two talking simultaneously.
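The seven categories are simply every set of at most two speakers drawn from three local speaker slots. A short sketch of that enumeration, using only the standard library (the function name is illustrative and is not part of pyannote):

```python
from itertools import combinations


def powerset_classes(max_speakers=3, max_simultaneous=2):
    """List every set of at most `max_simultaneous` speakers out of `max_speakers`."""
    classes = [()]  # the empty set stands for non-speech
    for size in range(1, max_simultaneous + 1):
        classes.extend(combinations(range(1, max_speakers + 1), size))
    return classes


classes = powerset_classes()
for index, speakers in enumerate(classes):
    label = "non-speech" if not speakers else "+".join(f"speaker {s}" for s in speakers)
    print(index, label)
```

The class count grows quickly with the limits (four speakers with two simultaneous would already need eleven classes), which is one plausible reason to keep the per-window limits small.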

This design matters for a practical reason: overlapping speech is where diarization usually breaks, and the powerset scheme handles it directly rather than as an afterthought. The 2023 paper that introduced it reports better accuracy on overlapping speech and more robustness when the audio differs from the training data, while also removing a fiddly tuning knob that the old yes/no scheme required (Plaquet & Bredin, 2023). For a video team, that translates to better speaker labels on the hardest material — meetings and panel discussions where people talk over each other.

One subtlety to keep in mind: because each window is processed on its own, the "speaker 1" in one window is not necessarily the "speaker 1" in the next window. The model is consistent within a window but not across windows. Fixing that — making sure the same human gets the same label everywhere — is the job of the next two stages.

Stage Two — Embedding: Turning A Voice Into A Fingerprint

Once segmentation has found the individual voices in each window, the pipeline needs a way to tell whether the voice in window five is the same person as the voice in window forty. It does this by turning each voice into a speaker embedding — a list of a few hundred numbers that captures the distinctive qualities of how that voice sounds: its pitch, timbre, and vocal-tract characteristics. Two recordings of the same person produce embeddings that are close together in number-space; two different people produce embeddings that are far apart. The embedding is, in effect, a numerical fingerprint of a voice.
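"Close together" needs a distance measure. The sketch below uses cosine distance, a common choice for speaker embeddings and an assumption here rather than something the text above specifies. The four-number vectors are toy data standing in for real embeddings of a few hundred numbers.

```python
import math


def cosine_distance(a, b):
    """0.0 for vectors pointing the same way, up to 2.0 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy fingerprints: the same voice in two windows, and a different voice.
voice_a_window_5 = [0.9, 0.1, 0.3, 0.2]
voice_a_window_40 = [0.8, 0.2, 0.3, 0.1]
voice_b_window_12 = [0.1, 0.9, 0.2, 0.7]

same_voice = cosine_distance(voice_a_window_5, voice_a_window_40)
different_voice = cosine_distance(voice_a_window_5, voice_b_window_12)
print(f"same voice: {same_voice:.3f}, different voice: {different_voice:.3f}")
```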

Pyannote computes one embedding per speaker per window, and it does something careful here that improves quality. To fingerprint a given speaker, it uses only the audio where that speaker is talking alone — it deliberately skips the moments where someone else is talking over them (Bredin, 2023). This matters because a fingerprint taken from a muddy mixture of two voices would be unreliable; by using clean, single-speaker audio, the embeddings are sharper and easier to group correctly in the next stage. This is a direct payoff of stage one knowing exactly where the overlaps are.
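The overlap-skipping step can be pictured as a simple frame filter. This is a simplified sketch with toy data, assuming the segmentation output has already been turned into a set of active speakers per frame; the function name is hypothetical.

```python
def solo_frames(activity, speaker):
    """Indices of frames where `speaker` is active and nobody else is."""
    return [i for i, active in enumerate(activity) if active == {speaker}]


# One set of active local speakers per frame (toy data for a single window).
activity = [set(), {1}, {1}, {1, 2}, {1, 2}, {2}, {2}, set()]

clean_for_speaker_1 = solo_frames(activity, 1)
clean_for_speaker_2 = solo_frames(activity, 2)
print(clean_for_speaker_1, clean_for_speaker_2)
```

Only the audio behind those frames would be fed to the embedding model, so the two overlapped frames contribute to neither fingerprint.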

The embedding model itself has evolved. The 2021–2022 pipelines used a model called ECAPA-TDNN from the SpeechBrain toolkit (Bredin, 2023). The 3.x pipelines moved to a ResNet-based embedding from the WeSpeaker project, and the 2025 community-1 model improves speaker assignment further still (pyannoteAI, 2025). You do not need to memorise the model names. The point to carry away is that this stage's only job is to answer "is this the same voice?" reliably, and that everything downstream depends on it getting the answer right.

Stage Three — Clustering: Grouping The Fingerprints

Now the pipeline has a pile of fingerprints — many per recording, scattered across all the windows — and it needs to decide which ones belong to the same person. This is clustering: grouping similar items together without being told in advance how many groups there are. Pyannote uses a method called agglomerative hierarchical clustering (Bredin, 2023).

The method is intuitive. Imagine laying every fingerprint out on a table and repeatedly joining the two closest ones into a pair, then joining the closest pairs, and so on — building clusters from the bottom up, merging the nearest neighbours each round. The process stops when the nearest two clusters are further apart than a set distance — a threshold the pipeline calls δ (delta). Everything merged below that threshold is treated as one speaker; anything beyond it is treated as a different speaker. The number of speakers is not assumed up front; it falls out of where the merging stops. That is why pyannote can handle a recording without being told how many people are in it.
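The bottom-up merging can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not pyannote's implementation: it compares cluster centroids with cosine distance, uses two-number toy fingerprints, and the threshold values are made up to show the behaviour.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def agglomerative(embeddings, delta):
    """Merge the two closest clusters until the nearest pair is further apart than delta.

    Returns one cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([embeddings[k] for k in clusters[i]])
                cj = centroid([embeddings[k] for k in clusters[j]])
                d = cosine_distance(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > delta:
            break  # nearest clusters are too far apart: stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [None] * len(embeddings)
    for label, members in enumerate(clusters):
        for k in members:
            labels[k] = label
    return labels


# Toy fingerprints: two from one voice, two from another.
embeddings = [[1.0, 0.05], [0.95, 0.1], [0.05, 1.0], [0.1, 0.95]]

balanced = agglomerative(embeddings, delta=0.5)
too_tight = agglomerative(embeddings, delta=0.0005)
too_loose = agglomerative(embeddings, delta=0.9)
print(balanced, too_tight, too_loose)
```

With the middle threshold the four fingerprints fall into two speakers; the tight value leaves four separate "speakers" and the loose value merges everyone into one.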

This threshold is the single most consequential dial in the whole system, and it cuts both ways. Set it too loose and the pipeline merges two similar-sounding people into one label. Set it too tight and it splits one person who changed tone — got excited, dropped to a whisper — into two phantom speakers. We will return to this when we discuss tuning, because it is where most real-world accuracy is won or lost.

The Final Step — Building The Timeline

The three stages leave the pipeline with two things: a per-window map of who was active, and a global label for each fingerprint. The last step stitches these together into a single clean timeline. It counts, for each instant, how many speakers are active, picks the best-matching global label for each, converts the frame-by-frame decisions back into start-and-end times, and fills in very short within-speaker gaps so a single sentence does not get chopped into fragments by a brief breath (Bredin, 2023). The output is a list of segments (Speaker A from 0.0 to 4.2 seconds, Speaker B from 4.0 to 7.8 seconds, and so on), typically written in a standard text format called RTTM that downstream tools can read.
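The last two operations, turning frames into segments and bridging short gaps, are easy to sketch for a single speaker. The frame length, the gap limit, the file name, and the helper names below are all illustrative assumptions; the RTTM line follows the usual ten-field SPEAKER layout.

```python
def frames_to_segments(active, frame_s):
    """Turn a per-frame on/off list for one speaker into (start, end) pairs in seconds."""
    segments, start = [], None
    for i, on in enumerate(active + [False]):  # trailing False closes an open segment
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start * frame_s, i * frame_s))
            start = None
    return segments


def fill_gaps(segments, max_gap_s):
    """Bridge pauses no longer than `max_gap_s` between one speaker's segments."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap_s:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def to_rttm(file_id, speaker, segments):
    """One RTTM line per segment: file, channel, start, duration, speaker label."""
    return [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {speaker} <NA> <NA>"
        for start, end in segments
    ]


active = [True, True, False, True, True, True, False, False, False, True]
raw = frames_to_segments(active, frame_s=0.5)
smoothed = fill_gaps(raw, max_gap_s=0.5)
rttm = to_rttm("call", "SPEAKER_00", smoothed)
print("\n".join(rttm))
```

The half-second pause is bridged into one turn, while the longer 1.5-second silence still separates two segments.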

The One Metric That Matters — Diarization Error Rate

To judge whether a diarizer is any good, the field uses a single headline number: diarization error rate, abbreviated DER. Understanding it is essential, because vendors and papers quote it constantly and it is easy to be misled by how it is measured.

DER answers a simple question: across the whole recording, what fraction of the time did the system get the speaker wrong? It is the sum of three kinds of mistake. Missed speech is time when someone was talking but the system labelled it silence. False alarm is the opposite — the system heard a speaker in silence or noise. Speaker confusion is time correctly identified as speech but assigned to the wrong speaker. Add the three, divide by the total speaking time, and you get DER as a percentage. Lower is better, and zero is perfect.

Here is the arithmetic on a toy example. Suppose a one-hour recording contains 50 minutes of actual speech. The system mislabels 2 minutes of speech as silence (missed), invents 1 minute of speech where there was only silence (false alarm), and assigns 2 minutes of real speech to the wrong person (confusion):

DER = (missed + false alarm + confusion) / total speech
DER = (2 min + 1 min + 2 min) / 50 min
DER = 5 / 50 = 0.10 = 10%

A 10% DER means that, loosely, one-tenth of the speaking time is mislabelled in some way. That is a respectable real-world number; a DER under about 10% is generally usable, and the very best systems on clean broadcast audio dip below 8%.
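The same arithmetic as a small function, useful for sanity-checking a vendor's breakdown. The function name is illustrative; any consistent time unit works as long as all four inputs share it.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER as a fraction: the three error durations over the total speech duration."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# The toy example from the text, in minutes.
der = diarization_error_rate(missed=2, false_alarm=1, confusion=2, total_speech=50)
print(f"DER = {der:.0%}")
```

Because false alarms are counted in the numerator but not in the denominator, DER can in principle exceed 100% on very noisy audio.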

The critical caveat, the one that separates honest numbers from marketing ones, is how DER is measured. Two settings change the number dramatically. The first is the forgiveness collar: a small window around each speaker change (often 250 milliseconds on each side) that the scorer ignores, on the grounds that the exact boundary is fuzzy even for humans. The second is whether overlapping speech is scored at all; some reports quietly skip the moments when two people talk at once, because those are the hardest. Pyannote, to its credit, publishes its numbers in the least forgiving setup: no collar, and overlapping speech fully scored (pyannote speaker-diarization-3.1, 2026). That can make pyannote's published DER look higher than a competitor's "with collar, no overlap" number even when pyannote is actually the more accurate system. When you compare two diarizers, the only fair comparison is one where both used the same collar and the same overlap policy.
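A toy scorer shows how much the collar alone can move the number. This is a simplified sketch, not a replacement for a real scoring tool: it assumes one speaker per frame (so the overlap policy is not modelled), 50-millisecond frames, and hypothesis labels that are already mapped onto the reference labels.

```python
def frame_der(reference, hypothesis, collar_frames=0):
    """Frame-level DER, ignoring `collar_frames` on each side of every reference change."""
    boundaries = [i for i in range(1, len(reference)) if reference[i] != reference[i - 1]]

    def ignored(i):
        return any(b - collar_frames <= i < b + collar_frames for b in boundaries)

    errors = speech = 0
    for i, (ref, hyp) in enumerate(zip(reference, hypothesis)):
        if ignored(i):
            continue
        if ref is not None:
            speech += 1
        if ref != hyp:
            errors += 1  # a miss, a false alarm, or a confusion
    if speech == 0:
        raise ValueError("no scored reference speech")
    return errors / speech


# Reference: speaker A for 2 s, then speaker B for 2 s (50 ms frames).
reference = ["A"] * 40 + ["B"] * 40
# Hypothesis: switches 150 ms late and has one 100 ms confusion mid-turn.
hypothesis = ["A"] * 43 + ["B"] * 37
hypothesis[10] = hypothesis[11] = "B"

strict = frame_der(reference, hypothesis)                    # no collar
lenient = frame_der(reference, hypothesis, collar_frames=5)  # 250 ms each side
print(f"no collar: {strict:.2%}, 250 ms collar: {lenient:.2%}")
```

The same output scores roughly 6% with no collar and under 3% with the collar, purely because the late speaker change falls inside the ignored region.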

[Image: diagram explaining diarization error rate as the sum of three error types over total speech time, with a horizontal timeline comparing a ground-truth speaker track against a system output track, shaded regions marking missed speech, false alarm, and speaker confusion, and a callout noting how a forgiveness collar and overlap scoring change the number.]
Figure 3. DER is missed speech plus false alarm plus speaker confusion, divided by total speech, and the collar and overlap-scoring settings can swing the number by several points.

What Pyannote Actually Scores In The Real World

Numbers in the abstract do not help a decision, so here are pyannote's own published results for the speaker-diarization-3.1 pipeline, measured in that least-forgiving setup — no collar, overlap fully scored — fully automatically with no per-dataset tuning (pyannote speaker-diarization-3.1, 2026). The spread tells you something important: diarization accuracy depends enormously on the kind of audio.

| Benchmark (audio type) | DER (%) | Missed + false alarm (%) | Speaker confusion (%) |
| --- | --- | --- | --- |
| REPERE (broadcast TV) | 7.8 | 4.4 | 3.5 |
| VoxConverse (web/YouTube) | 11.3 | 7.5 | 3.8 |
| AISHELL-4 (Mandarin meetings) | 12.2 | 8.2 | 4.0 |
| AMI (headset-mic meetings) | 18.8 | 13.1 | 5.7 |
| DIHARD 3 (hard, mixed domains) | 21.7 | 14.3 | 7.3 |
| AliMeeting (far-field meetings) | 24.4 | 14.4 | 10.0 |
| AVA-AVD (in-the-wild video) | 50.0 | 26.5 | 23.4 |

Source: pyannote/speaker-diarization-3.1 model card, benchmarked 2026; "Full" DER setup (no collar, overlap scored).

Read that table as a map of difficulty. Clean, single-microphone broadcast audio (REPERE) scores under 8%, which is excellent. General web video (VoxConverse) and meeting audio recorded close to each mouth (AMI headset) land in the usable 11–19% range. Far-field meetings recorded by a single room mic (AliMeeting) and chaotic in-the-wild video with music, crowds, and heavy overlap (AVA-AVD) are much harder, with AVA-AVD reaching a 50% DER, meaning the errors add up to half of the total speech time. The lesson for product planning is blunt: your diarization quality is set more by your microphone and your acoustics than by your model. A close mic per speaker beats any amount of model cleverness on a bad room recording.

The newer community-1 model (pyannote.audio 4.0, late 2025) improves on these 3.1 numbers, chiefly by reducing speaker confusion — for example, AMI headset improves from 18.8% to around 17.0%, and the premium hosted precision-2 model goes further still, to roughly 12.9% on the same benchmark (pyannoteAI, 2025). The architecture is the same three stages; the gains come from better-trained segmentation and embedding models.

Pitfall One — The Gated Download Will Surprise You

The first thing that trips up almost every team is not accuracy — it is access. Pyannote's models are free and MIT-licensed, but they are gated on the Hugging Face hub: before you can download them, you must create a free Hugging Face account, visit each model's page, accept its user conditions, and generate an access token that your code passes to the pipeline (pyannote speaker-diarization-3.1, 2026). And it is not one model but several — for the 3.1 pipeline you must accept the conditions for both the segmentation-3.0 model and the speaker-diarization-3.1 pipeline separately. Miss one and the download fails with an unhelpful error.

This is a ten-minute, one-time setup, but it surprises teams who expected pip install to be the end of it, and it has a sharp production edge: a token that expires or a model whose terms change will silently break diarization in a running system. Budget the setup time, store the token as you would any secret, and monitor for the failure. The conditions themselves are mild — the maintainers use them to understand who is using the toolkit and to occasionally email about their paid services — and accepting them does not make the model any less free or open (pyannote speaker-diarization-3.1, 2026).
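One way to avoid a quiet failure is to load the token from the environment and refuse to start without it. This is a minimal sketch: the environment variable name and the helper are assumptions, and the commented-out call mirrors the loading snippet shown on the 3.1 model card.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face access token from the environment, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; diarization cannot download the gated pyannote models"
        )
    return token


# Typical use at service start-up (requires pyannote.audio and accepted model conditions):
# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization-3.1", use_auth_token=load_hf_token()
# )
```

Running this check at start-up, rather than on the first request, turns a missing or empty token into an immediate and visible deployment failure.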

Pitfall Two — You Rarely Know The Speaker Count In Advance

Pyannote can run with no idea how many people are in the recording — the clustering stage discovers the count on its own. But "can" is not "best". If you do know the number of speakers, telling the pipeline almost always improves accuracy, because it stops the clustering threshold from inventing or merging speakers. The pipeline accepts an exact count or a range:

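from pyannote.audio import Pipeline

# Load the gated pipeline once (use_auth_token is the argument shown on the 3.1 model card).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN")

# If you know exactly two people are present (e.g. a 1:1 telemedicine call):
diarization = pipeline("audio.wav", num_speakers=2)

# If you only know a plausible range (e.g. a small meeting):
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)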

The pitfall is assuming the automatic count is reliable on hard audio. On a clean two-person call it usually is. On a noisy group recording it can be off by one or two — splitting a person into two phantom speakers, or merging two quiet people into one. Wherever your product knows the count (a scheduled 1:1, a podcast with a fixed cast, a deposition with a named list), pass it in. Where it does not, give a sensible range rather than leaving it fully open, and plan for the count to be occasionally wrong.
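In an application, the speaker hint usually comes from product metadata rather than being hard-coded. A small helper along these lines keeps that logic in one place; the helper name and its arguments are hypothetical, and only the three keyword names it emits (num_speakers, min_speakers, max_speakers) are the pipeline's own.

```python
def speaker_hints(exact=None, minimum=None, maximum=None):
    """Build keyword arguments for the pipeline call from what the product knows."""
    if exact is not None:
        if exact < 1:
            raise ValueError("exact speaker count must be at least 1")
        return {"num_speakers": exact}
    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("minimum cannot exceed maximum")
    hints = {}
    if minimum is not None:
        hints["min_speakers"] = minimum
    if maximum is not None:
        hints["max_speakers"] = maximum
    return hints  # an empty dict leaves the count fully automatic


one_to_one = speaker_hints(exact=2)
small_meeting = speaker_hints(minimum=2, maximum=5)
print(one_to_one, small_meeting)
# Usage with a loaded pipeline: diarization = pipeline("audio.wav", **small_meeting)
```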

Pitfall Three — Overlapping And Short Speech Are Genuinely Hard

The two failure modes that survive even a well-tuned pipeline are overlapping speech and very short turns. When two people talk at once, the segmentation model can place the overlap but the embedding stage has less clean audio to fingerprint each voice, so confusion rises — visible in the benchmark table as the higher speaker-confusion numbers on meeting and in-the-wild audio. Very short turns — a one-word "yeah" or "right" backchannel — give the embedding almost nothing to work with, and these are often misattributed or dropped.

Two consequences follow for product design. First, do not promise word-perfect speaker attribution on conversational audio full of interruptions and backchannels; set the expectation, in the UI and with stakeholders, that labels are very good but not flawless. Second, in any product where a wrong label has real cost — a medical record, a legal transcript, a compliance log — build a fast way for a human to correct speaker labels. The pyannote maintainers are candid that the open models, while strong, are not perfect, and the honest production posture is to treat diarization as excellent assistance rather than ground truth.

Tuning: The Three Dials That Move Accuracy

If the default pipeline is not accurate enough on your audio, you have two levers, in increasing order of effort. The first and cheapest is hyper-parameter tuning: adjusting the three numbers that the pipeline described in the paper exposes. These are the segmentation threshold θ (theta, how confident the model must be to call something speech), the clustering threshold δ (delta, how close two fingerprints must be to count as the same person), and a short gap-filling duration Δ (how long a within-speaker pause to bridge). The paper's analysis is clear that the clustering threshold δ is usually the most impactful, followed by the gap-fill and then the speech threshold (Bredin, 2023). Note that the powerset segmentation model used by the 3.x pipelines removes the need for θ (Plaquet & Bredin, 2023), so on those pipelines only δ and Δ remain to tune. Tuning these against a small set of your own hand-labelled recordings gave a roughly 7% relative DER improvement on matched audio in the published experiments (Bredin, 2023).

The second, more powerful lever is fine-tuning the segmentation model on your own labelled data. This is the step that delivers the big gains on audio that differs from pyannote's training set — the published experiments show a roughly 17% relative DER improvement on out-of-domain data after fine-tuning, and it has a pleasant side effect: once the segmentation model is adapted to your domain, the gap-filling dial stops mattering, simplifying tuning to just two knobs (Bredin, 2023). The cost is that you need labelled conversations from your domain and a GPU to train on. For a team running a high volume of one specific kind of audio — say, a telemedicine provider with thousands of doctor-patient calls — fine-tuning is often the single highest-return investment in diarization quality.
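"Relative" improvement is easy to misread as percentage points, so a quick calculation helps. The 20% baseline below is a hypothetical figure chosen for illustration; the 7% and 17% gains are the published ones quoted above, and they come from different conditions (matched versus out-of-domain audio), so they should not be stacked.

```python
def apply_relative_gain(der_percent, relative_gain):
    """DER after a relative improvement, e.g. relative_gain=0.07 for 7% relative."""
    return der_percent * (1.0 - relative_gain)


baseline = 20.0  # hypothetical starting DER, in percent
after_tuning = apply_relative_gain(baseline, 0.07)
after_fine_tuning = apply_relative_gain(baseline, 0.17)
print(f"tuning: {after_tuning:.1f}%, fine-tuning: {after_fine_tuning:.1f}%")
```

On that hypothetical baseline, tuning is worth about 1.4 percentage points and fine-tuning about 3.4.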

Self-Host Pyannote Or Pay A Hosted Service

The real decision most teams face is not "pyannote or something else" — it is "self-host the open pyannote models, or pay a hosted diarization service" such as pyannoteAI's own precision-2, AssemblyAI, Deepgram, or a cloud provider's speech service. Here is the comparison that matters.

| Criterion | Self-hosted pyannote | Hosted diarization API |
| --- | --- | --- |
| Software cost | Free (MIT license) | Per-hour or per-minute fee |
| Accuracy (DER) | Strong; tunable on your data | Often higher out of the box (e.g. precision-2) |
| Speed | ~40× real time on one GPU | Vendor-managed, often faster |
| Data location | Stays on your servers | Sent to vendor cloud |
| Setup effort | GPU, models, gated token, tuning | An API key |
| Speaker count | Auto or hinted | Auto or hinted |
| Best when | On-prem rules; steady volume; you can tune | Ship fast; want top accuracy with no ops |

Sources: Bredin (2023) for the 40× figure; pyannoteAI (2025) for hosted model positioning.

The speed figure is worth grounding. The published pipeline runs about 40 times faster than real time on a single datacentre GPU, with most of the time spent in the embedding stage (Bredin, 2023). That means one GPU chews through 40 hours of audio per wall-clock hour — cheap for steady batch work.

Worked Example — What Self-Hosting Pyannote Costs

Put real numbers through it. Suppose you need to diarize 1,000 hours of recorded calls per month — a mid-sized call centre or a telemedicine backlog. This is batch work: the files already exist, so only throughput and cost matter.

At a conservative 30× real time (below the published 40×, to leave headroom for loading and overhead), one GPU processes 30 hours of audio per wall-clock hour:

1,000 hours of audio ÷ 30 hours per GPU-hour = 33.3 GPU-hours per month

Rent a suitable cloud GPU at about $1.00 per hour and the raw compute bill is:

33.3 GPU-hours × $1.00/hour = about $33 per month — in compute

That is strikingly cheap, and it is why self-hosting is attractive for steady volume. But the compute number is not the whole cost. Add the one-time engineering to build the pipeline, ongoing maintenance, the GPU you may keep partly idle to absorb peaks, the token management, and the human time to correct labels where it matters. For comparison, a hosted diarization API at, say, $0.30 per hour would bill 1,000 × $0.30 = $300 per month — roughly nine times the pyannote compute, but with no pipeline to build and, often, higher out-of-the-box accuracy. The break-even tips toward self-hosting when volume is steady and high, when the audio must not leave your servers, and when you have an engineer who can run and tune the pipeline. It tips toward a hosted service when volume is low or spiky, when you want the best accuracy with no operations work, or when you need to ship next week.
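The worked example is easy to rerun with your own volume and prices. The sketch below reproduces the figures above; the GPU price and the per-hour API price are the same illustrative assumptions used in the text, not quotes from any vendor.

```python
def gpu_hours(audio_hours, realtime_factor):
    """Wall-clock GPU hours needed to process `audio_hours` of audio."""
    return audio_hours / realtime_factor


def monthly_costs(audio_hours, realtime_factor, gpu_price_per_hour, api_price_per_audio_hour):
    """Return (self-hosted compute cost, hosted API cost) for one month."""
    compute = gpu_hours(audio_hours, realtime_factor) * gpu_price_per_hour
    api = audio_hours * api_price_per_audio_hour
    return compute, api


compute, api = monthly_costs(
    audio_hours=1000,
    realtime_factor=30,             # conservative, below the published 40x
    gpu_price_per_hour=1.00,        # assumed cloud GPU price
    api_price_per_audio_hour=0.30,  # assumed hosted API price
)
print(f"compute: ${compute:.2f}, API: ${api:.2f}, ratio: {api / compute:.1f}x")
```

The compute figure excludes engineering, maintenance, idle capacity, and human correction time, which is where most of the real self-hosting cost sits.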

Common Mistake — Comparing DER Numbers That Were Not Measured The Same Way

The single most damaging mistake we see is comparing diarization accuracy across tools without checking how each number was measured. A vendor advertises "8% DER" and pyannote's model card shows "18.8% DER" on AMI, so the vendor looks more than twice as good. But the vendor measured with a 250-millisecond forgiveness collar and skipped overlapping speech, while pyannote measured with no collar and full overlap scoring. Re-scored on the same lenient settings, pyannote's AMI number drops by several points and the gap largely vanishes. The numbers were never comparable.

The fix is discipline. Before believing any DER comparison, confirm three things were identical for both systems: the collar (in milliseconds), whether overlapped speech was scored, and the exact test set. If a vendor will not state their collar and overlap policy, treat their number as marketing, not measurement. Better still, run both tools on a sample of your audio with one scoring tool and one setting — the open-source pyannote.metrics library computes DER consistently — and compare those. The only DER that predicts your product's quality is one measured on your audio with one honest ruler.
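The three-point check can be encoded so that a comparison is rejected before anyone quotes it. This is an illustrative sketch: the record type and field names are hypothetical, and the vendor's settings are the ones from the example above (its test set is assumed to be AMI for the sake of the demonstration).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerReport:
    system: str
    der_percent: float
    collar_ms: int
    overlap_scored: bool
    test_set: str


def comparable(a, b):
    """Two DER figures are comparable only if collar, overlap policy, and test set match."""
    return (a.collar_ms, a.overlap_scored, a.test_set) == (
        b.collar_ms,
        b.overlap_scored,
        b.test_set,
    )


vendor = DerReport("vendor", 8.0, collar_ms=250, overlap_scored=False, test_set="AMI")
pyannote_card = DerReport("pyannote 3.1", 18.8, collar_ms=0, overlap_scored=True, test_set="AMI")
print("comparable:", comparable(vendor, pyannote_card))
```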

Where Fora Soft Fits In

We build speaker-aware features into the video products our clients ship, and the choice between self-hosted pyannote and a hosted service comes up on most of them. We have used pyannote where speaker separation had to happen on the client's own infrastructure — speaker-labelled records for telemedicine consultations, talk-time analytics for video conferencing, and searchable, speaker-tagged archives for OTT and surveillance. The recurring lesson is that the model is the easy part. The hard part is the integration: extracting clean mono audio at 16 kHz, managing the gated token without breaking a deployment, passing a known speaker count where the product has one, choosing the right scoring setup so accuracy claims are honest, and building the human-correction path for records where a wrong label carries real cost. We pick by the four questions below, not by which tool is fashionable.

How To Choose — Four Questions

The decision comes down to four questions, in order.

First, do you actually need to know who spoke? If you only need a plain transcript of what was said, you need ASR, not diarization — skip pyannote entirely. Diarization earns its complexity only when speaker attribution matters.

Second, must the audio stay on your own servers? For medical, legal, or otherwise regulated recordings that cannot go to a third-party cloud, self-hosted pyannote is often the only compliant option, and its operational cost is the price of that compliance.

Third, do you know the speaker count, and how clean is the audio? A known count and close, clean microphones make even the default pipeline excellent. An unknown count on far-field or noisy audio is where you will need tuning, fine-tuning, or a stronger hosted model — plan for it.

Fourth, do you have the team and the volume? Pyannote rewards steady, high volume and an engineer who can run and tune a GPU pipeline. At low or spiky volume, or with no ML capacity, a hosted service that gives strong diarization behind one API key will be cheaper all-in and far faster to ship.

[Image: top-down decision tree with four sequential diamond questions (need to know who spoke, data must stay on-prem, known speaker count and clean audio, team and volume) routing to three rectangular outcomes: plain ASR only, self-hosted pyannote, or a hosted diarization service.]
Figure 4. Four questions, asked in order, route most speaker-attribution projects to the right tool.
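The four questions can also be written down as a small function, which is handy as a checklist in a planning document. This is one possible reading of the questions, not a formal rule: the function name, argument names, and the wording of the note are illustrative.

```python
def choose_tool(need_speakers, must_stay_on_prem, known_count_and_clean_audio, has_team_and_volume):
    """Return (recommendation, notes) by walking the four questions in order."""
    notes = []
    if not need_speakers:
        return "plain ASR only", notes
    if not known_count_and_clean_audio:
        notes.append("hard audio: budget for tuning, fine-tuning, or a stronger model")
    if must_stay_on_prem:
        return "self-hosted pyannote", notes
    if has_team_and_volume:
        return "self-hosted pyannote", notes
    return "hosted diarization service", notes


transcript_only = choose_tool(False, True, True, True)
regulated_far_field = choose_tool(True, True, False, False)
small_startup = choose_tool(True, False, True, False)
print(transcript_only, regulated_far_field, small_startup, sep="\n")
```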

References

  1. Bredin, H. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. Proc. Interspeech 2023 (ISCA Archive, doi:10.21437/Interspeech.2023-105). https://www.isca-archive.org/interspeech_2023/bredin23_interspeech.pdf. Primary source for the three-stage architecture (local neural speaker segmentation on a 5 s sliding window with 500 ms step, overlap-aware local speaker embedding, global agglomerative hierarchical clustering with centroid/UPGMC linkage, and final aggregation), the three pipeline hyper-parameters (θ segmentation threshold, δ clustering-stop threshold, Δ intra-speaker gap fill), the relative-importance ranking of those hyper-parameters, the 40× real-time speed on a single V100 GPU with most time in embedding, the ECAPA-TDNN embedding used in 2.1, the ~7% relative DER gain from hyper-parameter tuning and ~17% from segmentation fine-tuning, the DER decomposition into confusion plus false-alarm-plus-miss, and the MIT license / challenge results (Ego4D 2022, Albayzin 2022). All architecture and tuning claims in this article trace to this paper.
  2. Plaquet, A. & Bredin, H. Powerset multi-class cross entropy loss for neural speaker diarization. Proc. Interspeech 2023; arXiv:2310.13025. https://arxiv.org/abs/2310.13025. Primary source for the powerset multi-class segmentation formulation used by the segmentation-3.0 model — dedicated classes for overlapping speaker pairs rather than independent multi-label switches — and its reported gains on overlapping speech and domain-mismatch robustness, plus the elimination of the detection-threshold hyper-parameter that the multi-label formulation required.
  3. pyannote/speaker-diarization-3.1 model card, Hugging Face, accessed 2026-05-31. https://huggingface.co/pyannote/speaker-diarization-3.1. Source for the benchmark DER table read directly from the card (REPERE 7.8, VoxConverse 11.3, AISHELL-4 12.2, AMI headset 18.8, DIHARD 3 full 21.7, AliMeeting 24.4, AVA-AVD 50.0, each decomposed into false-alarm/miss/confusion), the "Full" least-forgiving scoring setup (no forgiveness collar, overlapped speech scored), the pure-PyTorch change from 3.0, the 16 kHz mono input requirement, the gated dual-model token flow (accept segmentation-3.0 and speaker-diarization-3.1 conditions, create an access token), the num_speakers / min_speakers / max_speakers controls, the RTTM output format, and the >10M monthly downloads.
  4. pyannote/segmentation-3.0 model card, Hugging Face, accessed 2026-05-31. https://huggingface.co/pyannote/segmentation-3.0. Source for the segmentation model's 10 s mono 16 kHz input, its seven powerset output classes (non-speech, three single speakers, three overlapping pairs), the maximum of three speakers per chunk and two simultaneous, the MIT license, the gating, and the fact that the segmentation model alone cannot diarize a full recording (it needs the embedding + clustering pipeline).
  5. pyannoteAI. Community-1: Unleashing open-source diarization and the pyannote.audio 4.0 release notes, 2025, accessed 2026-05-31. https://www.pyannote.ai/blog/community-1. Source for the late-2025 release of pyannote.audio 4.0 and the community-1 open model, its reduced speaker-confusion versus 3.1 (e.g. AMI headset 18.8% → ~17.0%), the premium hosted precision-2 model (~12.9% on AMI headset) and its "exclusive" diarization mode for cleaner reconciliation with speech-to-text timestamps, the hosted-at-cost option, the WeSpeaker-family embedding direction, and the userbase scale (140k registered users, 45M monthly Hugging Face downloads).
  6. Bredin, H. et al. pyannote.audio: Neural Building Blocks for Speaker Diarization. Proc. ICASSP 2020. https://github.com/pyannote/pyannote-audio. Foundational source for the pyannote.audio toolkit, its open-source neural building blocks (voice activity detection, speaker change detection, overlapped speech detection, speaker embedding), and the project's MIT-licensed code.
  7. Bredin, H. & Laurent, A. End-to-end speaker segmentation for overlap-aware resegmentation. Proc. Interspeech 2021. The end-to-end neural segmentation model that pyannote's segmentation stage is built on, and the basis for its overlap-aware handling later generalised by the powerset formulation in reference 2.
  8. Ryant, N. et al. The Third DIHARD Diarization Challenge (DIHARD 3). arXiv:2012.01477, 2020. https://arxiv.org/abs/2012.01477. Chung, J. S. et al. Spot the Conversation: Speaker Diarisation in the Wild (VoxConverse). Proc. Interspeech 2020. The challenge evaluation plans that define the DER scoring conventions (forgiveness collar, overlap policy) referenced in this article, and two of the benchmark datasets in the results table.
  9. Desplanques, B., Thienpondt, J. & Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proc. Interspeech 2020. The speaker-embedding architecture used by pyannote's 2.1 pipeline to compute the per-speaker voice fingerprints described in stage two.