Speaker diarization

Speaker diarization answers the question "who spoke when" by segmenting an audio stream into stretches attributed to each distinct voice. On its own, raw ASR produces an undifferentiated wall of text; diarization labels that text so a transcript can say which lines came from the clinician, which from the patient, and which from a caregiver or interpreter. It is the step that turns a transcription into a usable clinical record of a conversation, and it is what an AI scribe depends on to draft a note that quotes the right person for the right statement.
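As a rough illustration of how speaker labels get attached to ASR text, the sketch below assigns each recognized word to the diarization turn it overlaps most in time, then groups consecutive words by speaker. The `Word` and `Turn` types, the field names, and the "unknown" fallback are assumptions made for this example; real ASR and diarization systems expose their own output formats.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One ASR word with timestamps in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """One diarization turn: a speaker label and a time span (hypothetical shape)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals, or 0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Attach to each word the speaker whose turn overlaps it the most."""
    labeled = []
    for w in words:
        best_speaker, best_overlap = "unknown", 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best_speaker, best_overlap = t.speaker, o
        labeled.append((best_speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Group consecutive words from the same speaker into one transcript line."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in lines]


words = [
    Word("How", 0.0, 0.3),
    Word("are", 0.3, 0.5),
    Word("you", 0.5, 0.8),
    Word("Tired", 1.0, 1.4),
]
turns = [Turn("clinician", 0.0, 0.9), Turn("patient", 0.9, 1.5)]
transcript = to_transcript(label_words(words, turns))
```

A word that overlaps no turn is labeled "unknown" rather than forced onto the nearest speaker, so gaps in the diarization output stay visible in the transcript.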

For a telemedicine product, accurate attribution is not cosmetic — it is clinical. A note must not put the patient's reported symptom in the clinician's mouth, or attribute a medication instruction to the wrong speaker. When the transcript and resulting note become part of the chart, they are PHI under HIPAA, and any vendor performing diarization sits in your PHI chain and needs a Business Associate Agreement.

Diarization has well-known failure modes: overlapping speech, similar-sounding voices, and far-field microphones all degrade it, and a single misassigned segment can flip the meaning of a line. Because of that, the safe design pattern is to treat high-stakes attributions — consent statements, medication and dosage instructions, and follow-up directions — as items that deserve confirmation in the interface rather than silent trust. The common pitfall is assuming that because the words are correct, the speaker labels are too; in multi-party telehealth calls those are separate problems, and the speaker label is the one more likely to be quietly wrong.
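The confirmation pattern described above could be sketched as a small filter that marks which transcript segments touch a high-stakes category. The `Segment` type, the category names, and the keyword patterns are illustrative assumptions, not a validated clinical vocabulary; a real product would need a far more careful detector.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (an assumption of this sketch), one per
# high-stakes category named in the text above.
HIGH_STAKES_PATTERNS = {
    "consent": re.compile(r"\b(consent|i agree)\b", re.IGNORECASE),
    "medication": re.compile(
        r"\b(take|dose|dosage|tablets?|\d+\s?(?:mg|mcg|ml))\b", re.IGNORECASE
    ),
    "follow_up": re.compile(r"\b(follow[- ]up|come back|return in)\b", re.IGNORECASE),
}


@dataclass
class Segment:
    """One diarized transcript segment (hypothetical shape)."""
    speaker: str
    text: str


def confirmation_reasons(segment: Segment) -> list[str]:
    """Return the high-stakes categories a segment matches.

    An empty list means the segment can be shown without a confirmation
    prompt; a non-empty list means the interface should ask a human to
    confirm who said it before the attribution reaches the note.
    """
    return [
        category
        for category, pattern in HIGH_STAKES_PATTERNS.items()
        if pattern.search(segment.text)
    ]


instruction = Segment("clinician", "Take 5 mg twice a day and follow up in two weeks")
symptom = Segment("patient", "My knee hurts in the morning")

instruction_reasons = confirmation_reasons(instruction)
symptom_reasons = confirmation_reasons(symptom)
```

In this sketch the check depends only on what was said, not on how confident the diarizer was, which mirrors the point that a correctly transcribed line can still carry the wrong speaker label.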

Related terms

ASR (Automatic Speech Recognition)