Captioning renders the live transcript as readable text inside the call interface — synchronized with speech, attributed to the right speaker, and, where it feeds the medical record, editable after the fact. It is the visible surface of the transcription pipeline: real-time ASR produces the words, and captioning is how those words are presented to participants in time to be useful in conversation.
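
As a rough illustration of what "synchronized, attributed, and editable" implies for the data behind each caption, here is a minimal sketch. The `CaptionSegment` type, its fields, and the `render` helper are hypothetical names chosen for this example; they do not come from any particular ASR or captioning library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptionSegment:
    """One caption line as it might be shown in the call interface (hypothetical shape)."""
    speaker: str           # who said it, for attribution
    start_s: float         # audio start time in seconds, for synchronization
    end_s: float           # audio end time in seconds
    text: str              # words produced by the ASR stage
    is_final: bool = False  # interim ASR results can still change
    revisions: List[str] = field(default_factory=list)  # earlier versions of the text

    def edit(self, corrected: str) -> None:
        """Apply an after-the-fact correction, keeping the previous text for audit."""
        if not self.is_final:
            raise ValueError("only finalized segments can be edited")
        self.revisions.append(self.text)
        self.text = corrected


def render(segment: CaptionSegment) -> str:
    """Format a segment as a speaker-attributed caption line."""
    return f"{segment.speaker}: {segment.text}"
```

Keeping the previous text in `revisions` is one possible design choice for captions that feed a medical record, since it preserves what the ASR stage originally produced alongside the correction.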

Captioning matters for both accessibility and plain comprehension. Captions serve deaf and hard-of-hearing participants, and accessibility is increasingly a compliance obligation rather than a nicety as WCAG 2.1 AA becomes the enforced bar for healthcare digital experiences. The same captions help non-native speakers follow a clinical conversation and give everyone a fallback when audio degrades, which it routinely does on real patient networks. In a telemedicine context, where a missed word can be a missed instruction, captioning is also a comprehension and safety aid.

The engineering constraint that defines good captioning is latency. Captions that trail the speaker by more than a couple of seconds stop tracking the conversation and become useless for live dialogue, so the latency target has to be tight and treated as a hard requirement, not a best-effort goal. Speaker attribution and on-screen readability (size, contrast, line length) are also part of getting it right. The common mistake is treating captions as a passive byproduct of transcription and shipping them with whatever delay the pipeline happens to produce. If captions lag, deaf and hard-of-hearing users are effectively excluded from the live conversation even though the feature technically exists.
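
One way to keep latency from being an accident of the pipeline is to measure it per caption and check it against an explicit budget. The sketch below assumes each caption can be timestamped both when its audio was spoken and when it appeared on screen. The 2.0-second budget is an illustrative stand-in for "a couple of seconds", and `CaptionEvent` and `latency_report` are hypothetical names, not part of any real library.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Union

# Assumed budget: the text only says "a couple of seconds"; 2.0 is illustrative.
LATENCY_BUDGET_S = 2.0


@dataclass(frozen=True)
class CaptionEvent:
    speaker: str
    text: str
    spoken_at: float     # seconds: when the audio for this phrase ended
    displayed_at: float  # seconds: when the caption appeared on screen

    @property
    def latency(self) -> float:
        """Delay between speech and on-screen caption, in seconds."""
        return self.displayed_at - self.spoken_at


def latency_report(
    events: List[CaptionEvent], budget: float = LATENCY_BUDGET_S
) -> Dict[str, Union[float, bool]]:
    """Summarize caption lag and say whether it fits the budget."""
    if not events:
        raise ValueError("no caption events to report on")
    latencies = sorted(e.latency for e in events)
    # Nearest-rank 95th percentile: the value at or below which 95% of captions fall.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    p95 = latencies[rank - 1]
    late = sum(1 for lat in latencies if lat > budget)
    return {
        "p95_s": p95,
        "max_s": latencies[-1],
        "late_fraction": late / len(latencies),
        "within_budget": p95 <= budget,
    }
```

A report like this could back an alert or a release check, so that a regression in caption delay is caught as a failure of the feature instead of shipping unnoticed. Judging the budget on a high percentile instead of the mean is a deliberate choice here: a user who depends on captions experiences the slow ones, not the average.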