Real-time transcription runs streaming automatic speech recognition (ASR) against the live audio of a consult, emitting partial text hypotheses within hundreds of milliseconds and refining them as more context accumulates. Unlike batch transcription that processes a finished recording, it produces usable text while people are still talking, which is what makes live captions, AI scribes, and on-the-fly assistance possible.
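The partial-then-refined behavior described above can be sketched as a small caption buffer. This is a minimal illustration, not any particular engine's API: the `TranscriptEvent` shape and its `is_final` flag are assumptions made for the example, and real streaming ASR services name and structure these fields differently.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptEvent:
    """Hypothetical event shape; real engines use their own field names."""
    text: str
    is_final: bool


@dataclass
class CaptionBuffer:
    """Keeps finalized segments and the single partial still being refined."""
    committed: list = field(default_factory=list)
    pending: str = ""

    def apply(self, event: TranscriptEvent) -> None:
        if event.is_final:
            # A final hypothesis replaces the partial and is never revised again.
            self.committed.append(event.text)
            self.pending = ""
        else:
            # Each new partial overwrites the previous one for the same segment.
            self.pending = event.text

    def render(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


# Simulated stream: two partials, their final, then a partial for the next segment.
events = [
    TranscriptEvent("patient reports", False),
    TranscriptEvent("patient reports chest", False),
    TranscriptEvent("Patient reports chest pain.", True),
    TranscriptEvent("onset two", False),
]

buffer = CaptionBuffer()
snapshots = []
for event in events:
    buffer.apply(event)
    snapshots.append(buffer.render())
```

Each entry in `snapshots` is what a caption view would show at that moment: the text grows and is rewritten while the segment is open, then settles once the final hypothesis arrives.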
Architecturally, the audio is usually forked into the ASR engine, either at the SFU (selective forwarding unit, the media server that routes streams) or on the client, so that recognition observes the conversation without sitting in its critical path. That separation is a hard requirement: the transcription pipeline must not add latency or jitter to the call it is listening to. Patients and clinicians will tolerate imperfect captions; they will not tolerate a transcription feature that degrades the consult itself.
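One way to keep recognition out of the critical path is a bounded, non-blocking tap between the media path and the ASR sender. The sketch below is illustrative only: `AudioTap`, `offer`, and `drain` are hypothetical names, and the drop-oldest policy is an assumption chosen so that captions stay close to live when the engine falls behind.

```python
import queue


class AudioTap:
    """Bounded hand-off from the media path to the ASR sender (hypothetical)."""

    def __init__(self, max_frames: int = 50):
        self._frames = queue.Queue(maxsize=max_frames)
        self.dropped = 0  # frames discarded because ASR fell behind

    def offer(self, frame: bytes) -> None:
        """Called from the media path; returns immediately and never blocks."""
        try:
            self._frames.put_nowait(frame)
            return
        except queue.Full:
            pass
        # The buffer is full: discard the oldest frame instead of stalling the call.
        try:
            self._frames.get_nowait()
            self.dropped += 1
        except queue.Empty:
            pass
        try:
            self._frames.put_nowait(frame)
        except queue.Full:
            # Another producer refilled the slot; give up on this frame.
            self.dropped += 1

    def drain(self) -> list:
        """Called by the ASR sender; returns every frame currently buffered."""
        frames = []
        while True:
            try:
                frames.append(self._frames.get_nowait())
            except queue.Empty:
                return frames


# Simulate a stalled ASR consumer: five frames arrive, but only three fit.
tap = AudioTap(max_frames=3)
for i in range(5):
    tap.offer(f"f{i}".encode())

buffered = tap.drain()
```

In this run the two oldest frames are discarded and counted in `tap.dropped`, while the media path is never made to wait. The transcript degrades under load; the call does not.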
For a telemedicine product team, real-time transcription matters because it is upstream of nearly every language-based feature: captions for accessibility, scribe-generated notes, live translation, and visit summaries all consume its output. Its accuracy and latency therefore bound the quality of the entire AI feature set built on top: every downstream feature inherits the transcript's errors and its delay. Consult audio is also protected health information (PHI), so the ASR engine must process it inside the compliance boundary under a business associate agreement (BAA). A common mistake is choosing an ASR engine purely on general word-accuracy benchmarks while ignoring whether it handles medical terminology and accented speech in the noisy, variable audio conditions of real consults.
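To see why a headline accuracy number can mislead, it helps to score the same transcript two ways: overall word error rate (WER) and recall on clinically significant terms. The sketch below uses a made-up sentence pair and a hypothetical three-item term list; it is not a benchmark result, and a real evaluation would use consult audio and a curated vocabulary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def term_recall(reference: str, hypothesis: str, terms: set) -> float:
    """Share of listed terms spoken in the reference that survive in the hypothesis."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    spoken = terms & ref_words
    if not spoken:
        raise ValueError("no listed terms appear in the reference")
    return len(spoken & hyp_words) / len(spoken)


# Illustrative data only: one drug name is misrecognized as an ordinary word.
reference = "patient takes metoprolol and lisinopril daily for hypertension"
hypothesis = "patient takes metropolis and lisinopril daily for hypertension"
medical_terms = {"metoprolol", "lisinopril", "hypertension"}  # hypothetical list

wer = word_error_rate(reference, hypothesis)
recall = term_recall(reference, hypothesis, medical_terms)
```

Here a single substitution in eight words gives a WER of 0.125, which looks acceptable, yet one of the three medical terms is lost, and it is the one a scribe note or summary most needs to get right.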

