ASR, or automatic speech recognition, is the technology that converts spoken audio into written text. It runs in two modes: batch, where a finished recording is transcribed afterward, and streaming, where text appears within a fraction of a second as someone speaks. Streaming ASR with sub-second partial results is what makes live captions and real-time AI scribes possible during a consult, and it is the first stage of any ambient documentation pipeline. The output text is only as good as the audio and the model behind it.
Medical conversations are an unusually hard case for ASR. Drug names, dosages, anatomical terms, and clinical abbreviations are rare words the model may not have learned well; patient accents, two people talking over each other (crosstalk), and a microphone placed across the room all drive the word error rate up — and they do so precisely where mistakes carry clinical consequence. A misheard "fifteen" versus "fifty" milligrams is not a typo, it is a safety event.
For a telemedicine product team this matters in two ways. First, accuracy: never accept a vendor's published accuracy number at face value — those figures usually come from clean, general-purpose audio. Test candidate engines on your own real clinical recordings, with your microphones and your patient population, before committing. Second, compliance: the audio and the resulting transcript are PHI under HIPAA, so any ASR vendor in the path needs a BAA and a clear data-handling story. The common pitfall is shipping captions or notes that look authoritative while quietly carrying transcription errors no one is checking — which is why high-stakes lines still deserve human confirmation.

