This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.
Why this matters
If you build or run a telehealth product, captions are the AI feature most likely to be a legal obligation rather than a differentiator — and the one teams most often forget until an access complaint or an audit forces the issue. A patient who is deaf or hard of hearing has a federally protected right to communicate with their clinician as effectively as anyone else, and on a video platform that right often means live, accurate captions. At the same time, every word the captioning engine produces is patient data, so the same compliance discipline that governs the rest of your stack governs the transcription vendor too. A founder or product lead needs to know three things before scoping this feature: when captions are required and what "good enough" means, where the transcript leaves your protected boundary, and how accurate medical speech recognition actually is — because a misheard drug name in a caption is a patient-safety problem, not a typo.
Transcription is not the scribe: keep the two apart
Start by separating two features that share a first step and get blurred constantly. Both begin by turning audio into text. After that they diverge.
The AI scribe, covered in ambient clinical documentation, listens to the visit and drafts a structured clinical note — a tidied, reorganized summary in the format clinicians chart in. It throws away the small talk and keeps the medicine. A human clinician reviews and signs it before it enters the record.
Real-time transcription and captioning does something more literal: it produces the actual words as they are spoken, in order, attributed to whoever said them. Two products come out of that single stream. Live captions are the words shown on screen during the call, in near real time, so a patient or clinician can read what is being said as it happens. A transcript is the same words saved as a written record of the conversation.
The distinction matters because the two features answer different questions. The scribe answers "what is the clinical summary of this visit?" Transcription answers "what exactly was said, word for word?" One is an interpretation; the other is a recording in text form. A product can offer both, and many do — but they have different accuracy bars, different legal drivers, and, as we will see, different rules about what you keep.
How live captioning works
Under the hood, captioning is the front half of the scribe pipeline run in a hurry. The engine that does the work is the same: automatic speech recognition, or ASR — software that converts spoken audio into written words. The difference is that captions have to appear while the person is still talking, which changes the engineering.
Figure 1. The live-captioning pipeline. Audio streams into the recognition engine, which emits words as they are spoken; the same stream feeds the on-screen captions and, optionally, a saved transcript. Every stage sits inside the compliance boundary.
Streaming, not batch. A scribe can wait until the visit ends and process the whole recording at once — that is "batch" transcription. Captions cannot wait. They use streaming ASR, which sends audio to the recognition engine continuously and gets words back in fragments, within a fraction of a second. The trade-off is that the engine commits to words before it has heard the end of the sentence, so it revises as more context arrives. That is why live captions sometimes flicker and rewrite themselves a beat after they appear — the engine heard "I'll start you on metro—" and guessed, then corrected to "metoprolol" once the rest of the word arrived.
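On the display side, one way to handle this revise-as-you-go behaviour is to keep committed lines apart from the single line still in flux. The sketch below is a minimal illustration, assuming the engine emits results carrying a text and a final/interim flag. The `CaptionBuffer` class and the shape of `on_result` are hypothetical and not any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionBuffer:
    """Holds finalized caption lines plus the one line still being revised."""
    final_lines: list[str] = field(default_factory=list)
    interim: str = ""

    def on_result(self, text: str, is_final: bool) -> None:
        # Interim results overwrite each other; only a final result is committed.
        if is_final:
            self.final_lines.append(text)
            self.interim = ""
        else:
            self.interim = text

    def render(self) -> str:
        # What the viewer sees: committed lines, then the engine's current guess.
        lines = self.final_lines + ([self.interim] if self.interim else [])
        return "\n".join(lines)


buf = CaptionBuffer()
buf.on_result("I'll start you on metro", is_final=False)
first_view = buf.render()   # the early guess is on screen
buf.on_result("I'll start you on metoprolol", is_final=True)
final_view = buf.render()   # the guess has been replaced, not appended to
```

Replacing the interim line instead of appending to it is what produces the visible "flicker": the reader sees one line change, not a pile of superseded guesses.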
Latency is the design constraint. For captions to feel synchronized with the speaker, the words should appear within roughly a second of being spoken. Push the delay much past that and the captions drift out of step with the conversation, which is both distracting and, in a clinical exchange, confusing about who is responding to what. The internals of streaming recognition — which models, how they run live, how the fan-out works on the media server — live in the AI engineering section's streaming ASR article and live-captions on the SFU article; this article stays on the clinical and compliance side.
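A latency target is only useful if it is measured. The sketch below shows one possible check, assuming you can timestamp both the moment a word was spoken and the moment its caption was shown on a shared clock. The one-second budget comes from the paragraph above; the function names and sample numbers are hypothetical.

```python
LATENCY_BUDGET_S = 1.0  # assumption: the "roughly a second" target from the text


def caption_lags(events: list[tuple[float, float]]) -> list[float]:
    """events holds (spoken_at, shown_at) pairs in seconds on a shared clock."""
    return [shown_at - spoken_at for spoken_at, shown_at in events]


def share_over_budget(lags: list[float], budget: float = LATENCY_BUDGET_S) -> float:
    """Fraction of captions that appeared later than the budget allows."""
    if not lags:
        return 0.0
    return sum(lag > budget for lag in lags) / len(lags)


# Illustrative sample: the third caption arrived 1.5 s after the word was spoken.
events = [(0.0, 0.4), (1.0, 1.6), (2.0, 3.5), (3.0, 3.7)]
lags = caption_lags(events)
late_share = share_over_budget(lags)
```

Tracking the share of late captions, and not only the average lag, keeps occasional long stalls from hiding behind many fast captions.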
Speaker labels. Captions read better when they say who is talking — "Patient: I've had the cough for three days," "Dr. Lee: let's listen to your chest." The step that attributes each line to a speaker is called diarization. In a telemedicine call it is easier than in a crowded room, because each participant usually arrives on a separate audio channel — the patient's microphone and the clinician's microphone are already distinct streams the system can label reliably. Get it wrong and a caption can put the patient's symptom in the doctor's mouth, which in a saved transcript is a clinical error, not a cosmetic one.
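When each participant arrives on a separate channel, labelling can be a lookup, not a guess. The sketch below assumes a fixed channel-to-role assignment; the `CHANNEL_ROLES` mapping and `label_caption` function are hypothetical names used only for illustration.

```python
# Assumed channel assignment; a real system would take this from session setup.
CHANNEL_ROLES = {0: "Patient", 1: "Clinician"}


def label_caption(channel: int, text: str) -> str:
    """Prefix a caption with the speaker role derived from its audio channel."""
    # An unmapped channel is labelled as unknown, never attributed by guesswork.
    role = CHANNEL_ROLES.get(channel, "Unknown speaker")
    return f"{role}: {text}"


patient_line = label_caption(0, "I've had the cough for three days")
clinician_line = label_caption(1, "let's listen to your chest")
stray_line = label_caption(7, "can you hear me?")
```

Falling back to an explicit "Unknown speaker" label is a deliberate choice here: a visibly unattributed line is safer in a saved transcript than a confidently wrong one.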
When captions are required, not optional
Here is the part most product roadmaps miss. For a large share of telehealth platforms, captions are not a feature you choose to add — they are a legal obligation under disability law. Three layers stack up.
Figure 2. What drives the captioning requirement. Effective-communication law sets the duty, WCAG 2.1 AA sets the technical bar, and the ADA Title II rule puts a date on it for public entities.
Layer one: the right to effective communication. Federal disability law gives people who are deaf or hard of hearing the right to communicate with healthcare providers as effectively as everyone else, and requires providers to supply "auxiliary aids and services," which expressly include captioning, at no cost to the patient. This duty comes from the Americans with Disabilities Act (the ADA; for private providers, Title III at 28 CFR §36.303) and from Section 1557 of the Affordable Care Act (the healthcare anti-discrimination rule, 45 CFR Part 92), whose 2024 final rule reaffirmed it for any provider receiving federal funds. Section 1557 also directs the provider to give "primary consideration" to the aid the patient asks for. In plain terms: if a deaf patient needs captions to follow the visit, the platform has to be able to provide them, and the patient should not be billed for it.
Layer two: WCAG 2.1 Level AA, the technical bar. When a regulator or a court asks whether your digital service is accessible, the yardstick they reach for is the Web Content Accessibility Guidelines (WCAG), version 2.1, at conformance Level AA. WCAG is a W3C standard; its Success Criterion 1.2.4, "Captions (Live)," requires captions for live audio content and says those captions should identify the speaker and note significant sounds, not just transcribe words. One honest nuance: the WCAG authors note that 1.2.4 was written with broadcast media in mind and was "not intended to require that two-way multimedia calls between two or more individuals" be captioned regardless of need. So a routine one-to-one consult is not automatically a WCAG 1.2.4 obligation on its own. The effective-communication duty in layer one fills exactly that gap, though, and webinars, group sessions, and recorded patient education on your platform fall squarely inside the WCAG rule.
Layer three: the ADA Title II deadline, for public entities. If your platform serves a state or local government health program (a county hospital, a public university clinic, a state telehealth service), a 2024 Department of Justice rule under ADA Title II makes WCAG 2.1 AA a hard, dated requirement. The compliance dates were extended by one year on April 17, 2026: public entities serving populations of 50,000 or more must comply by April 26, 2027, and smaller entities and special districts by April 26, 2028. Private telehealth companies are governed by the ADA Title III and Section 1557 duties in layer one rather than this dated rule, but the practical engineering target, WCAG 2.1 AA captions, is the same.
The takeaway for a product team is simple: build the captioning capability in from the start, treat it as a compliance control rather than a feature flag, and you are covered whichever layer applies to your customers. Bolt it on after a complaint and you are doing remediation under pressure. The broader accessibility picture for clinical video — contrast, keyboard access, screen-reader support — is covered in WCAG 2.1 AA for telemedicine video.
The compliance boundary: the transcript is patient data
Now the part that mirrors every other AI feature in this section. The words coming out of the captioning engine are about a patient's health, spoken in a medical visit, tied to an identifiable person. That makes them Protected Health Information — PHI for short, meaning any health data that can be tied to someone. A live caption of a patient describing their symptoms is PHI just as much as the audio it came from.
So the central compliance question from the AI map applies in full: the audio is going to an outside speech-recognition service to be turned into text, so does PHI leave your protected boundary, and if so, who receives it and what have they signed?
Under the US health-privacy law known as HIPAA (the Health Insurance Portability and Accountability Act), any outside company that handles PHI on your behalf is a business associate, and you must have a signed Business Associate Agreement — a BAA — with them before they touch a single record. The BAA is the signed promise a contractor makes before they get a key to the building. The rule is explicit: a business associate is anyone who "creates, receives, maintains, or transmits" PHI for you (45 CFR §160.103), and the agreement is required before they receive any (45 CFR §164.502(e)).
For real-time transcription this means one thing above all: the ASR vendor is a business associate and needs a signed BAA covering your exact use before any consult audio reaches it. The encouraging news is that the serious medical-ASR vendors offer a BAA as standard — it is table stakes in this market. The trap is the same one that catches every AI feature: a vendor offering a BAA and your use being covered by that BAA are two different facts. The free or consumer tier of a general speech API has no BAA, and the browser's own built-in speech-recognition feature typically sends audio to the browser maker's servers with no healthcare contract at all — so it must never be used for a clinical caption.
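The gap between "the vendor offers a BAA" and "our use is covered by it" can be made explicit in code so that it is checked, not assumed. The sketch below is one possible gate; the `AsrEngine` record, its field names, and the example vendor and tier names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrEngine:
    vendor: str
    tier: str                          # the tier the product actually calls
    baa_signed: bool                   # is there a signed BAA with this vendor?
    baa_covered_tiers: frozenset[str]  # tiers that the signed BAA names


def may_receive_consult_audio(engine: AsrEngine) -> bool:
    # Both facts must hold: a BAA is signed AND it covers the tier in use.
    return engine.baa_signed and engine.tier in engine.baa_covered_tiers


covered = AsrEngine("example-medical-asr", "enterprise", True, frozenset({"enterprise"}))
wrong_tier = AsrEngine("example-medical-asr", "free", True, frozenset({"enterprise"}))
browser = AsrEngine("browser-built-in", "default", False, frozenset())

allowed = [may_receive_consult_audio(e) for e in (covered, wrong_tier, browser)]
```

The middle case is the trap described above: the vendor has signed a BAA, but the tier being called is not the one it covers, so the gate still refuses.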
What you keep is a separate decision
Captions and transcripts diverge sharply on one compliance question: retention. Live captions can be ephemeral. They appear on screen, the patient reads them, and they disappear — nothing is stored. A transcription feature built purely for live accessibility can keep nothing at all, which is the cleanest possible compliance posture: PHI flows to a BAA-covered engine, comes back as text, is displayed, and is never written to disk.
The moment you save the transcript, it becomes PHI at rest and inherits every storage rule — encryption, access control, and the audit log that records who opened it (45 CFR §164.312(b) requires that logging). A saved transcript can also become part of the patient's designated record set (45 CFR §164.501), the formal set of records a patient has a right to see and request — which means it is discoverable, retainable, and subject to the same retention and consent rules as a recording. The mechanics of consent and retention have their own article: patient consent, recording, and data retention.
The design rule that follows is the same minimum-necessary instinct that runs through HIPAA (45 CFR §164.502(b)): decide deliberately whether you need the saved transcript at all. If captions exist only for accessibility, keeping nothing is both simpler and safer. If you save transcripts — for the record, for the scribe to summarize, for quality review — treat them as the sensitive clinical records they are, and get consent for the capture.
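The retention decision can be expressed as an explicit policy and not left as a side effect of how the feature was built. The sketch below is a simplified illustration, assuming two policies and an in-memory store. The `Retention` enum and `TranscriptSink` class are hypothetical, and a real system would add encryption and access control where the comment indicates.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Retention(Enum):
    EPHEMERAL = "ephemeral"  # captions are displayed and never stored
    STORED = "stored"        # transcript is kept and is PHI at rest


@dataclass
class TranscriptSink:
    retention: Retention
    consent_given: bool = False
    stored_lines: list[str] = field(default_factory=list)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def handle(self, line: str) -> str:
        """Return the line for display; persist it only under the STORED policy."""
        if self.retention is Retention.STORED:
            if not self.consent_given:
                raise PermissionError("saving a transcript requires recorded consent")
            # A real system would encrypt and write to access-controlled storage here.
            self.stored_lines.append(line)
        return line

    def read(self, user_id: str) -> list[str]:
        """Return the saved transcript and record who opened it, and when."""
        self.audit_log.append((user_id, datetime.now(timezone.utc).isoformat()))
        return list(self.stored_lines)


live_only = TranscriptSink(Retention.EPHEMERAL)
shown = live_only.handle("Patient: I've had the cough for three days")

saved = TranscriptSink(Retention.STORED, consent_given=True)
saved.handle("Dr. Lee: let's listen to your chest")
opened = saved.read("clinician-42")
```

Under the ephemeral policy nothing is ever appended to storage, so there is no transcript to encrypt, log, or disclose. Under the stored policy, every read leaves an audit entry.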
How accurate is medical speech recognition?
Accuracy is where transcription gets clinically serious, because a caption is read in the moment and a transcript is trusted later. The headline metric is word error rate (WER) — the share of words the engine gets wrong. General-purpose speech models do well on everyday conversation and stumble on medicine: drug names, anatomy, dosages, and abbreviations are exactly the words a consumer model was not trained on.
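In its standard form, WER counts the substitutions, deletions, and insertions needed to turn the engine's output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance. It is a minimal illustration that normalizes only by lowercasing and whitespace splitting; production evaluations usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One misheard word out of six.
one_sub = word_error_rate("start hydroxyzine 25 mg at night",
                          "start hydralazine 25 mg at night")
# One dropped word out of four.
one_del = word_error_rate("take two tablets daily", "take two daily")
```

Because insertions count as errors, WER can exceed 100% when the engine produces many more words than were spoken.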
This is why medical-tuned models exist, and the gap is large. On clinical dictation benchmarks in 2025, specialized models reported single-digit error rates where general models were far higher — Google Health's research model reported around a 4.6% word error rate on radiology dictation against roughly 25% for a leading general model on the same task, and commercial medical models from vendors such as Speechmatics and Corti reported clinical WER in the 4–7% range. The exact figures move and are vendor-reported, so treat them as direction, not gospel — but the shape is consistent: a medically tuned model is several times more accurate on clinical vocabulary than a general one.
One subtlety matters more than the headline number. The errors that remain cluster in the worst possible place — the clinical entities. A model can post a low overall WER and still misrecognize drug names, diagnoses, and lab values at a higher rate, because those words are rare and easily confused (think "hydroxyzine" versus "hydralazine"). Researchers have started measuring a medical WER specifically for diagnoses, procedures, and drug names for this reason. For a product, the implication is a design rule: measure accuracy on the words that can hurt someone, set a stringent bar for medication names and allergies, and never let an unreviewed transcript stand in for the medical record. A caption that mishears a drug name is not a typo; it is a safety event waiting for a reader who trusts it.
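A rough way to see this gap on your own data is to score the safety-critical words separately. The sketch below is a crude proxy and not the entity-level metric used in the research literature: it reports the share of critical terms in the reference that do not appear verbatim in the engine's output. The function name and the term list are hypothetical.

```python
def critical_term_miss_rate(reference: str, hypothesis: str,
                            critical_terms: set[str]) -> float:
    """Share of critical terms spoken in the reference that the output lacks."""
    ref_words = reference.lower().split()
    hyp_words = set(hypothesis.lower().split())
    spoken = [w for w in ref_words if w in critical_terms]
    if not spoken:
        return 0.0  # nothing critical was said, so nothing could be missed
    missed = [w for w in spoken if w not in hyp_words]
    return len(missed) / len(spoken)


reference = "start hydroxyzine 25 mg and stop lisinopril today"
hypothesis = "start hydralazine 25 mg and stop lisinopril today"
miss_rate = critical_term_miss_rate(reference, hypothesis,
                                    {"hydroxyzine", "lisinopril"})
```

In this example only one word in eight is wrong, a 12.5% overall error rate, yet half of the drug names are wrong. That gap is why a single headline WER is not enough for medication names and allergies.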
That last point is also where transcription stays on the safe side of the FDA's medical-device line. A captioning or transcription engine records what was said; it does not diagnose, treat, or recommend. As long as it stays descriptive (words in, words out), it is documentation rather than regulated Software as a Medical Device. The scribe article and the clinical-AI safety layer describe that same boundary in more detail.
Choosing a transcription vendor
Almost every team buys the recognition engine rather than building one — clinical-grade ASR is a deep, data-hungry specialty, and several vendors offer it under a BAA. The job is to pick the engine and own the product around it: the caption display, the consent flow, the retention decision, the speaker labels. The table below compares the realistic buy options on the columns that matter in healthcare.
Figure 3. Medical ASR options compared. Read the last column first: without a BAA, the option cannot touch consult audio at all.
| Vendor / option | Real-time streaming | Medical-tuned model | BAA available? |
|---|---|---|---|
| AWS Transcribe Medical | Yes (WebSocket streaming) | Yes, clinical specialties | Yes — under the AWS BAA |
| Deepgram (Nova-3 Medical) | Yes, low latency | Yes, medical model | Yes — confirm tier |
| Microsoft Azure Speech | Yes | Via custom/medical models | Yes — under Microsoft BAA |
| Google Cloud Speech-to-Text | Yes | Medical/enhanced models | Yes — confirm config |
| AssemblyAI | Yes | Medical vocabulary + PII redaction | Yes — confirm tier |
| Browser built-in speech API | Yes | No | No — never for PHI |
Read the table by its last column first. Every serious option can be put under a BAA — except the browser's built-in recognition, which routes audio to the browser maker with no healthcare contract and must be blocked for clinical use. For the cloud vendors, "BAA available" still requires you to confirm it covers the specific service and tier you call, and to switch on the configuration the BAA requires (encryption in transit and at rest, the right region, logging). Beyond the BAA, weigh three things: whether the medical model genuinely lowers the error rate on your specialty's vocabulary, whether streaming latency is low enough for captions to feel live, and whether the vendor offers helpful extras such as automatic redaction of identifiers. AssemblyAI, for instance, can detect and redact names and other identifiers before the text leaves its environment — useful when a downstream use does not need the raw identifiers.
For multilingual visits, transcription shades into a different feature — real-time translation — which carries its own legal rules about when a machine is acceptable and when a qualified human interpreter is required. That is the subject of medical translation and interpreter augmentation, and it is not the same as captioning a single-language consult.
A common, expensive mistake
The most frequent failure is reaching for the convenient free tool. A developer wires up the browser's built-in speech recognition to ship captions quickly, or a clinician pastes a transcript into a free public AI assistant to "clean it up." Both send PHI to a company with no BAA, and on consumer tiers the data may feed model training. It is a HIPAA violation whether or not anything bad ever happens, because the patient data crossed to an uncontracted vendor.
The fix is boring and absolute. Every engine that can touch consult audio or a transcript is on an approved list and runs under a BAA on the exact covered tier, and the consumer paths (the browser API, free assistants) are blocked. Make the compliant path the easy path inside your product, for example a captioning toggle wired to the approved vendor, or the convenient shortcut wins. The second-most-common mistake is treating an unreviewed transcript as the medical record: live captions are an accessibility aid and a rough record, not a verified clinical document, and the engine's clinical-entity errors mean a human has to verify anything that becomes part of the chart.
Where Fora Soft fits in
We build the telemedicine and real-time-video platforms that captioning and transcription attach to, and we build them compliance-first: the patient-data path and the accessibility duty come before the feature. When a client wants live captions, our work is to route consult audio to a medical-ASR vendor that sits under a BAA, to keep the caption path low-latency enough to feel live, to make captions an accessibility control that satisfies the effective-communication duty and the WCAG 2.1 AA bar, and to make the retention decision deliberately: ephemeral captions where nothing needs to be kept, encrypted and access-controlled transcripts where something does. The recognition model is increasingly a commodity you can buy; placing it correctly relative to the PHI boundary and the accessibility law is the engineering that serves both the launch and the patients who depend on captions.
What to read next
- Ambient clinical documentation: the AI scribe
- Medical translation and interpreter augmentation
- WCAG 2.1 AA for telemedicine video
Call to action
- Talk to a telemedicine engineer — book a 30-minute scoping call to talk through your medical speech recognition plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Captioning & Transcription Compliance Checklist — One page: run a real-time captioning or transcription feature through the four gates — the accessibility duty, the PHI boundary and BAA, the medical-accuracy bar, and the retention decision — before you build.
References
- HHS, HIPAA Privacy Rule — business-associate definition and contract requirement, 45 CFR §160.103, §164.502(e). Why any ASR/transcription vendor that receives consult audio needs a signed BAA. https://www.hhs.gov/hipaa/for-professionals/privacy/guidance/business-associates/index.html — tier 1.
- HHS, HIPAA Privacy Rule — minimum necessary standard (45 CFR §164.502(b)) and designated record set (§164.501). The basis for the retention decision and a saved transcript's record-set status. https://www.hhs.gov/hipaa/for-professionals/privacy/guidance/minimum-necessary-requirement/index.html — tier 1.
- HHS, HIPAA Security Rule — audit controls, 45 CFR §164.312(b), and technical safeguards, §164.312(a)(1),(e)(1). Logging, encryption, and access control for stored transcripts. https://www.hhs.gov/hipaa/for-professionals/security/laws-regulations/index.html — tier 1.
- DOJ, ADA Title II Web and Mobile Accessibility Final Rule (28 CFR Part 35), April 2024, adopting WCAG 2.1 Level AA; compliance dates extended one year on April 17, 2026 (≥50,000 population: April 26, 2027; smaller entities: April 26, 2028). https://www.ada.gov/resources/2024-03-08-web-rule/ — tier 1.
- DOJ, ADA Title III effective-communication requirement and auxiliary aids/services, 28 CFR §36.303. The private-provider duty to make aurally delivered information accessible, including captioning. https://www.ada.gov/resources/effective-communication/ — tier 1.
- HHS OCR, Section 1557 of the Affordable Care Act, Nondiscrimination Final Rule, 45 CFR Part 92 (2024) — effective communication and auxiliary aids/services for individuals who are deaf or hard of hearing, at no cost, with primary consideration to the requested aid. https://www.hhs.gov/civil-rights/for-individuals/section-1557/index.html — tier 1.
- W3C, Understanding Success Criterion 1.2.4: Captions (Live), WCAG 2.1 (Level AA) — captions for live audio, speaker identification, and the note that the criterion targets broadcast rather than two-way calls. https://www.w3.org/WAI/WCAG21/Understanding/captions-live.html — tier 1 (standard).
- W3C, Web Content Accessibility Guidelines (WCAG) 2.1 Recommendation (2018, maintained) — the conformance standard regulators and courts reference for digital accessibility. https://www.w3.org/TR/WCAG21/ — tier 1 (standard).
- AWS, Amazon Transcribe Medical — HIPAA-eligible automatic speech recognition with real-time WebSocket streaming for clinical specialties; covered under the AWS BAA. https://aws.amazon.com/transcribe/medical/ — tier 4 (BAA-offering deployer).
- Deepgram, Benchmarking medical speech-recognition accuracy for clinical use and Nova-3 Medical model documentation — streaming medical ASR, HIPAA BAA, and the case for measuring error on clinical entities. https://deepgram.com/learn/benchmark-medical-speech-recognition-accuracy-production — tier 4.
- Performant ASR Models for Medical Entities in Accented Speech (arXiv 2406.12387, 2024–2025) and related clinical-ASR benchmarking literature — that clinical-entity errors (drug names, diagnoses) exceed overall WER and warrant a medical-specific metric. https://arxiv.org/abs/2406.12387 — tier 5 (peer-reviewed/preprint).
- AssemblyAI, medical transcription and PII-redaction documentation — medical vocabulary, entity detection, and BAA availability for healthcare workloads. https://www.assemblyai.com/blog/assemblyai-vs-deepgram-medical-transcription — tier 4.
Where sources disagreed, the article follows the controlling rule. Accuracy figures (the Deepgram and arXiv references) are vendor-reported or research benchmarks, presented as ranges and direction, not regulatory thresholds. The BAA requirement, the retention/record-set rules, and the effective-communication duty are stated from the HHS, DOJ, and W3C primary sources listed first, which override any looser vendor framing. The broadcast-scope note in W3C's Understanding document for WCAG 1.2.4 is reported faithfully and balanced against the effective-communication duty (the ADA Title III and Section 1557 references) that covers two-way consults.


