Published 2026-06-04 · 39 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

This article is for the founder or product manager who has decided to build an AI telehealth product — a virtual-visit app, a specialist-consult platform, a behavioral-health service — and now needs to know what the real thing costs, how long it takes, which parts are bought versus built, and where the law draws a hard line. It is equally for the engineer who has read the individual speech and language lessons and wants them welded into one deployable system with named technologies and numbers. It assumes you have met the underlying ideas already, because a capstone assembles rather than re-derives; the cross-links point back to each foundational lesson when you need the detail. By the end you will be able to draw the production architecture on a whiteboard, name the exact 2026 technology in every box, defend the cost per visit to a finance team, sequence the build so a first version ships in weeks, and tell a lawful design from one that a regulator, a malpractice lawyer, or a hospital security review will stop at the door.

What You Are Building, Stated Precisely

Fix the product before any technology. You are building a system that wraps a telemedicine visit from end to end and reduces the clinician's typing to almost nothing. Before the appointment, it talks to the patient — in plain language, by chat or voice — and collects why they are coming, their symptoms, their history, and their medications, then hands the clinician a tidy summary so the visit can start informed. During the appointment, it listens to the video call and drafts the clinical note as the conversation happens. After the appointment, it places that note — once the clinician has read and signed it — into the electronic health record, the software system that stores the patient's chart, usually shortened to EHR. The clinician's job shrinks to talking to the patient and approving good drafts.

The industry has a phrase for this now: the AI becomes "the shell around the visit, not just the stenographer inside it." That shell is exactly the product. A scribe alone writes the note; an intake-plus-scribe system also prepares the visit and closes the loop into the chart, which is why this capstone is a bigger build than the scribe lesson it grows out of, and why it is worth its own architecture.

Two words in the product name carry weight, and both are deliberate scope choices. Intake means the structured gathering of a patient's reason-for-visit, symptoms, history, and medications before the clinician joins — not diagnosis, and not a decision about how urgently they should be seen. Scribe means documentation of what was said and decided — not advice about what should be decided. Hold those two lines and the system stays an assistant that prepares and records the work of a licensed human. Cross either line — let the intake agent tell the patient what condition they have, or let the scribe suggest a diagnosis on its own authority — and you have built something heavier, slower to ship, and, as the compliance section shows, regulated as a medical device.

The Spine: One Rule, Applied Twice

Two ideas carry the entire build. Get them right and everything else is detail.

The first is the draft-and-confirm pattern, and it runs through both halves of the system. The intake agent does not file a diagnosis; it drafts a summary the clinician confirms. The scribe does not write the chart; it drafts a note the clinician signs. In both places a machine produces a draft and a licensed human makes it real. This is not a compliance decoration bolted on at the end — it is the load-bearing wall of the whole design, because three separate things all hang on it: who is legally responsible, whether the product is a regulated device, and whether a fluent-but-wrong sentence ever reaches a real chart. We return to each of those below; for now, hold the rule in one line: the model drafts, the clinician decides, and nothing is filed automatically.

The second idea is the clean per-speaker audio a video visit hands you for free, and it is the reason a telehealth product is the best place to build a scribe rather than the hardest. In a physical clinic, an ambient scribe fights a room microphone, far-field echo, and two or three people talking over each other. A video visit gives you the opposite. Real-time communication on the web — WebRTC for short, the technology that carries the call — sends each participant on a separate media track, so the clinician's voice and the patient's voice arrive as two distinct, close-miked streams. That single fact buys three things at once: near-studio audio per person, so transcription is more accurate; diarization — the task of labeling who said which words — almost for free, because each speaker is already a separate stream rather than a voice the software must untangle after the fact; and a natural place to both capture the audio and ask for consent, which is the call itself. The mechanics of tapping that audio are the subject of the SFU-side ASR fan-out lesson.

Hold the two ideas together and the platform has a clean shape. The draft-and-confirm pattern decides what the machine is allowed to do on its own — draft, never decide. The clean-audio fact decides where the scribe taps in — the call you already run, not a microphone you bolt on. Everything in the rest of this article fills in the boxes between those two ideas, decides what you build versus buy, and prices it.

Production deployment diagram of a telemedicine intake and AI scribe system, read left to right across three phases. Phase one, before the visit, shows a conversational intake agent that talks to the patient by chat or voice, collects symptoms, history, and medications, and produces a structured pre-visit summary that the clinician confirms. Phase two, during the visit, shows a WebRTC video call whose two per-speaker audio tracks, clinician and patient, are tapped by a server-side bot or the media server, passed to automatic speech recognition and diarization, and then to a two-step structuring stage that first extracts clinical facts tied to transcript spans and then generates a SOAP note. Phase three, after the visit, shows the clinician reviewing and signing the draft note, after which a FHIR write-back service files the signed note into the electronic health record. A consent-and-audit band runs underneath all three phases, capturing patient recording consent before any audio is tapped and logging every step. A footer line reads the model drafts; the clinician decides; nothing is filed automatically.

Figure 1. The production deployment. Intake prepares the visit, the scribe taps the call's clean per-speaker audio, and a signed note flows into the EHR — with consent captured up front and every step logged.

The Production Architecture, Box By Box

A real deployment is more than a speech model and a language model. Eight kinds of component show up in every intake-and-scribe system we have scoped, and naming them precisely is the first hour of any project.

The intake agent is the patient-facing front door. It is a conversational system — chat or voice — that asks the patient why they are coming and follows up based on their answers, the way a good nurse intake call adapts its questions. In 2026 this is built on a language model with a careful script and guardrails, and its only output is a structured summary for the clinician, never a diagnosis for the patient. Vendors such as Paratus Health (founded 2024) and the conversational-triage engine Infermedica occupy this box; Teladoc rebuilt its own pre-visit interview along these lines through 2025 and 2026. A well-built intake stage pre-populates the chart and, in published deployments, trims five to ten minutes off the visit.

The video and capture layer is the visit itself and the tap into it. The call runs on a WebRTC stack with a selective forwarding unit — the media server, or SFU, that routes each participant's stream in a group call. A server-side participant that joins the session, or a hook in the SFU, captures exactly the two audio tracks the scribe needs. You usually build this layer on a real-time platform rather than from raw sockets; the build-versus-buy of the call itself is the subject of the conferencing capstone.

The speech recognition layer turns each audio track into timestamped text. Automatic speech recognition — the technology that turns speech into text, abbreviated ASR — must do more here than ordinary dictation: it has to get drug names and dosages right, expand clinical abbreviations, and cope with accents. In 2026 the production choices are medical-tuned engines such as Deepgram Nova-3 Medical and AssemblyAI's Universal model with its medical mode, or a self-hosted Whisper-family model where data must stay inside. The engineering of production speech recognition is its own subject, covered in the streaming ASR lesson.

The diarization layer labels who said which words. Because a video call already separates speakers onto their own tracks, much of this is free, but you still align and merge the tracks into one speaker-labeled transcript. WhisperX and Pyannote are the standard tools, covered in the WhisperX lesson and the Pyannote lesson.
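The align-and-merge step can be sketched as a simple interleave by timestamp. This is a minimal illustration, assuming each track has already been transcribed separately and that both tracks share one call clock; the `Segment` type and the `merge_tracks` and `render` helpers are hypothetical names, not part of WhisperX or Pyannote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str  # known from the media track, not inferred from the voice
    start: float  # seconds from the start of the call (assumed shared clock)
    end: float
    text: str


def merge_tracks(tracks: dict[str, list[Segment]]) -> list[Segment]:
    """Interleave per-track ASR segments into one time-ordered transcript."""
    merged = [seg for segments in tracks.values() for seg in segments]
    # Order by start time; ties fall back to end time.
    return sorted(merged, key=lambda s: (s.start, s.end))


def render(transcript: list[Segment]) -> str:
    """Format the merged transcript as speaker-labeled lines."""
    return "\n".join(
        f"[{s.start:06.1f}] {s.speaker}: {s.text}" for s in transcript
    )


tracks = {
    "clinician": [
        Segment("clinician", 0.0, 2.1, "How long have you had the cough?"),
        Segment("clinician", 6.0, 8.4, "Any fever?"),
    ],
    "patient": [
        Segment("patient", 2.5, 5.6, "About two weeks now."),
        Segment("patient", 8.9, 10.2, "No fever."),
    ],
}
transcript = merge_tracks(tracks)
print(render(transcript))
```

In a real call the two tracks can drift or overlap, so a production merge would also need clock alignment and a policy for cross-talk; the sketch leaves both out.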

The structuring layer turns a raw transcript into a clinical note, and it is built in two steps for safety, not one. First a model extracts the clinical facts — symptoms, findings, medications, the plan — and ties each one to the exact span of transcript it came from. Then a language model writes the note from those extracted facts, usually into the standard SOAP format: Subjective (what the patient reports), Objective (what the clinician observes), Assessment (the working diagnosis the clinician stated), and Plan (next steps). Extracting first and generating second is what lets every line in the finished note trace back to something that was actually said, and we return below to why that ordering is the difference between safe and unsafe.
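One way to picture the extract-first ordering is as a grounding check between the two steps. The sketch below is an assumption-laden illustration: the `Fact` type and helper names are hypothetical, the facts are written by hand where a model call would normally produce them, and evidence is matched by exact substring, which a real system would likely relax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    section: str  # SOAP section: "S", "O", "A", or "P"
    text: str     # the fact as it would appear in the note
    quote: str    # verbatim transcript words the extractor cites as evidence


def locate(fact: Fact, transcript: str) -> Optional[tuple[int, int]]:
    """Return the character span of the cited words, or None if absent."""
    if not fact.quote:
        return None
    start = transcript.find(fact.quote)
    return None if start < 0 else (start, start + len(fact.quote))


def build_note(facts: list[Fact], transcript: str):
    """Keep only facts whose evidence is found; order them S, O, A, P."""
    kept, rejected = [], []
    for fact in facts:
        span = locate(fact, transcript)
        if span is None:
            rejected.append(fact)  # shown to the reviewer, never written
        else:
            kept.append((fact.section, fact.text, span))
    kept.sort(key=lambda line: "SOAP".index(line[0]))
    return kept, rejected


transcript = (
    "Patient: I've had this cough for two weeks. "
    "Clinician: This looks like acute bronchitis, let's treat it as that."
)
facts = [  # stand-in for the output of the extraction model
    Fact("A", "Acute bronchitis (stated by clinician)", "looks like acute bronchitis"),
    Fact("S", "Cough for two weeks", "cough for two weeks"),
    Fact("P", "Amoxicillin 500 mg", "amoxicillin 500 mg"),  # never said
]
note, rejected = build_note(facts, transcript)
for section, text, (start, end) in note:
    print(f"{section}: {text}  <- \"{transcript[start:end]}\"")
```

The plan line is rejected because its cited words never occur in the transcript, which is the property the ordering is meant to buy: a line without evidence does not reach the draft.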

The review-and-sign layer is where a human touches everything. The clinician reads the intake summary before the call and corrects it; reads the note draft after the call, edits anything wrong, and signs it. This is ordinary web-application territory — a draft view, an edit box, a signature action — and it is deliberately the most carefully designed screen in the product, because it is the gate that makes the output trustworthy and lawful.

The EHR integration layer files the signed note into the chart. The modern path is FHIR — Fast Healthcare Interoperability Resources, the standard healthcare data API — usually via SMART on FHIR, the authorization framework that lets an outside app launch inside the EHR with the right permissions. The note travels as a FHIR DocumentReference resource. This box has a famous trap we cover in the production-concerns section: a naïve file call lands the note in the wrong place.
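As a rough picture of what travels to the chart, the sketch below assembles a FHIR R4 DocumentReference as a plain dictionary. The function name, the identifiers, and the choice of the LOINC "Progress note" code are illustrative assumptions; which type code, encounter context, and authorization scopes a given EHR needs in order to route the note to the right place is vendor-specific and is not shown here.

```python
import base64
import json
from datetime import datetime, timezone


def build_document_reference(
    note_text: str,
    patient_id: str,
    encounter_id: str,
    practitioner_id: str,
    signed_at: datetime,
) -> dict:
    """Assemble a FHIR R4 DocumentReference for an already-signed note."""
    if signed_at.tzinfo is None:
        raise ValueError("FHIR instants need an explicit time zone")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "docStatus": "final",  # signed by the clinician, not a draft
        "type": {
            "coding": [
                {
                    "system": "http://loinc.org",
                    "code": "11506-3",  # assumed note type: Progress note
                    "display": "Progress note",
                }
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "date": signed_at.isoformat(),
        "author": [{"reference": f"Practitioner/{practitioner_id}"}],
        "context": {"encounter": [{"reference": f"Encounter/{encounter_id}"}]},
        "content": [
            {
                "attachment": {
                    "contentType": "text/plain",
                    "data": base64.b64encode(note_text.encode("utf-8")).decode("ascii"),
                }
            }
        ],
    }


doc = build_document_reference(
    "S: cough for two weeks. A: acute bronchitis.",
    patient_id="example-patient",        # hypothetical identifiers
    encounter_id="example-encounter",
    practitioner_id="example-clinician",
    signed_at=datetime(2026, 6, 4, 15, 30, tzinfo=timezone.utc),
)
print(json.dumps(doc, indent=2))
```

The resource is only built here, not sent; the actual write would go through the EHR's authorized FHIR endpoint.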

The consent, audit, and security layer runs underneath all of the above. It captures the patient's consent to be recorded before a single second of audio is tapped, logs every step so the path from spoken word to signed note can be reconstructed later, encrypts patient data in transit and at rest, and enforces who may see what. For a healthcare product this layer is not optional infrastructure — it is what lets the system be sold to a hospital at all.

Build Versus Buy: The 2026 Verdict, Component By Component

A capable team does not write all of this from scratch, and does not buy all of it either. The line in 2026 sits in a fairly stable place, with one large new wrinkle this year, and getting it right is the difference between shipping in a quarter and burning a year. The rule of thumb mirrors the other capstones in this course: adopt the mature infrastructure, buy or adopt the fast-moving models, and build only the part that is your actual product — here, the visit experience, the structuring logic, and the safeguards around them.

| Component | Build or buy | Concrete 2026 choice | Why |
| --- | --- | --- | --- |
| Video / WebRTC call | Integrate | A real-time platform (LiveKit, 100ms, Daily) | Real-time media is a solved, hard domain; never rebuild an SFU |
| Intake agent | Buy or build | Infermedica / Paratus-class, or your own scripted LLM | Buy to validate; build when the intake flow is your product |
| Speech recognition | Buy or self-host | Deepgram Nova-3 Medical / AssemblyAI medical, or Whisper | Medical-tuned APIs win on accuracy; self-host only to keep data in |
| Diarization | Adopt open source | WhisperX · Pyannote (plus per-track separation) | Standard, free; the call already half-solves it |
| Structuring (extract → SOAP) | Build | Your prompt + schema on a frontier or open LLM | The note logic and its safety are your product |
| Review-and-sign UI | Build | Your stack | This is the trust gate; it must be yours |
| EHR write-back | Integrate | SMART on FHIR + DocumentReference (App Orchard) | A standard with sharp edges; integrate, don't reinvent |
| Consent · audit · security | Build on compliant infra | HIPAA-eligible cloud + your audit log | Your obligation; cannot be delegated away |

Two cells deserve a note because they moved in 2026. The speech-recognition choice is now a safety decision with published numbers, not a price comparison. In vendor-reported clinical evaluations, AssemblyAI's medical mode reached a 4.97% missed-entity rate on medical terms against 7.32% for Deepgram's Nova-3 Medical, about a third fewer misses on the words that matter most for patient safety, while Deepgram shipped a Nova-3 update in March 2026 that cut word error rate by roughly a third across languages. Because these figures are vendor-reported, the lesson is to benchmark on your own audio and treat the clinical-term miss rate, not the headline word error rate, as the number that decides.
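A benchmark on your own audio needs a concrete definition of a clinical-term miss. The vendors' exact metric is not spelled out here, so the sketch below assumes one plausible reading: the fraction of clinician-verified reference entities that do not appear in the ASR output. The function names and the sample sentence are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation so matching ignores formatting."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def missed_entity_rate(reference_entities: list[str], hypothesis: str):
    """Return (miss rate, missed entities) for one ASR output.

    reference_entities: terms a human marked as spoken in the audio.
    hypothesis: the text the ASR engine produced for the same audio.
    """
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    padded = f" {normalize(hypothesis)} "  # pad so only whole words match
    missed = [
        entity
        for entity in reference_entities
        if f" {normalize(entity)} " not in padded
    ]
    return len(missed) / len(reference_entities), missed


entities = ["metformin", "500 mg", "lisinopril", "albuterol"]
asr_output = (
    "Patient takes met forming 500 mg, and lisinopril daily, "
    "plus albuterol as needed."
)
rate, missed = missed_entity_rate(entities, asr_output)
print(f"missed {len(missed)} of {len(entities)}: {missed} (rate {rate:.2%})")
```

Here the drug name split into "met forming" counts as a miss even though most of the sentence is correct, which is why this number can diverge sharply from headline word error rate.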

The build-versus-buy of the scribe itself changed shape in February 2026, when Epic — the EHR used by a large share of US hospitals — launched its own native ambient scribe, branded Art, that drafts notes and orders directly inside the chart with no third-party vendor. Microsoft and Nuance's Dragon Copilot already held the largest share of the standalone market, with Abridge, Ambience Healthcare, Suki, and Nabla behind it; Epic shipping a native option pressed every health system to ask whether a separate scribe contract is still worth it. For a product team the takeaway is sharp: if your customers live inside Epic and want only a generic scribe, you may be competing with a free built-in feature, so the room to win is in what Art does not do — your specific specialty, your intake front door, your own data perimeter, a multi-EHR product, or a visit experience Epic does not own. The per-component reasoning for the speech and language stack is the subject of the streaming ASR lesson and the real-cost-of-AI lesson.

Component build-versus-buy matrix for the telemedicine intake and scribe system. Eight rows list the WebRTC call, the intake agent, speech recognition, diarization, the structuring step, the review-and-sign interface, EHR write-back, and the consent-audit-security layer. For each, a colored tag marks whether you integrate, buy or adopt, or build it yourself, with the concrete 2026 technology named and a one-line reason. Integrate and buy decisions dominate the infrastructure and model rows; build is reserved for the structuring logic, the review-and-sign trust gate, and the consent-and-audit layer. A side note flags that Epic launched a native scribe called Art in February 2026, so a generic scribe inside Epic now competes with a built-in feature.

Figure 2. What to build and what to adopt. Integrate the call and the EHR standard, buy or self-host the models, and build the structuring logic, the trust gate, and the compliance layer — the three things that are actually your product.

Following One Patient From Booking To Signed Note

Numbers and boxes become concrete when you trace a single patient through the system. Follow one visit: a patient books a same-week telehealth appointment for a persistent cough.

The evening before, the intake agent messages the patient. It asks why they are coming, then follows up based on the answers — how long the cough has lasted, whether there is fever, what medications they take, whether they smoke. It is having a guided conversation, not serving a static form, so it can ask the next sensible question instead of all of them. When the patient finishes, the agent writes a structured summary: reason for visit, symptom timeline, current medications, relevant history. It does not tell the patient what is wrong, and it does not decide how urgently they must be seen — those are the two lines from the scope section. The summary lands in the clinician's queue marked as a draft to confirm.
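The scope line can also be held in the shape of the data the intake agent is allowed to emit. The sketch below is a hypothetical schema check, with field names taken from the summary described above; it rejects any output that carries a field outside those four and stamps what remains as a draft. It does not inspect free text, so it complements the agent's prompt and guardrails rather than replacing them.

```python
ALLOWED_FIELDS = {
    "reason_for_visit",
    "symptom_timeline",
    "current_medications",
    "relevant_history",
}


def as_draft_summary(raw: dict) -> dict:
    """Accept only the gathering fields and mark the result as a draft."""
    keys = set(raw)
    out_of_scope = keys - ALLOWED_FIELDS
    if out_of_scope:
        # e.g. "diagnosis" or "urgency": intake gathers, it does not decide
        raise ValueError(f"out of scope for intake: {sorted(out_of_scope)}")
    missing = ALLOWED_FIELDS - keys
    if missing:
        raise ValueError(f"incomplete summary, missing: {sorted(missing)}")
    return {**raw, "status": "draft"}


summary = as_draft_summary(
    {
        "reason_for_visit": "Persistent cough",
        "symptom_timeline": "Two weeks, no fever",
        "current_medications": ["none reported"],
        "relevant_history": "Non-smoker",
    }
)
print(summary["status"])
```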

At appointment time, the patient and clinician join the video call. Before any audio is captured, the patient sees and accepts a clear consent notice that the visit will be recorded to draft the note — a deliberate step in the call flow, captured and logged, not a checkbox buried in a sign-up page. Only after consent is recorded does a server-side participant begin capturing the two audio tracks.
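The ordering described above, consent first and capture second, can be enforced in code rather than by convention. This is a minimal sketch with hypothetical class and field names; a real call flow would persist the consent record and handle withdrawal during the visit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    visit_id: str
    patient_id: str
    notice_version: str  # which wording of the notice the patient accepted
    accepted_at: datetime


class ConsentRequired(Exception):
    """Raised when capture is attempted before consent is on record."""


class CaptureSession:
    def __init__(self, visit_id: str) -> None:
        self.visit_id = visit_id
        self.consent: Optional[ConsentRecord] = None
        self.capturing = False

    def record_consent(self, patient_id: str, notice_version: str) -> ConsentRecord:
        self.consent = ConsentRecord(
            self.visit_id, patient_id, notice_version, datetime.now(timezone.utc)
        )
        return self.consent

    def start_capture(self) -> None:
        # The audio tap refuses to start unless consent already exists.
        if self.consent is None:
            raise ConsentRequired(f"no consent recorded for {self.visit_id}")
        self.capturing = True


session = CaptureSession("visit-001")
try:
    session.start_capture()
except ConsentRequired as err:
    print(f"blocked: {err}")

session.record_consent("patient-001", "recording-notice-v1")
session.start_capture()
print(f"capturing: {session.capturing}")
```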

As the visit proceeds, each track flows to speech recognition and is merged by diarization into one speaker-labeled transcript. The clinician talks to the patient normally; the scribe is silent and invisible. When the visit ends, the structuring layer runs its two steps: it first extracts the clinical facts and ties each to the transcript span it came from, then writes a SOAP note from those facts. Because the clinician stated the working assessment out loud — "this looks like acute bronchitis, let's treat it as that" — that is what the Assessment line records; the system documents the decision, it does not make it.

The draft note, with the patient's intake summary attached, lands in the review-and-sign view. The clinician reads it, sees each line backed by the words it came from, corrects a misheard dosage, and signs. Only now does the EHR write-back run: the signed note travels as a FHIR DocumentReference into the chart, landing in the right place in the clinician's note-writing surface rather than in an unattached-documents bin. Every step — consent, capture, transcript, extracted facts, draft, edit, signature, write-back — is in the audit log, so weeks later anyone can reconstruct exactly how the note in the chart came to exist.

Notice the discipline. A machine prepared the visit and drafted the note; a licensed human confirmed the summary and signed the note; consent came before capture; and every line traced to evidence. That shape is not decoration — it is exactly what keeps the system fast for the clinician, safe for the patient, and defensible to a regulator.

Numbered patient journey through the intake and scribe system, eight steps. Step one, the night before, the conversational intake agent interviews the patient and drafts a structured summary. Step two, the clinician confirms or corrects that summary before the visit. Step three, at appointment time the patient joins the video call and accepts a recording-consent notice before any audio is captured. Step four, the two per-speaker audio tracks flow to speech recognition and diarization into one speaker-labeled transcript. Step five, when the visit ends the structuring layer extracts clinical facts tied to transcript spans. Step six, a language model generates a SOAP note from those facts, recording the assessment the clinician stated out loud. Step seven, the clinician reviews the draft with each line backed by its source words, edits, and signs. Step eight, the signed note is written into the electronic health record as a FHIR DocumentReference and the full path is saved to the audit log. Lanes group the steps by actor: intake agent, patient, clinician, models, and the record.

Figure 3. One visit, end to end. The machine prepares and drafts; the clinician confirms and signs; consent precedes capture; and every line of the note traces back to evidence.

The Accuracy Problem You Must Design Around

A note that reads perfectly and contains a drug the patient never mentioned is more dangerous than an obviously messy one, because it invites trust it has not earned. Three failure modes matter in this system, and the architecture has to be built around all three.

The first is transcription hallucination — the speech model inventing words that were never spoken. This is not hypothetical. A 2024 study presented at the ACM Conference on Fairness, Accountability, and Transparency ran clinical-style audio through OpenAI's widely used Whisper model and found that about 1% of segments contained hallucinated text — fabricated sentences, and in some cases invented medication names or harmful phrases no one had said. At the time Whisper was estimated to be in use by tens of thousands of clinicians, which is how a 1% rate becomes tens of thousands of corrupted transcripts at scale. The researchers noted that several other commercial engines did not show the same behavior — a reminder that the choice of ASR engine is a safety decision, which is exactly why the build-versus-buy table treats the clinical-term miss rate as the deciding number.

The second is the language model's own errors at the structuring stage — not only hallucinations, but omissions, where a real and important detail is silently dropped. Studies of language-model clinical summaries put these in the low single digits; one analysis reported about a 1.47% hallucination rate alongside a 3.45% omission rate. Omissions are sneakier than hallucinations because nothing on the page looks wrong — a fact is simply missing — and a missing allergy or a dropped dose can matter more than an added sentence a clinician would catch.

The third is unique to the intake half of the system: the temptation to let intake triage the patient. Symptom-checking software is far less reliable than it looks. Independent evaluations have found that online symptom checkers give the correct urgency recommendation in only about 58% of cases, with over-cautious "go to the ER" advice in 20–30% and, more dangerously, under-triage (telling a sick patient to stay home) in 10–15%. Those numbers are why this system's intake agent is scoped to gather, not to decide: it collects symptoms into a summary for the clinician and never tells the patient how urgently they should be seen. The moment intake starts making the urgency call, you have inherited that error rate as patient-safety risk and, as the compliance section shows, very likely become a regulated medical device.

The engineering answer to the first two failure modes is the extract-then-generate ordering from the structuring layer. When the model first pulls out structured facts tied to specific transcript spans, and only then writes prose from those facts, you gain two defenses at once: the note is harder to fabricate from, because it is built from extracted evidence rather than free association, and every line can show the clinician the words it came from, which turns review into a glance instead of a re-listen. The answer to the third failure mode is scope discipline plus the sign-off gate — the subject of the next section.

Two-panel safety diagram. The left panel shows the extract-then-generate pattern as two stages: first, an extract step pulls clinical facts (symptoms, medications, and the plan) from the transcript and ties each fact to the exact transcript span it came from; second, a generate step writes the SOAP note only from those extracted facts, so every line in the note carries a pointer back to the words that justify it. A caption notes that doing it in this order makes the note hard to fabricate and fast to review. The right panel shows the sign-off gate as a draft note marked "DRAFT, not a record" passing through a checkpoint labeled "clinician review and sign", beyond which it becomes a solid signed clinical record written to the EHR. Three call-outs hang off the gate. Liability: the signing clinician owns the note. Regulation: because a human decides, the tool stays an assistant and not a device. Accuracy: review is the real safety control that catches a hallucination or omission before it reaches the chart. A footer line reads: the model drafts; the clinician signs; nothing is filed automatically.

Figure 4. The two safety patterns. Extract facts with evidence first and generate the note second; then pass every draft through the clinician's sign-off, which is the real safety control, not a formality.

A Cost Model With The Arithmetic Shown

Pricing this system correctly means reasoning per visit, because that is the unit that scales with a growing telehealth practice. The arithmetic is simple, and it decides the build-versus-buy conversation, so do it once. Walk through one visit of about fifteen minutes of conversation.

Start with transcription. Medical ASR APIs in 2026 price in the rough range of a few tenths of a cent to about a cent per minute of audio. Fifteen minutes at, say, $0.005 per minute is:

transcription:  15 min × $0.005/min = $0.075 per visit

Now the intake conversation and the note generation, both of which run a language model. Intake exchanges a few thousand tokens (the chunks of text a language model reads and writes) over the pre-visit chat; structuring the note reads the transcript and writes the SOAP note, together a few thousand more. Call it on the order of 10,000–20,000 tokens across both. At 2026 mid-tier model prices, where output tokens typically cost several times more than input tokens, that lands between several cents and the low tens of cents:

intake + note LLM:  ~15K tokens × ~$0.000004–$0.000013/token ≈ $0.06–0.20 per visit

Add the two, taking about $0.15 as a mid-range figure for the language-model step, and the per-visit machine cost sits around a quarter of a dollar:

per-visit compute:  $0.075 + ~$0.15 ≈ ~$0.23 per visit

Compare the alternatives at the same unit. A human scribe service runs roughly $2,000–$4,000 a month for a clinician seeing about 400 visits a month, which is about $5–$10 per visit, or roughly $7.50 at the midpoint. An embedded scribe vendor on an unlimited plan around $99 a month works out to roughly $0.25 per visit at the same volume. Building on your own APIs lands at the compute floor above, in the low tens of cents, with the vendor margin removed and the data kept inside your perimeter.
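The same arithmetic can be kept in a few lines so it is easy to rerun when prices move. The constants below are the illustrative figures used in this section, with $0.15 taken as a mid-range value for the language-model step; the variable and function names are hypothetical.

```python
VISIT_MINUTES = 15
ASR_PRICE_PER_MINUTE = 0.005  # illustrative medical ASR price, in dollars
LLM_COST_PER_VISIT = 0.15     # mid-range of the intake + note estimate
VISITS_PER_MONTH = 400


def per_visit(monthly_fee: float, visits: int = VISITS_PER_MONTH) -> float:
    """Spread a flat monthly fee across the month's visits."""
    return monthly_fee / visits


transcription = VISIT_MINUTES * ASR_PRICE_PER_MINUTE  # 0.075
compute = transcription + LLM_COST_PER_VISIT          # 0.225

human_low = per_visit(2000)                    # 5.00
human_high = per_visit(4000)                   # 10.00
human_mid = (human_low + human_high) / 2       # 7.50
vendor = per_visit(99)                         # 0.2475

print(f"transcription:      ${transcription:.3f} per visit")
print(f"per-visit compute:  ${compute:.3f} per visit")
print(f"human scribe:       ${human_low:.2f} to ${human_high:.2f} per visit")
print(f"embedded vendor:    ${vendor:.4f} per visit")
print(f"human vs compute:   about {human_mid / compute:.0f}x")
```

Dividing the quoted monthly range by 400 visits gives $5 to $10 per visit for a human scribe, with $7.50 at the midpoint, which is roughly thirty times the compute floor.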

| Route | Time to ship | Per-visit cost (illustrative) | Who holds the patient data | Best when |
| --- | --- | --- | --- | --- |
| Human scribe | Hire / contract | ~$7.50 | Your practice | You want a person, not software |
| Embed a vendor | Days–weeks | ~$0.25 (plan-based) | The vendor | Validating the feature fast |
| Build on APIs | Weeks–months | ~$0.20–0.40 | You + API providers | You need to own the note and flow |
| Self-host open models | Months | Low tens of cents | You only | Patient data must stay inside |

The figures are illustrative and move with vendor plans and token prices; the point is the shape. Every AI route is one to two orders of magnitude cheaper per visit than a human scribe, so the choice among them is decided less by cost than by who has to hold the patient data and how much of the visit experience you need to own. One caution specific to this system: the always-on cost is not zero even when no visit is happening, because the call infrastructure and the HIPAA-eligible hosting run continuously — budget that as a fixed monthly line, separate from the per-visit compute. The full method behind these numbers, including the token math for the language-model step, is in the real-cost-of-AI lesson.

Per-visit cost diagram for one fifteen-minute telehealth visit. On the left, a stack adds up the compute for one visit: transcription at fifteen minutes times half a cent per minute equals about eight cents, plus the intake and note-generation language-model passes at roughly six to twenty cents, totaling about twenty-three cents per visit. On the right, a bar comparison shows four routes at the same per-visit unit: a human scribe at about seven dollars fifty, an embedded vendor at about twenty-five cents, building on APIs at roughly twenty to forty cents, and self-hosting open models at low tens of cents. A note explains that every AI route is one to two orders of magnitude cheaper per visit than a human scribe, so the decision turns on who holds the patient data, not on cost. A second note warns that the call infrastructure and HIPAA-eligible hosting are a fixed always-on cost separate from the per-visit compute.

Figure 5. The per-visit economics. Any AI route costs cents per visit against a human scribe's dollars; the AI routes differ on data ownership, not price. Keep the fixed hosting cost on a separate line.

Common Mistake: Letting The Machine Decide Instead Of Draft

The failure we are called to fix most often on clinical-AI products is not a weak model — it is a good model trusted to decide instead of to draft. Three versions recur, and all three are settled at architecture time, not in tuning.

The first is auto-filing the note to save the clinician a click. A demo where the note comes out clean ninety-nine times in a hundred tempts a team to push it straight to the chart. Then the hundredth note carries a wrong dose or a fabricated symptom into a real record, and a convenience feature becomes a patient-safety incident and a legal exposure. The fix is structural, not a disclaimer: make the signed note the only thing that ever reaches the EHR, make the draft visibly a draft, show the source words behind each line, and never let the system file on its own. Ask of every path, "could this reach the chart without a human signing it?" — and if the answer is yes, that path is the bug.

The second is letting intake triage the patient. Because the intake agent is already talking to the patient about symptoms, it is tempting to have it also say "this sounds urgent, go to the ER" or "this is probably just a cold." Given the real error rates of symptom checkers — correct urgency only around 58% of the time — that single feature converts a helpful data-gathering tool into an unvalidated triage device that can send a sick patient home. Keep intake to gathering; let a licensed human make the urgency and diagnosis calls.

The third is trusting one ASR engine because it sounded good in a demo. Speech recognition that transcribes a clear-spoken founder flawlessly can still drop or mangle drug names in a real accented, cross-talking visit, and in medicine a mangled drug name is a safety event. The fix is to benchmark candidate engines on your own clinical audio, measure the medical-term miss rate specifically, and keep the extract-then-generate evidence trail so the clinician can catch what slips through. Treat the speech layer as a safety component, not a commodity.

All three mistakes share one root: handing the machine a decision that belongs to a licensed human. The whole system is designed so the machine drafts and prepares while the human decides and signs — break that division anywhere and you have built the dangerous version of the product.
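The question "could this reach the chart without a human signing it?" becomes easier to answer when the note's status is an explicit state and the write path checks it. The sketch below is one hypothetical way to do that; the names are illustrative, and the `write` callable stands in for whatever performs the real EHR write-back.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    DRAFT = "draft"
    SIGNED = "signed"
    FILED = "filed"


class NotSigned(Exception):
    """Raised when anything other than a signed note heads for the chart."""


@dataclass
class Note:
    text: str
    status: Status = Status.DRAFT
    signed_by: Optional[str] = None


def edit(note: Note, new_text: str) -> None:
    if note.status is not Status.DRAFT:
        raise ValueError("only drafts can be edited")
    note.text = new_text


def sign(note: Note, clinician_id: str) -> None:
    if note.status is not Status.DRAFT:
        raise ValueError("only drafts can be signed")
    note.status, note.signed_by = Status.SIGNED, clinician_id


def file_to_ehr(note: Note, write: Callable[[Note], None]) -> None:
    # The single path to the chart: refuse anything a clinician has not signed.
    if note.status is not Status.SIGNED or note.signed_by is None:
        raise NotSigned("draft notes never reach the chart")
    write(note)
    note.status = Status.FILED


chart: list[Note] = []  # stand-in for the EHR
note = Note("A: acute bronchitis. P: rest, fluids, follow up in one weak.")
try:
    file_to_ehr(note, chart.append)
except NotSigned as err:
    print(f"blocked: {err}")

edit(note, "A: acute bronchitis. P: rest, fluids, follow up in one week.")
sign(note, "clinician-001")
file_to_ehr(note, chart.append)
print(note.status.value, len(chart))
```

Keeping one function as the only route to the chart means the audit question has a single place to look.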

The Build Plan: Five Milestones, Value At Every Step

You do not build this all at once, and you do not build it in the order the diagram is drawn. You build it so a working product exists after the first milestone and every later milestone ships independently, which keeps the project fundable and the team motivated.

Milestone 1 — the scribe over your existing call. Tap the per-speaker audio from a visit, run ASR and diarization, structure a draft note with extract-then-generate, and show it in a review-and-sign screen. Ship the ability for a clinician to finish a visit and get a good draft note to sign. No intake and no EHR write-back yet — just the documentation win, which is the single most-wanted feature and the foundation everything else attaches to. If this is shaky, nothing else matters.

Milestone 2 — consent and the audit log. Build the recording-consent step into the call flow and the immutable audit trail behind every step. This is what turns a working demo into something a hospital's privacy office will allow near real patients, and it is far cheaper to build in now than to retrofit once data is flowing.
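The mechanism behind the audit trail is not specified here, so the following swaps in one common technique purely as an illustration: a hash chain, where each entry's hash covers the previous entry's hash, so an edit to any earlier entry is detectable. The function names and event fields are hypothetical, and a chain like this gives tamper evidence only; real immutability would also need protected, write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the event together with the hash of the entry before it."""
    payload = json.dumps(
        {"prev": prev_hash, "event": event},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, event: dict) -> dict:
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev": prev, "hash": _digest(prev, event)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log: list = []
for step in ["consent", "capture", "transcript", "draft", "signature", "write-back"]:
    append(log, {"visit": "visit-001", "step": step})

print(verify(log))
log[2]["event"]["step"] = "edited later"  # simulate tampering
print(verify(log))
```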

Milestone 3 — EHR write-back. Add the SMART on FHIR integration that files the signed note into the chart as a DocumentReference, landing in the clinician's note-writing surface. This is where the product stops being a side tool the clinician copies from and becomes part of the chart workflow, and it is where most of the integration effort and the App Orchard certification time lives.

Milestone 4 — the intake front door. Add the pre-visit conversational agent that gathers symptoms, history, and medications into a summary the clinician confirms. The system now wraps the whole visit — the shell around it — rather than only the part during the call, and it reuses the structuring and review patterns you already built. Scope it to gathering, not triage, from the first line of its prompt.

Milestone 5 — specialty depth and scale. Tune the note format and intake script per specialty, add multi-EHR support beyond your first integration, and harden for volume. These come last because they are higher effort and lower risk, and because by now you have a real product in clinicians' hands telling you which specialty and which EHR to do next.
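One common way to make the Milestone 2 audit trail tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is an assumption about how such a log could look, not a description of any particular product: the class name, the event fields, and the in-memory list are all hypothetical. A hash chain detects edits but does not prevent them, so a production system would typically pair it with write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

class AuditLog:
    """Append-only log in which each entry commits to its predecessor."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"event": dict(event), "prev": prev, "hash": _digest(prev, event)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True

log = AuditLog()
log.append({"visit": "v1", "step": "consent_captured", "actor": "patient"})
log.append({"visit": "v1", "step": "draft_generated", "actor": "system"})
log.append({"visit": "v1", "step": "note_signed", "actor": "clinician"})
print(log.verify())
```

Because every later milestone appends to the same log, building it before the EHR write-back means the signing and filing events are covered from their first day in production.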

The five-milestone staircase, the reference stack, the build-versus-buy verdicts, the per-visit cost math, and the compliance checklist are all collected on one page in the downloadable blueprint at the end of this article, so a team can pin it to the wall and work down it.

The Hard Part — Compliance And The Device Line

A telehealth system that works in a demo is not a product if it is unlawful to operate where your patients are. For an intake-and-scribe system, four gates stand between a working build and a deployable one, and they are the most important part of this article to read before you write code. This is engineering-relevant context, not legal advice; confirm the specifics with qualified counsel for every market you operate in.

The first gate is consent to record, and telemedicine makes it harder, not easier. Capturing the visit audio is a recording, and recording law is separate from health-privacy law. US federal law sets a one-party-consent floor, but a dozen states — including California, Illinois, Florida, Pennsylvania, and Washington — require all parties to consent before a conversation is recorded. A video visit routinely crosses state lines: the clinician may sit in a one-party state while the patient dials in from an all-party state. The cautious and common practice is to apply the stricter rule, which means explicit patient consent before the scribe starts listening, every time. The stakes are real — in some states, recording without all-party consent is a felony. Engineering consequence: consent is a first-class step in the call flow, captured and logged before the first second of audio, not a line in a terms-of-service page.
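The engineering consequence above can be sketched as a guard that refuses to open the audio tap until consent has been logged. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, and the state set contains only the five all-party states named in this article, so it is deliberately incomplete and must be replaced with a list confirmed by counsel. The guard requires consent on every visit regardless of the rule it computes; the rule is recorded only so the audit trail shows which standard applied.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Illustrative only: the five all-party states named in the article.
# This is NOT a complete list; confirm the real set with counsel.
ALL_PARTY_STATES = frozenset({"CA", "IL", "FL", "PA", "WA"})

class ConsentRequired(Exception):
    """Raised when audio capture is attempted before consent is logged."""

def strictest_rule(clinician_state: str, patient_state: str) -> str:
    """Apply the stricter rule when either party sits in an all-party state."""
    states = {clinician_state.upper(), patient_state.upper()}
    return "all-party" if states & ALL_PARTY_STATES else "one-party"

@dataclass
class VisitSession:
    visit_id: str
    clinician_state: str
    patient_state: str
    consent_at: Optional[datetime] = None

    def record_consent(self, when: Optional[datetime] = None) -> None:
        """Log the moment the patient explicitly agreed to be recorded."""
        self.consent_at = when or datetime.now(timezone.utc)

    def start_audio_tap(self) -> dict:
        """Return an audit event, or refuse if consent was never logged."""
        if self.consent_at is None:
            raise ConsentRequired(f"visit {self.visit_id}: no consent logged")
        return {
            "visit_id": self.visit_id,
            "rule": strictest_rule(self.clinician_state, self.patient_state),
            "consent_at": self.consent_at.isoformat(),
        }

session = VisitSession("v1", clinician_state="TX", patient_state="CA")
session.record_consent()
print(session.start_audio_tap()["rule"])
```

Placing the check inside the function that starts capture, rather than in the UI alone, means a client-side bug cannot start a recording without a consent record behind it.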

The second gate is HIPAA. The Health Insurance Portability and Accountability Act is the US law governing protected health information, or PHI. This system handles PHI at every stage — intake answers, visit audio, the note, and the write-back to the EHR — so any vendor in the chain is a "business associate" and requires a signed Business Associate Agreement (BAA) before a single real patient is touched. A scribe-and-intake BAA should do more than the generic template: it should explicitly forbid using your patients' data to train or improve the vendor's models, commit to processing only the minimum data necessary, require HIPAA Security Rule safeguards, and ensure any subcontractor that touches the audio is itself under a BAA. If you build rather than buy, you are the party that must implement those safeguards.

The third gate is the device line — the most important scope decision in the build. Under US law, the 21st Century Cures Act of 2016 carved certain clinical software out of the Food and Drug Administration's definition of a medical "device," and the FDA's Clinical Decision Support guidance — updated in a new final version in January 2026 — spells out where that line falls. The short version for this system: software that documents what a clinician decided, or that gathers information for a clinician to act on, is generally not a device; software that makes or drives a clinical decision the clinician cannot independently review can be. The two scope lines from the start of this article are exactly what keep you on the safe side: an intake agent that gathers but does not triage, and a scribe that documents but does not diagnose, are administrative tools. The moment the intake agent tells a patient their likely condition, or the scribe asserts a diagnosis on its own authority, the product moves toward being regulated as software as a medical device — a different, slower, far more expensive path. Design against this line before you write a prompt, not after.

The fourth gate is transparency and the EU AI Act, for any patient in the European Union. Under Regulation (EU) 2024/1689, a system that interacts directly with a person must let them know they are dealing with AI, and a high-risk system carries a heavy compliance load — risk management, technical documentation, logging, human oversight, accuracy testing. The Act's transparency duties and its high-risk obligations are dated to apply from 2 August 2026, though a 2026 "Digital Omnibus" proposal would push some stand-alone high-risk deadlines into late 2027 if it is adopted — a moving target to confirm at publish time. Encouragingly, the Act lightens disclosure duties where AI output "has undergone a process of human review or editorial control and where a natural or legal person holds editorial responsibility" — which is precisely the sign-off gate at the center of this design. The same architecture that keeps the tool safe also keeps it on the lighter side of the regulation. The full regulatory picture is the subject of the EU AI Act lesson.

The rule across all four gates is the same one that governs the whole system: the human stays in control. Consent is the patient's control over being recorded; the BAA is your control over where the data goes; the device line is held by the clinician deciding rather than the model; and the EU transparency duty is lightened precisely because a clinician reviews and signs. A system that respects all four is a product; one that skips any of them is a liability.

Compliance decision map for the intake and scribe system, drawn as four gates a feature must pass before go-live. Gate one, consent to record, asks whether explicit patient recording consent is captured and logged before any audio is tapped, applying the stricter all-party state rule on a cross-state visit; if not, stop. Gate two, HIPAA, asks whether a Business Associate Agreement that forbids training on patient data is signed with every vendor in the chain and whether Security Rule safeguards are in place; if not, stop. Gate three, the device line, is a branch: if the intake agent only gathers and the scribe only documents what the clinician decided, the path is the lightest administrative-tool route; but if the intake agent tells the patient a likely condition or the scribe asserts a diagnosis on its own, the path crosses into regulated software as a medical device with FDA obligations. Gate four, EU AI Act transparency, asks whether EU patients are told they are dealing with AI, and notes that the human review-and-sign step lightens the disclosure duty, with high-risk and transparency obligations dated from August 2026 subject to a proposed Digital Omnibus postponement. A footer reads engineering context, not legal advice, confirm with counsel for each market.

Figure 6. The four compliance gates. Capture consent before audio, sign a no-training BAA, keep intake and scribe on the documentation side of the device line, and disclose AI to EU patients — the sign-off gate lightens that last duty.

Production Concerns: EHR Write-Back, Observability, And Security

Three cross-cutting concerns separate a prototype from something a health system will buy and trust.

EHR write-back is where integrations quietly fail. The note travels as a FHIR DocumentReference, but a naïve "create document" call typically files it as an unattached document or into a review bin — not into the note-writing surface where the clinician is actually working, which makes the feature feel broken even though the data technically arrived. Production write-back launches the app inside the EHR encounter with SMART on FHIR, constructs the DocumentReference with the metadata the EHR expects, lands it in the clinician's note workflow, preserves the review-and-sign step, and passes the EHR vendor's certification (Epic's App Orchard, for example). Budget this integration as real engineering, not a final afternoon; it is Milestone 3 for a reason.
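To make the write-back concrete, here is a rough sketch of the DocumentReference body a signed note might travel in, following the FHIR R4 resource shape. The function name and the identifiers are hypothetical, and the choices of LOINC 11506-3 (Progress note) and a plain-text attachment are assumptions for illustration; each EHR publishes its own requirements for note type, content format, and required fields, and those take precedence. The sketch covers only the resource body, not the SMART on FHIR launch or the authenticated POST.

```python
import base64
from datetime import datetime, timezone

def build_document_reference(note_text, patient_id, encounter_id,
                             practitioner_id, signed_at):
    """Build a FHIR R4 DocumentReference for a clinician-signed note.
    Refuses unsigned notes so the review-and-sign gate cannot be skipped."""
    if signed_at is None or signed_at.tzinfo is None:
        raise ValueError("note must carry a timezone-aware signing timestamp")
    encoded = base64.b64encode(note_text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "docStatus": "final",  # the clinician has signed
        "type": {"coding": [{
            "system": "http://loinc.org",
            "code": "11506-3",  # assumed note type: Progress note
            "display": "Progress note",
        }]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "author": [{"reference": f"Practitioner/{practitioner_id}"}],
        "date": signed_at.isoformat(),
        # Tying the document to the encounter is what attaches it to the visit.
        "context": {"encounter": [{"reference": f"Encounter/{encounter_id}"}]},
        "content": [{"attachment": {
            "contentType": "text/plain",
            "data": encoded,
        }}],
    }

doc = build_document_reference(
    "Plan: recheck BP in 2 weeks.", "p1", "e1", "dr1",
    datetime(2026, 6, 4, 12, 0, tzinfo=timezone.utc),
)
print(doc["context"]["encounter"][0]["reference"])
```

The encounter reference and the note type are the fields most likely to decide whether the note lands in the clinician's workflow or in an unattached-documents bin, which is why they are worth testing against the vendor's sandbox early.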

Observability means you can see why a visit went wrong after it ended. Clinical-AI systems fail in ways ordinary web apps do not — a dropped audio track, a speech engine that mangled a drug name, a structuring model that omitted an allergy. You want per-visit traces — which models ran, how long each took, what the transcript and extracted facts were, what the clinician edited — collected centrally, so support can answer "why was this note wrong?" without guessing, and so you can measure your own edit rates and miss rates over time. Build this from Milestone 1; the audit log from Milestone 2 is its backbone, and retrofitting telemetry into a live clinical pipeline is painful and risky.
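The edit rate mentioned above is one of the cheapest signals to compute, because the draft and the signed note are both already stored. The sketch below uses Python's standard difflib at word level; the function name is hypothetical, and the similarity-based definition is one reasonable choice among several (a true word-level edit distance would give slightly different numbers).

```python
from difflib import SequenceMatcher

def edit_rate(draft: str, signed: str) -> float:
    """Fraction of the note that changed between the machine draft and the
    clinician-signed version, at word level. 0.0 means signed as drafted."""
    draft_tokens, signed_tokens = draft.split(), signed.split()
    if not draft_tokens and not signed_tokens:
        return 0.0
    matcher = SequenceMatcher(None, draft_tokens, signed_tokens, autojunk=False)
    return 1.0 - matcher.ratio()

# Hypothetical example: the clinician corrected a dose before signing.
print(edit_rate("metoprolol 25 mg daily", "metoprolol 50 mg daily"))  # 0.25
```

Tracked per visit alongside the trace, a rising edit rate for one specialty or one speech engine points to where the pipeline is degrading before clinicians start complaining.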

Security and data minimization start from the fact that visit audio and notes are among the most sensitive data a company can hold. Encrypt patient data in transit and at rest, enforce strict access controls on who can see intake summaries and notes, and keep audio no longer than you need it — many designs delete the raw recording once the note is signed, keeping only the note and the audit trail, which both reduces risk and eases the consent conversation. For customers in regulated settings, self-hosting the whole pipeline so that patient audio never leaves their infrastructure is often the deciding factor, and this architecture supports it: every component here can run inside a HIPAA-eligible perimeter. Domain adaptation of a clinical model inside that perimeter is the subject of the fine-tuning lesson.
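The delete-after-signing policy described above can be expressed as a small sweep over visit records. This is only a sketch under assumptions: the record fields and function names are hypothetical, and the deletion itself is passed in as a callable so the same logic works whether audio lives on disk or in object storage. A real deployment would also write each deletion to the audit trail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class VisitRecord:
    visit_id: str
    audio_path: Optional[str]  # None once the raw recording is gone
    note_signed: bool

def purge_signed_audio(records: List[VisitRecord],
                       delete_file: Callable[[str], None]) -> List[str]:
    """Delete raw audio for every visit whose note is signed.
    Returns the visit ids that were purged."""
    purged = []
    for record in records:
        if record.note_signed and record.audio_path is not None:
            delete_file(record.audio_path)
            record.audio_path = None
            purged.append(record.visit_id)
    return purged

# Hypothetical usage with a stand-in delete function.
deleted_paths = []
records = [
    VisitRecord("v1", "/audio/v1.wav", note_signed=True),
    VisitRecord("v2", "/audio/v2.wav", note_signed=False),
]
print(purge_signed_audio(records, deleted_paths.append))
```

Unsigned visits keep their audio because the clinician may still need to replay a passage while reviewing the draft.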

Where Fora Soft Fits In

Fora Soft has built video software since 2005, and telemedicine is one of the verticals we ship, alongside video conferencing, streaming, e-learning, and surveillance. The system described here — a WebRTC visit underneath, clean per-speaker audio tapped from the call, an intake agent that gathers rather than triages, an extract-then-generate structuring step, a review-and-sign gate, and a FHIR write-back into the chart — is the backbone of the telehealth work we scope. The build order and the build-versus-buy verdicts in this article are not theory for us; they are the checklist we apply, because they are the difference between a scribe that saves a clinician an hour a day and one that quietly files a wrong dose. The compliance map is part of that checklist too: we wire consent into the call UI first, sign a no-training BAA before any real visit, and hold the device line by keeping the machine on the drafting side and the clinician on the deciding side. Our work here lives in telemedicine and the real-time video pipelines underneath it, where a clean call and a careful sign-off gate are the core of the product rather than a decoration.

References

  1. U.S. FDA — Clinical Decision Support Software — Guidance for Industry and FDA Staff (final guidance, January 2026; supersedes the 2022 guidance; clarifies the non-device CDS criteria under the 21st Century Cures Act §3060 / FD&C Act §520(o)(1)(E)). Software that supports rather than replaces clinician decision-making can be non-device. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software
  2. U.S. Congress — 21st Century Cures Act, §3060 (2016) amending FD&C Act §520(o) to exclude certain clinical decision-support software from the device definition. The statutory basis for the scribe/intake "administrative tool, not a device" position. https://www.congress.gov/bill/114th-congress/house-bill/34
  3. U.S. Department of Health & Human Services — HIPAA Business Associate Contracts (45 CFR §164.504(e)) and the Security Rule (45 CFR Part 164, Subpart C). A vendor that receives, creates, maintains, or transmits PHI is a business associate requiring a BAA and safeguards. https://www.hhs.gov/hipaa/for-professionals/covered-entities/sample-business-associate-agreement-provisions/index.html
  4. Regulation (EU) 2024/1689 (EU AI Act) — Article 50 (transparency) and Annex III / Article 6 (high-risk classification); transparency and Annex III obligations dated from 2 August 2026, with lighter disclosure under Article 50(4) where AI output undergoes human review with a responsible person. Note the 2026 "Digital Omnibus" proposal to postpone some stand-alone high-risk deadlines to late 2027 (not yet adopted at publication). https://artificialintelligenceact.eu/article/50/
  5. 18 U.S.C. § 2511 (Federal Wiretap Act) and state all-party-consent statutes (California, Illinois, Florida, Pennsylvania, Washington). Federal one-party floor; all-party states require every party's consent before recording; cross-state telehealth visits typically apply the stricter rule. https://www.law.cornell.edu/uscode/text/18/2511
  6. HL7 — FHIR R4 DocumentReference resource and SMART App Launch Framework. The standard mechanism for launching an app inside an EHR and writing a clinical note back into the chart. https://www.hl7.org/fhir/documentreference.html
  7. Koenecke et al. — "Careless Whisper: Speech-to-Text Hallucination Harms," ACM FAccT 2024. Whisper hallucinated text in ~1% of clinical-style audio segments, including fabricated medications; comparable commercial engines did not. https://facctconference.org/
  8. "Beyond human ears: navigating the uncharted risks of AI scribes in clinical practice," npj Digital Medicine (2025). LLM clinical-note error modes — one analysis reported ~1.47% hallucination and ~3.45% omission; most scribes are marketed as administrative tools outside FDA device oversight. https://www.nature.com/articles/s41746-025-01895-6
  9. AssemblyAI — Universal medical transcription benchmarks (2026). Vendor-reported ~4.97% missed-entity rate (MER) on medical terms vs ~7.32% for Deepgram Nova-3 Medical; English WER figures ~6.6% (AssemblyAI), ~8.1% (Deepgram), ~6.5% (Whisper), Feb 2026. https://www.assemblyai.com/blog/assemblyai-vs-deepgram-medical-transcription
  10. Deepgram — Nova-3 Medical and the March 2026 Nova-3 update (medical-vocabulary STT; ~34% relative batch-WER reduction in the March 2026 multilingual update). https://deepgram.com/learn/introducing-nova-3-speech-to-text-api
  11. Healthcare Dive / Healthcare IT Today — Epic launches native ambient AI charting ("Art"), February 2026 (native scribe inside the EHR; part of the Art / Penny / Emmie AI suite), and Becker's KLAS-cited market-share figures (Microsoft-Nuance ~33%, Abridge ~30%, Ambience ~13%, Suki ~10%, Nabla ~4%; Abridge ~$5.3B valuation, Series E extension April 2026; Kaiser ~24,600 physicians). https://www.healthcaredive.com/news/epic-rolls-out-ai-charting-art-notetaking-documentation-scribe/811462/
  12. Infermedica / symptom-checker triage-accuracy evaluations (2026) and JAMA Network Open (2 Oct 2025), "Use of Ambient AI Scribes to Reduce Administrative Burden and Professional Burnout" (263 clinicians, burnout 51.9%→38.8% over 30 days; ambient AI "may be scalable at a lower cost than human scribes"). Symptom checkers give correct urgency in ~58% of cases (over-triage 20–30%, under-triage 10–15%). https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2839542