A hallucination is what happens when a large language model (LLM), the kind of AI behind clinical scribes, chat triage, and visit summaries, produces an answer that reads fluently and confidently but is wrong or unsupported by anything in its input. In a healthcare setting that can mean a symptom the patient never mentioned appearing in a note, an invented dosage, or a citation to a study that does not exist. The danger is precisely that the output looks correct: it follows the grammar and tone of a real clinical sentence, so a busy reader can wave it through. Hallucination is not an occasional bug to be patched; it is an inherent property of how these models generate text, because they produce the most plausible continuation rather than a verified fact. That is why it must be engineered around rather than wished away.
For a telemedicine product team this is a patient-safety issue first and a quality issue second, and it should be treated with the same seriousness as a security defect. The standard mitigations work in layers. Grounding the model in real source data, whether through retrieval-augmented generation (RAG) or generation anchored to the actual visit transcript, leaves fewer gaps for it to fill with invention, though a grounded model can still misread or misattribute its sources. Constrained output formats, automated consistency checks against structured data, and mandatory clinician review (human-in-the-loop) catch much of what slips through. None of these is sufficient alone; safety comes from the stack.
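One of these layers, the automated consistency check against structured data, can be sketched in a few lines. The example below is a minimal illustration, not a production design: the `Medication` record, the `find_dose_conflicts` function, and the simple "name, number, mg" pattern are all hypothetical, and a real system would need proper clinical entity extraction and unit handling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Medication:
    """Hypothetical structured record, e.g. from the patient's medication list."""
    name: str
    dose_mg: float


# Assumption: doses appear as "<word> <number> mg", e.g. "metformin 500 mg".
DOSE_PATTERN = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*mg\b", re.IGNORECASE)


def find_dose_conflicts(note_text, medications):
    """Return a list of dose mentions in the note that disagree with the record."""
    known = {m.name.lower(): m.dose_mg for m in medications}
    conflicts = []
    for match in DOSE_PATTERN.finditer(note_text):
        name = match.group(1).lower()
        dose = float(match.group(2))
        if name not in known:
            # Mentioned in the note but absent from structured data.
            conflicts.append(f"{name}: not in structured medication list")
        elif known[name] != dose:
            conflicts.append(
                f"{name}: note says {dose:g} mg, record says {known[name]:g} mg"
            )
    return conflicts


record = [Medication("metformin", 500), Medication("lisinopril", 10)]
note = (
    "Patient continues metformin 500 mg and lisinopril 20 mg daily. "
    "Started atorvastatin 40 mg."
)
for conflict in find_dose_conflicts(note, record):
    print("REVIEW:", conflict)
```

A check like this errs toward over-flagging: any word followed by a milligram quantity is treated as a drug name, so it will sometimes raise items that are fine. That bias is deliberate in this sketch, since the flagged items go to the clinician review layer rather than being silently corrected.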
The common mistake is shipping on vibes. "The model seems good in our tests" is not an evaluation. Each AI feature needs an explicit, measured hallucination rate against a labeled gold set, monitored over time as models and prompts change. Where a feature touches diagnosis or medication, this evidence also feeds the regulatory story — software that influences clinical decisions can fall under FDA Software as a Medical Device (SaMD) oversight, and an unmeasured failure mode is indefensible there.
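As a rough illustration of what "measured against a labeled gold set" can look like, the sketch below counts a case as hallucinated when the model's output contains any claim that reviewers did not mark as supported, then reports the rate with a Wilson score interval so that a small gold set is not over-read. The data layout (case IDs mapped to sets of claim strings) and the function names are assumptions made for this example; extracting claims from free text is the hard part and is taken as given here.

```python
import math


def count_hallucinated(outputs, gold):
    """Count cases whose output contains at least one unsupported claim.

    outputs: dict of case_id -> set of claims found in the model output
    gold:    dict of case_id -> set of claims reviewers marked as supported
    Returns (hallucinated_cases, total_cases).
    """
    flagged = sum(1 for case_id, claims in outputs.items() if claims - gold[case_id])
    return flagged, len(outputs)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k failures in n trials (z=1.96 is about 95%)."""
    if n == 0:
        raise ValueError("gold set is empty")
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Tiny hypothetical gold set: four visits with reviewer-approved claims.
gold = {
    "v1": {"cough", "fever", "fatigue"},
    "v2": {"cough"},
    "v3": {"headache"},
    "v4": {"nausea"},
}
# Claims extracted from the model's summaries of the same visits.
outputs = {
    "v1": {"cough", "fever"},
    "v2": {"cough", "chest pain"},  # "chest pain" was never mentioned
    "v3": {"headache"},
    "v4": set(),
}

k, n = count_hallucinated(outputs, gold)
low, high = wilson_interval(k, n)
print(f"hallucination rate: {k}/{n} = {k / n:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Note that this metric only counts invented content. The empty summary for `v4` omits a real symptom but is not flagged, so omissions would need their own measure. Rerunning the same script on the same gold set whenever the model or prompt changes is what turns a one-off test into monitoring.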

