Medical Large Language Models: Capabilities and Limits from Med-PaLM 2 to GPT-4
Large language models (LLMs) pass medical exams, answer patient questions, and in some studies approach physicians in clinical reasoning. The same models can also produce entirely wrong information in a convincing voice. This article addresses the real capabilities of medical LLMs such as Med-PaLM 2 and GPT-4, and the risks that limit them in clinical use.
A large language model (LLM) is a form of AI trained on enormous text corpora that generates language by repeatedly predicting the most likely next word. With the widespread adoption of ChatGPT at the end of 2022, the question of how much medical knowledge these models "understand" moved to the center of medicine's agenda. The answer is exciting, but it also calls for restraint.
Exam Success: Med-PaLM 2 and Expert-Level Performance
Google's medically tuned model Med-PaLM 2, introduced in March 2023, marked a threshold. The model achieved 86.5% accuracy on questions in the style of the U.S. Medical Licensing Examination (USMLE), becoming the first to reach "expert level" on this kind of benchmark, about 19 percentage points above the previous version, Med-PaLM. In addition, physician raters preferred Med-PaLM 2's long-form answers to those of human physicians on eight of nine evaluation axes, and reported that the majority of outputs were consistent with scientific consensus. The first version of Med-PaLM was published in Nature in 2023, and the comprehensive evaluation of Med-PaLM 2 in Nature Medicine in 2025.
Exam score is not clinical competence
A high USMLE score measures the ability to recall knowledge on closed multiple-choice questions. Real clinical practice requires decision-making under uncertainty, with incomplete data, and tailored to the patient. Exam success is a necessary signal but, on its own, does not mean "safe to use."
General Models Are Strong Too: GPT-4 and Clinical Reasoning
General-purpose models not specifically tuned for medicine have also produced striking results. A randomized trial by Goh and colleagues, published in JAMA Network Open in 2024, is instructive on this point. Physicians performed diagnostic reasoning on clinical cases either with traditional resources alone or with traditional resources plus an LLM. The result was both impressive and sobering: the LLM working alone scored significantly higher than the physicians using traditional resources (median around 92% vs. approximately 74%).
The real lesson, however, was this: there was no significant difference between physicians with LLM access and those using traditional resources alone (about 76% vs. 74%). In other words, the model's raw power did not automatically carry over to the physicians who had access to it. This "human–AI collaboration paradox" shows that the clinical utility of LLMs depends not only on model quality but also on how, when, and how critically clinicians use these tools. Making the model good is not enough; one must also prepare the clinician to use it well.
The Real Risk: Hallucination
The most dangerous property of LLMs is their ability to produce convincing and fluent prose that is wrong — even on topics they do not know. This is called "hallucination," and in medicine it is not an academic flaw but a direct patient-safety issue.
Evidence shows that the risk varies dramatically with context. On a controlled task such as summarizing a clinical note, a study in npj Digital Medicine reported about 1.47% hallucination and 3.45% omission for GPT-4: low, but not zero. By contrast, when the input itself contains false information, risk soars: in a study published in Communications Medicine, a fabricated lab value or invented disease deliberately placed in a clinical vignette was repeated or elaborated on by models in up to 83% of cases. A simple safety prompt halved that rate but did not eliminate it. In literature-review contexts, models have also been shown to produce fabricated citations, a particularly dangerous flaw in any work requiring medical references.
| Task / Context | Reported risk | Implication |
|---|---|---|
| Clinical note summarization (GPT-4) | ~1.47% hallucination, ~3.45% omission | Low but non-zero in structured tasks |
| Fabricated data in vignette | Up to 83% repetition/elaboration | Wrong input adopted without questioning |
| Literature/citation generation | Fabricated references possible | References must not be used unverified |
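Percentages like the summarization figures above come from reviewers labeling individual sentences and counting. The following is a minimal sketch of that arithmetic, not the method of any cited study. It assumes each summary sentence has been marked "ok" or "hallucination" and each source-note sentence "covered" or "omitted"; the label names, the `rate` helper, the choice of denominators, and the counts are all hypothetical.

```python
from collections import Counter

def rate(labels, target):
    """Return the share of labels equal to target, as a percentage."""
    if not labels:
        raise ValueError("no labels to score")
    return 100.0 * Counter(labels)[target] / len(labels)

# Made-up reviewer labels for illustration only (not data from any study).
summary_labels = ["ok"] * 197 + ["hallucination"] * 3   # one label per summary sentence
source_labels = ["covered"] * 193 + ["omitted"] * 7     # one label per source sentence

hallucination_pct = rate(summary_labels, "hallucination")
omission_pct = rate(source_labels, "omitted")
print(f"hallucination: {hallucination_pct:.2f}%  omission: {omission_pct:.2f}%")
```

Even in this toy case, a rate near 1.5% still means several wrong sentences in every few hundred, which is why "low" does not mean "ignorable" in a clinical record.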
Other Limits: Bias, Currency, and Lack of Evidence
Beyond hallucination, additional limits deserve attention. Bias: models can reflect demographic and geographic imbalances in their training data, with worse performance in underrepresented groups. Currency: a model's knowledge is frozen at its training cutoff — it may not be aware of the latest guideline or drug safety notice. Insufficient prospective evidence: most impressive results come from vignettes or exam settings; randomized, prospective evidence on real patient outcomes (mortality, error rates) remains limited. For these reasons, regulators have been cautious about approving generative LLMs for autonomous clinical decision-making.
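One way to make the bias concern concrete is to break evaluation results down by subgroup instead of reporting a single overall score. The sketch below is illustrative only: the `accuracy_by_group` helper, the group labels, and the results are invented, and a real audit would need far larger samples and proper statistical testing.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Map each group name to the fraction of its cases answered correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

# Made-up evaluation results: (group label, did the model answer correctly?)
records = (
    [("group_a", True)] * 9 + [("group_a", False)] * 1
    + [("group_b", True)] * 3 + [("group_b", False)] * 2
)

scores = accuracy_by_group(records)
gap = max(scores.values()) - min(scores.values())
print(scores, f"largest gap: {gap:.2f}")
```

An overall accuracy of 80% on these invented records would hide the fact that one group sits at 90% and the other at 60%.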
A Framework for Right Use
None of this means LLMs are worthless in medicine. On the contrary, in the right frame they are powerful productivity tools. Sensible uses include drafting clinical notes, producing accessible patient information, easing administrative burdens (prior authorization, coding support), summarizing complex literature, and providing the clinician with differential diagnosis prompts. The common thread is clear: every output must be verified by a physician competent in the relevant domain; the model should be positioned as a "first-draft author," not a "final-word authority."
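In software terms, the "first-draft author" position can be enforced as a gate: model output stays a draft until a named clinician signs off. The following is a minimal sketch under that assumption. The `Draft` class and its method names are hypothetical, and a real system would also need authentication, audit logging, and versioning.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Draft:
    """Model-generated text that cannot be released without clinician review."""
    text: str
    reviewed_by: Optional[str] = None  # clinician identifier, set on sign-off

    def sign_off(self, clinician: str) -> None:
        """Record the clinician who verified the draft."""
        if not clinician.strip():
            raise ValueError("a named clinician is required")
        self.reviewed_by = clinician

    def release(self) -> str:
        """Return the text only if a clinician has verified it."""
        if self.reviewed_by is None:
            raise PermissionError("draft has not been verified by a clinician")
        return self.text

# Usage: releasing before sign-off fails; after sign-off it succeeds.
note = Draft("Patient reports intermittent chest pain on exertion.")
note.sign_off("Dr. Example")
print(note.release())
```

The design choice is that the unverified path fails loudly instead of silently passing text through, so skipping review requires a deliberate change to the code.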
Conclusion
Medical large language models represent real and rapid progress: Med-PaLM 2 has achieved expert-level exam performance, and GPT-4 has demonstrated impressive clinical reasoning. At the same time, depending on context, the same models can produce convincing errors at rates ranging from about 1% to over 80%, and their raw power does not automatically translate into clinical benefit. The right stance is neither rejection nor uncritical adoption; it is to position the LLM as a strong but supervised assistant whose output is always verified by a clinician.
References
- Singhal K, et al. Toward expert-level medical question answering with large language models (Med-PaLM 2). Nature Medicine 2025. nature.com
- Singhal K, et al. Large language models encode clinical knowledge (Med-PaLM). Nature 2023. nature.com
- Goh E, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open 2024;7(10):e2440969. jamanetwork.com
- A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine 2025. nature.com
- Multi-model assurance analysis: LLMs vulnerable to adversarial hallucination in clinical decision support. Communications Medicine 2025. nature.com
- Hallucination Rates and Reference Accuracy of ChatGPT and Bard. Journal of Medical Internet Research 2024. jmir.org