Agentic AI in Medicine: What Autonomous AI Agents Have and Haven't Achieved
Autonomous AI agents that plan, use tools and take action are rapidly entering medicine; benchmark results are striking, yet real-world evidence, cost and safety questions remain unresolved.
Over the past two years, the conversation about artificial intelligence in medicine has moved a step beyond question-answering chatbots toward systems that plan on their own, use tools and take action. These are now called "agentic AI." Unlike a passive "suggest and wait for approval" model, an agentic system decomposes a problem into steps, searches the web when needed, runs code, accesses a laboratory or record system, and may even coordinate with other agents. In The Lancet, Zou and Topol frame this as "the rise of agentic AI teammates in medicine." Drawing on the 2025-2026 evidence, this article takes a measured look at what these agents have genuinely achieved and what they have not yet achieved.
What is agentic AI, and how does it differ from a chatbot?
A language model (LLM) on its own generates text; an agentic system uses that model as a "brain" and adds autonomy and tools: multi-step planning, external tool calls (web, code execution, access to records or labs), and division of labour across agents. Where a single model answers a diagnostic question directly, an agentic orchestrator runs a loop such as "order this test, read the result, narrow the differential, then commit to a decision."
It is worth being honest here: the field has no shared definition, and "agentic" is partly marketing language. Tasks that models could not do in 2024 are deferred to "agents" in 2025, and what agents cannot do in 2025 is deferred to "agentic systems" in 2026 — the goalposts keep moving into the future. A tool wearing the "agentic" label does not mean it is safe or approved for clinical use.
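To make the orchestrator loop described in this section concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the test catalogue, the costs, the rule table and the function name `run_agent` are assumptions, not taken from any cited system. In a real agent an LLM would choose the next test; here a fixed plan stands in for that step.

```python
from dataclasses import dataclass, field

# Hypothetical toy catalogue: each test has a simulated cost and a fixed result.
TEST_CATALOGUE = {
    "troponin": {"cost": 50, "result": "elevated"},
    "chest_xray": {"cost": 120, "result": "normal"},
}

# Hypothetical rules: which diagnoses a given (test, result) pair rules out.
RULES = {
    ("troponin", "elevated"): {"musculoskeletal pain", "reflux"},
    ("chest_xray", "normal"): {"pneumonia"},
}


@dataclass
class DiagnosticState:
    differential: set
    spent: int = 0
    log: list = field(default_factory=list)


def run_agent(differential, planned_tests, budget):
    """Order tests one at a time, narrow the differential, then commit or abstain."""
    state = DiagnosticState(differential=set(differential))
    for test in planned_tests:
        if len(state.differential) <= 1:
            break  # nothing left to narrow
        cost = TEST_CATALOGUE[test]["cost"]
        if state.spent + cost > budget:
            state.log.append(f"skip {test}: over budget")
            continue
        result = TEST_CATALOGUE[test]["result"]  # "read the result"
        state.spent += cost
        state.differential -= RULES.get((test, result), set())  # "narrow"
        state.log.append(f"{test} -> {result}")
    # Commit only when exactly one candidate remains; otherwise abstain.
    decision = next(iter(state.differential)) if len(state.differential) == 1 else None
    return decision, state


candidates = {"myocardial infarction", "pneumonia", "reflux", "musculoskeletal pain"}
decision, state = run_agent(candidates, ["troponin", "chest_xray"], budget=200)
print(decision, state.spent, state.log)
```

The sketch abstains (returns `None`) when the budget runs out before the differential is down to one candidate, which is one simple way to keep "commit to a decision" from becoming a forced guess.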
Diagnostic benchmarks: impressive yet treacherous
The most talked-about result of 2025 was Microsoft's MAI-DxO system. Researchers converted 304 NEJM clinicopathological conference cases into a step-by-step diagnostic scenario (ask a question → order a test → narrow → commit) in an evaluation called SDBench. The orchestrator-driven MAI-DxO reached 85.5% accuracy, while a cohort of experienced physicians working without access to books or colleagues scored roughly 20%. As single models, o3 scored 78.6% and GPT-4o 49.3%. The gap is striking, but two caveats are essential: the cases are not real clinic encounters but curated NEJM cases (selected, hard, "puzzle"-type), and orchestration incurs a high simulated test cost per case.
The most important counterweight to this "superhuman" impression is the deceptiveness of static testing. The AgentClinic benchmark showed that when classic multiple-choice MedQA questions were converted into a sequential (interactive) decision format, accuracy fell sharply across all models — in some cases to below one-tenth of the original score. Furthermore, when 24 cognitive biases (such as overlooking a new symptom) were embedded into the cases, diagnostic accuracy dropped further still. In short, a model passing USMLE-style exams does not mean it will succeed at sequential clinical reasoning.
Exam success ≠ clinical readiness
A high score on static multiple-choice questions does not carry over to the sequential flow of real clinical work: asking, ordering tests and deciding. AgentClinic and SDBench clearly show how accuracy collapses once the format changes. A benchmark victory is no guarantee of safety at the bedside.
Agents beat the base model — but it depends on architecture and task
A systematic review from a Mount Sinai team, published on medRxiv with a PROSPERO pre-registration, pooled 20 studies from 2024-2025. The finding was consistent: every agent system was more accurate than the LLM it was built on. The largest gains came from tool-calling single-agent designs, with a median improvement of +53 points (IQR 36-56.9). Gains in multi-agent systems were more modest (+14% without tools, +17% with tools), and the best performance was generally achieved with 4-5 agents; beyond that, performance declined, giving an inverted-U pattern. The greatest benefit appeared in discrete, auditable micro-tasks such as drug-dose calculation and evidence/literature retrieval.
Yet the same review carries a critical limitation: 65% of the studies used only synthetic data, all were single-centre, and only one was a randomized controlled trial (RCT). The large effects were therefore measured mostly in artificial test environments; no effect on patient outcomes (mortality, complications) was demonstrated.
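For readers less familiar with the summary statistics quoted from the review, the following sketch shows how a median gain and an interquartile range can be computed from per-study improvements. The `gains` values are invented for illustration and are not the review's data; quartile conventions also differ between tools, so the method used here (Python's "inclusive" method) is an assumption rather than the review's stated procedure.

```python
import statistics

# Hypothetical per-study gains in percentage points
# (agent accuracy minus base-LLM accuracy). Not the review's data.
gains = [30.0, 36.0, 48.0, 53.0, 55.0, 57.0, 60.0]


def summarize(values):
    """Return (median, first quartile, third quartile) of a list of gains."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, q1, q3


median, q1, q3 = summarize(gains)
print(f"median gain {median:.1f} points (IQR {q1:.1f}-{q3:.1f})")
```

A median with an IQR describes the spread across studies; it says nothing about whether any single study's gain would hold up in a clinical setting.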
Clinical RCTs: collaboration design decides everything
The RCT titled "From Tool to Teammate" (70 US-licensed physicians; Stanford, BIDMC, Vanderbilt and others) tested human-AI collaboration in diagnostic reasoning and was published first on medRxiv and then in the peer-reviewed npj Digital Medicine (March 2026). The findings are both encouraging and cautionary:
- With traditional sources (UpToDate/PubMed/Google) accuracy was 75%, rising to 85% in the arm where AI gave a first opinion (+9.9 percentage points; 95% CI 4.7-15; p=0.0004). In the AI second-opinion arm it was 82% (+6.8 percentage points; p<0.001).
- However, the AI-alone arm reached 87-90% accuracy, and the physician+AI arms did not statistically surpass it (p=0.20). The slogan "physician + AI always beats AI alone" was not confirmed here; the issue is complementarity and interaction design.
- Anchoring worked in both directions: in the AI second-opinion arm the model, despite instructions, anchored to the physician's input (48% exact overlap on diagnoses, 52% on next steps). This shows that the "sycophancy" tendency of LLMs can corrupt a supposedly independent second opinion.
- A safety signal: in 8% of cases the actionable-decision score fell after the AI interaction, meaning an AI suggestion can sometimes degrade performance rather than improve it.
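Two quantities in the list above are easy to misread, so a short sketch may help. The first pair of functions separates an absolute difference in accuracy between arms from a relative change. The second shows one simple way an "exact overlap" rate between physician and AI answers could be computed; the trial's actual matching procedure is not described here, so the case-insensitive string comparison is an assumption, and the answer lists are invented.

```python
def percentage_point_gain(baseline_pct, new_pct):
    """Absolute difference between two accuracies given in percent."""
    return new_pct - baseline_pct


def relative_gain_pct(baseline_pct, new_pct):
    """Change expressed as a percentage of the baseline accuracy."""
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


def exact_overlap_rate(physician_answers, ai_answers):
    """Share of cases where both answers match after trimming and lower-casing.

    Assumed definition for illustration; not the trial's stated method.
    """
    pairs = list(zip(physician_answers, ai_answers))
    if not pairs:
        raise ValueError("need at least one case")
    matches = sum(p.strip().lower() == a.strip().lower() for p, a in pairs)
    return matches / len(pairs)


# Rounded arm accuracies from the text: 75% (traditional) and 85% (AI first opinion).
print(percentage_point_gain(75, 85))
print(round(relative_gain_pct(75, 85), 1))

# Invented answers for four cases.
physician = ["Sarcoidosis", "lupus", "Gout", "Pneumonia"]
ai = ["sarcoidosis", "Lupus ", "septic arthritis", "pulmonary embolism"]
print(exact_overlap_rate(physician, ai))
```

With the rounded figures the absolute difference comes out at 10 points, slightly above the reported +9.9, which presumably reflects unrounded or model-based estimates in the trial.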
Ambient documentation: agentic AI's most mature but narrowest form
Today, the strongest RCT-level evidence in clinical AI is not in autonomous diagnosis but in ambient documentation. In UCLA's pragmatic RCT (238 ambulatory clinicians, 14 specialties; DAX Copilot, Nabla and usual-care arms), ambient scribes significantly reduced time per note; in the Nabla arm note time fell by roughly 41 seconds. UW Health data reported about 30 minutes less documentation per provider per day. These results were reported in NEJM AI and on medRxiv.
A deliberate boundary must be drawn here: an ambient scribe is a narrow, transcription-based, human-approved tool. It turns the clinician's speech into a note; it does not order tests, adjust doses, or run chained actions in the record system. So although it offers the most mature clinical benefit, it is not the same as true agentic AI in the "autonomous plan-and-act" sense and should not be conflated with it.
Concrete successes in research and the laboratory
Moving away from diagnosis toward research, the evidence is stronger. Stanford's Biomni agent mined tools and protocols across 25 biomedical domains and showed strong generalization on tasks such as causal gene prioritization and rare-disease analysis without task-specific training. CRISPR-GPT (Nature Biomedical Engineering, 2025) autonomously planned gene-editing experiments, and its recommendations were validated with wet-lab experiments in real cell lines (four gene knockouts, two epigenetic activations). These examples represent auditable domains where agents can deliver high value as a "co-pilot."
Cost, hallucination and an honest comparison
There is also a "cold shower" for the positive benchmark headlines. A study evaluating advanced agent systems such as OpenManus and Manus (npj Digital Medicine, 2026) found only modest gains over the base LLM despite enhanced tool access. In the same study token usage rose more than 10-fold and latency more than 2-fold; although 89.9% of hallucinations were filtered by in-agent safeguards, residual hallucination remained at a frequency unacceptable in the clinic. A separate evaluation taxonomy (Vatsal et al., 2026) showed that about 98% of agent systems lack distribution-shift (drift) detection and 92% lack event-triggered activation — meaning real-world monitoring and safety infrastructure is still very weak.
| Approach / Study | Headline result | Evidence type and caveat |
|---|---|---|
| MAI-DxO orchestrator (SDBench) | 85.5% (physicians ~20%) | Benchmark; artificial NEJM cases, high cost |
| Single model o3 / GPT-4o (SDBench) | 78.6% / 49.3% | Benchmark; sequential format |
| Tool-calling single-agent (systematic review) | median +53 points | 65% synthetic data, single-centre, 1 RCT |
| AI first opinion vs traditional (RCT) | 85% vs 75% (+9.9 percentage points) | RCT; did not surpass AI-alone 87-90% |
| Ambient scribe (UCLA RCT) | ~41 sec less per note | RCT; narrow, human-approved, not autonomous |
| OpenManus/Manus (npj, 2026) | modest gain | Tokens >10×, latency >2×, hallucination |
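The 89.9% filtering figure quoted in this section can sound reassuring until it is turned into a residual rate. The sketch below does that arithmetic. The raw hallucination rate and the baseline token count are hypothetical inputs chosen for illustration; only the 0.899 filter fraction and the 10-fold lower bound on token use come from the text.

```python
FILTER_FRACTION = 0.899  # share of hallucinations caught by in-agent safeguards (from the text)
TOKEN_MULTIPLIER = 10    # lower bound; the text says "more than 10-fold"


def residual_rate(raw_rate, filter_fraction=FILTER_FRACTION):
    """Proportion of outputs still containing a hallucination after filtering."""
    if not 0.0 <= raw_rate <= 1.0:
        raise ValueError("raw_rate must be a proportion between 0 and 1")
    return raw_rate * (1.0 - filter_fraction)


def residual_per_thousand(raw_rate, filter_fraction=FILTER_FRACTION):
    """Expected number of unfiltered hallucinations per 1,000 outputs."""
    return 1000.0 * residual_rate(raw_rate, filter_fraction)


def agent_tokens(base_tokens, multiplier=TOKEN_MULTIPLIER):
    """Lower-bound token use of the agent given the base model's token use."""
    return base_tokens * multiplier


# Hypothetical inputs: 10% raw hallucination rate, 2,000 tokens per base-model answer.
print(round(residual_per_thousand(0.10), 1))
print(agent_tokens(2000))
```

Under these assumed inputs roughly ten outputs in every thousand would still carry a hallucination, which illustrates why a high filter fraction alone does not settle the safety question.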
Regulatory framework and the accountability gap
The WHO's January 2024 guidance on large multi-modal models (LMMs) sets out more than 40 recommendations emphasizing protection of autonomy, accountability, transparency and human oversight, and holds developers responsible for design flaws. In the US, the number of FDA-cleared AI devices has passed 1,300, but the overwhelming majority are narrow diagnostic imaging tools; there is as yet no separate clearance category for autonomous clinical-decision agents, and such systems face the strictest pathway. In Europe, the MDR and the AI Act mandate human oversight for high-risk systems. Specialty journals (The Lancet Rheumatology, Radiology) increasingly propose tiered governance: hierarchical safety architectures of agents overseeing agents.
The most unresolved problem is accountability. When an autonomous chain causes harm, it is unclear who is responsible; the term "agentic" risks blurring the line between decision support and true autonomy. Moreover, the basic tasks most suited to automation (dose calculation, evidence retrieval) are the very tasks through which trainees build clinical expertise — raising a concern about educational erosion.
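One way to picture the tiered governance idea mentioned in this section is a gate that decides, per action, whether an agent may proceed, needs a supervising agent's sign-off, or must wait for a human. The sketch below is a simplified illustration under stated assumptions: the tier names, the action-to-tier mapping and the `review` function are all hypothetical and do not describe any published architecture or regulatory requirement.

```python
from enum import Enum


class Tier(Enum):
    LOW = 1     # e.g. evidence retrieval
    MEDIUM = 2  # e.g. a dose calculation drafted for review
    HIGH = 3    # e.g. placing an order in the record system


# Hypothetical mapping of action names to risk tiers.
ACTION_TIERS = {
    "retrieve_evidence": Tier.LOW,
    "calculate_dose": Tier.MEDIUM,
    "place_order": Tier.HIGH,
}


def review(action, supervisor_ok=False, human_approved=False):
    """Return "allow" or "escalate_to_human" for a proposed agent action."""
    # Unknown actions default to the strictest tier.
    tier = ACTION_TIERS.get(action, Tier.HIGH)
    if tier is Tier.LOW:
        return "allow"
    if tier is Tier.MEDIUM and (supervisor_ok or human_approved):
        return "allow"
    if tier is Tier.HIGH and human_approved:
        return "allow"
    return "escalate_to_human"


print(review("retrieve_evidence"))
print(review("calculate_dose", supervisor_ok=True))
print(review("place_order", supervisor_ok=True))
```

In this sketch a supervising agent can clear medium-risk actions but never high-risk ones, so the human stays in the loop exactly where an error would be hardest to undo. It does not resolve the accountability question; it only makes explicit who approved what.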
Conclusion
As of 2025-2026, agentic AI has begun a genuine paradigm shift in medicine: tool-calling agents consistently beat base language models, delivering large gains on discrete tasks such as dose calculation and evidence retrieval; ambient documentation reduces documentation time and burnout at the RCT level; and well-designed clinician-AI collaboration can raise diagnostic accuracy. These are not achievements to dismiss.
Yet the other half of the evidence is plainly cautionary: real-world, multi-centre, prospective patient-outcome data do not yet exist; most studies are synthetic and single-centre. Benchmark victories do not transfer to sequential clinical flow or to different populations; anchoring, sycophancy and hallucination can corrupt independent judgement; computational cost is high; and the monitoring infrastructure and accountability framework remain immature. The balanced reading is this: agentic AI can deliver value today as a supervised co-pilot, but the prospective safety and outcome evidence required to justify an autonomous clinical-decision "teammate" has not yet accumulated. That enthusiasm must be matched, without exaggeration, by the rigorous studies that will close this evidence gap.
References
- Zou J, Topol EJ. The rise of agentic AI teammates in medicine. The Lancet. 2025.
- Nori H, et al. Sequential Diagnosis with Language Models (MAI-DxO / SDBench). Microsoft AI, arXiv. 2025.
- Gorenshtein A, et al. AI Agents in Clinical Medicine: A Systematic Review. medRxiv. 2025.
- From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis. npj Digital Medicine. 2026.
- Liu X, et al. Benchmarking large language model-based agent systems for clinical decision tasks. npj Digital Medicine. 2026.
- Ambient AI scribe pragmatic RCT (UCLA; DAX/Nabla). NEJM AI / medRxiv. 2025.
- Schmidgall S, et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv. 2024-2025.
- Agentic AI in rheumatology: tiered governance and risks. The Lancet Rheumatology. 2026.
- CRISPR-GPT for agentic automation of gene-editing experiments. Nature Biomedical Engineering. 2025.
- World Health Organization. Ethics and Governance of Artificial Intelligence for Health: Guidance on Large Multi-Modal Models. WHO. 2024.
- Vatsal S, et al. Agentic AI in Healthcare and Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation. arXiv. 2026.
- Huang K, et al. Biomni: A General-Purpose Biomedical AI Agent. bioRxiv. 2025.