AI Triage and Symptom Checkers: How Reliable Are the Digital Gatekeepers?
More and more people who notice a symptom and wonder "Should I go to the ER, book an appointment, or wait at home?" now put that question to an app. AI-based symptom assessment tools are becoming a new gateway to the health system. How accurate are these digital gatekeepers, and when is it safe to rely on them? This article looks at the evidence.
Triage is one of medicine's oldest and most critical decisions: who needs care, and how urgently? Traditionally this decision is made by a nurse or physician. In recent years symptom checker apps and large language models have brought it directly into the patient's pocket. The user enters their symptoms; the app provides possible causes and an acuity recommendation (emergency department / primary care / home monitoring). The promise is appealing: reduce unnecessary ED visits, ease access, and route patients to the right level of care. Whether that promise holds up is a question for the evidence.
A Prominent Example: Ada Health
One of the most studied examples in this field is Ada Health. Ada launched in 2016; its diagnostic algorithm was originally developed to help clinicians diagnose rare diseases, and over time it grew into a widely used consumer app in Germany, the United Kingdom, and the United States. A key point is that Ada explicitly positions itself as a triage and information tool: it does not diagnose, it only provides information and direction. That boundary is both honest and important from a regulatory standpoint.
Diagnosis and triage are different tasks
"What is this disease?" (diagnosis) and "What should this person do right now?" (triage) are separate questions. Evidence shows that symptom-checker tools generally do better at triage than at diagnosis, because an acuity recommendation that errs on the safe side is easier than landing the exact right diagnosis.
What Does the Evidence Say?
Independent studies have measured the performance of these tools for some time. Comprehensive studies in BMJ Open comparing popular symptom checkers against physician panels paint a consistent picture: diagnostic accuracy is variable, and physicians substantially outperform symptom checkers at listing the correct diagnosis first. On the "first guess is right" criterion, the human physician remains clearly ahead.
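The "first guess is right" criterion can be made concrete as top-1 accuracy. The sketch below is only an illustration: the function name `top_k_accuracy`, the example conditions, and the idea of also reporting a top-3 figure are assumptions for this article, not the scoring method of any cited study.

```python
def top_k_accuracy(cases, k=1):
    """Share of cases whose correct diagnosis appears in the first k suggestions.

    `cases` is a list of (ranked_suggestions, correct_diagnosis) pairs.
    """
    if not cases:
        raise ValueError("at least one case is required")
    hits = sum(1 for suggestions, truth in cases if truth in suggestions[:k])
    return hits / len(cases)


# Made-up vignettes for illustration; not taken from any study.
cases = [
    (["migraine", "tension headache", "sinusitis"], "migraine"),
    (["gastritis", "appendicitis", "gastroenteritis"], "appendicitis"),
    (["common cold", "influenza", "allergic rhinitis"], "pneumonia"),
]

top1 = top_k_accuracy(cases, k=1)  # correct diagnosis listed first
top3 = top_k_accuracy(cases, k=3)  # correct diagnosis anywhere in the first three
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

A tool can look much better on a top-3 measure than on top-1, which is why it matters which criterion a study reports.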
On triage the picture is more mixed but more hopeful. In comparative studies using ED data, the rate of correct triage recommendations across platforms is reported at around 58%, with about 20–30% of recommendations being "overtriage" (a more urgent referral than warranted) and 10–15% "undertriage" (a less urgent referral than warranted). The two kinds of error are not equally serious: overtriage wastes resources and causes unnecessary anxiety, while undertriage, which means sending a serious case home, is a direct safety risk and the most critical weakness of these tools.
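The three rates above come from comparing each recommendation with a reference acuity level. Here is a minimal sketch of that bookkeeping, assuming the three-level scale named earlier in this article; the names `ACUITY` and `triage_rates` and the example cases are hypothetical and do not reproduce any study's data or protocol.

```python
from collections import Counter

# Ordinal acuity scale, using the three levels named in this article.
ACUITY = {"home": 0, "primary_care": 1, "emergency": 2}


def triage_rates(pairs):
    """Return the share of correct, overtriaged and undertriaged recommendations.

    `pairs` is a list of (recommended, reference) acuity labels, where the
    reference is the level judged appropriate by clinicians.
    """
    if not pairs:
        raise ValueError("at least one case is required")
    counts = Counter()
    for recommended, reference in pairs:
        diff = ACUITY[recommended] - ACUITY[reference]
        if diff == 0:
            counts["correct"] += 1
        elif diff > 0:
            counts["overtriage"] += 1   # more urgent than warranted
        else:
            counts["undertriage"] += 1  # less urgent than warranted
    total = len(pairs)
    return {key: counts[key] / total for key in ("correct", "overtriage", "undertriage")}


# Made-up cases for illustration; not data from any study.
cases = [
    ("emergency", "emergency"),
    ("emergency", "primary_care"),
    ("primary_care", "primary_care"),
    ("home", "primary_care"),
    ("home", "home"),
]
rates = triage_rates(cases)
print(rates)
```

Reporting the two error directions separately, rather than a single accuracy figure, keeps the safety-relevant undertriage rate visible.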
A New Player: Large Language Models
A study published in npj Digital Medicine in 2025 directly compared online symptom-assessment apps, large language models (LLMs), and laypeople on self-triage decisions. LLMs entered this space as strong candidates, offering on some tasks a more flexible and natural interaction than rule-based symptom checkers. But they brought their known risks with them: the same models can give confidently incorrect acuity recommendations and accept a user's mistaken description of symptoms without question. In other words, changing the technology does not eliminate the underlying safety questions; it only changes their form.
The User Factor: Output Is Only as Good as Input
A critical but under-discussed factor that determines real-world accuracy is the user. A symptom checker is only as good as the description of symptoms it receives. A user unfamiliar with medical terminology may misdescribe "chest pain" as muscle pain or omit a key accompanying symptom (such as shortness of breath or sweating). The rich context available to a physician — body language, tone, physical exam findings, prior history — is largely hidden from the app. The output of these tools is therefore valuable as a starting point, not as a final decision.
Correct Positioning
None of this evidence says AI triage is worthless; it says its place must be clearly defined. Reasonable uses include a pre-assessment that routes patients to the right level of care, a first point of guidance outside physician hours, and a knowledge resource that improves health literacy. The golden rule of safe design is to lean toward the safer side under uncertainty (safety-netting): when in doubt, the tool should direct the user to the more cautious, higher level of acuity, and clearly state red-flag warnings such as "if you experience these symptoms, go to the emergency department immediately."
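The safety-netting rule can be sketched in a few lines. This is an illustration under stated assumptions, not how any real product works: `safe_recommendation`, `ACUITY_ORDER`, and the `RED_FLAGS` entries are hypothetical names and placeholder values, and the red-flag list is not clinical guidance.

```python
# Acuity levels from least to most urgent, as used in this article.
ACUITY_ORDER = ["home", "primary_care", "emergency"]

# Hypothetical red-flag phrases, for illustration only; not clinical guidance.
RED_FLAGS = {
    "chest pain with shortness of breath",
    "chest pain with sweating",
}


def safe_recommendation(candidate_levels, reported_symptoms):
    """Pick the most cautious acuity level among the plausible candidates.

    `candidate_levels` holds every level the tool considers plausible;
    any reported red-flag symptom overrides them and escalates to emergency.
    """
    if RED_FLAGS & set(reported_symptoms):
        return "emergency"
    if not candidate_levels:
        raise ValueError("at least one candidate level is required")
    # Under uncertainty, choose the highest acuity among the candidates.
    return max(candidate_levels, key=ACUITY_ORDER.index)


# The tool is unsure between home monitoring and primary care: lean cautious.
print(safe_recommendation(["home", "primary_care"], ["mild cough"]))
# A red flag escalates regardless of the other candidates.
print(safe_recommendation(["home"], ["chest pain with sweating"]))
```

The design choice is deliberate: this rule accepts more overtriage in exchange for less undertriage, which matches the asymmetry of the two risks.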
Conclusion
AI-based symptom assessment and triage tools have become a real part of how people access care, and used correctly they provide valuable initial guidance. But the evidence is clear: they do not replace the physician in diagnosis, they give the correct triage recommendation roughly one-half to two-thirds of the time, and their greatest risk is undertriage of serious conditions. The right framing is to see them as a compass that shortens the path to a physician, never as a diagnostic machine that takes the physician's place.
References
- Accuracy of online symptom assessment applications, large language models, and laypeople for self-triage decisions. npj Digital Medicine 2025;8:178. nature.com
- Gilbert S, et al. How accurate are digital symptom assessment apps for suggesting conditions and urgency advice? BMJ Open 2020. bmjopen.bmj.com
- Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians in an Emergency Department. JMIR mHealth and uHealth 2023. mhealth.jmir.org
- Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study. PMC9531004. ncbi.nlm.nih.gov
- Ada Health — How we test the performance of AI health assessment tools (methodology). about.ada.com