AI system matches top models' diagnostic accuracy while cutting medical costs


In a new study, Microsoft’s AI-powered diagnostic system solved some of the most challenging medical cases faster, more cheaply, and more accurately than experienced doctors.

Study: Sequential Diagnosis with Language Models. Image credit: metamorworks/Shutterstock.com

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

A recent study posted to the arXiv preprint server compared the diagnostic accuracy and resource expenditure of AI systems with those of clinicians on complex cases. The Microsoft AI team demonstrated how artificial intelligence (AI) can be applied efficiently in medicine to tackle diagnostic challenges that physicians struggle to unravel.

Sequential diagnosis and language models

Physicians often diagnose patients through a clinical reasoning process of step-by-step, iterative questioning and testing. Even with limited initial information, clinicians narrow down the possible diagnoses by questioning the patient and confirming their hypotheses through biochemical tests, imaging, biopsy, and other diagnostic procedures.
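
In pseudocode terms, that loop looks roughly like the following sketch; the `agent` and `patient` objects and all method names are illustrative assumptions, not anything from the paper:

```python
# A minimal sketch of the iterative diagnostic loop described above.
# `agent` and `patient` are hypothetical objects, not the paper's API.
def sequential_diagnosis(agent, patient):
    findings = [patient.initial_presentation()]   # limited starting information
    while True:
        action = agent.decide(findings)           # choose a question, a test, or a diagnosis
        if action.kind == "diagnose":
            return action.diagnosis               # commit once the evidence suffices
        findings.append(patient.respond(action))  # each answer or result narrows the differential
```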

Solving a complex case requires a wide-ranging set of skills, including determining the most informative next questions or tests, staying mindful of test costs to avoid increasing the patient's burden, and recognizing when the evidence supports a confident diagnosis.

Multiple studies have demonstrated that language models (LMs) perform well on medical licensing exams and highly structured diagnostic vignettes. However, most LMs were evaluated under artificial conditions that differ drastically from real-world clinical settings.

Most LM diagnostic assessments are based on multiple-choice quizzes, with the diagnosis selected from a predefined answer set. Compressing the sequential diagnostic cycle in this way risks letting static benchmarks overstate a model's competence. Furthermore, models evaluated this way risk indiscriminate test ordering and premature diagnostic closure. There is therefore an urgent need for AI systems evaluated on the full sequential diagnostic cycle, with the goal of improving diagnostic accuracy while reducing test costs.

About the study

To overcome these drawbacks of LM-based clinical diagnosis, the scientists developed the Sequential Diagnosis Benchmark (SDBench), an interactive framework for evaluating diagnostic agents (human or AI) through realistic, sequential clinical encounters.

To assess diagnostic accuracy, the current study utilized weekly cases published in The New England Journal of Medicine (NEJM), the world’s leading medical journal. This journal typically publishes case records of patients from Massachusetts General Hospital in a detailed, narrative format. These cases are among the most diagnostically challenging and intellectually demanding in clinical medicine, often requiring multiple specialists and diagnostic tests to confirm a diagnosis.

SDBench recast 304 cases from the 2017–2025 NEJM clinicopathological conferences (CPC) into stepwise diagnostic encounters. The medical data span clinical presentations through final diagnoses, ranging from common conditions (e.g., pneumonia) to rare disorders (e.g., neonatal hypoglycemia). Using the interactive platform, diagnostic agents decide which questions to ask, which tests to order, and when to commit to a diagnosis.

The Information Gatekeeper is a language model that discloses clinical details from a comprehensive case file only when explicitly queried. It can also generate additional case-consistent findings for tests not described in the original CPC narrative. Once an agent commits to a final diagnosis based on the information obtained from the Gatekeeper, its conclusion is scored against the real diagnosis, and the cumulative cost of all requested diagnostic tests is estimated using real-world prices. By evaluating diagnostic accuracy and cost together, SDBench indicates how close we are to high-quality care at a sustainable cost.
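
To make the mechanics concrete, here is a minimal sketch of how a Gatekeeper-style evaluation could be wired up. The class, method names, and scoring dictionary are hypothetical illustrations of the behavior described above, not the paper's actual code:

```python
class Gatekeeper:
    """Reveals case details only when explicitly queried (illustrative sketch)."""

    def __init__(self, case_file, price_list):
        self.case = case_file      # full NEJM-style case narrative (hypothetical object)
        self.prices = price_list   # per-test costs used for the cumulative tally
        self.spent = 0.0

    def ask(self, question):
        # Disclose only the detail that was explicitly requested.
        return self.case.lookup(question)

    def order_test(self, test):
        # Every ordered test adds to the running cost estimate; tests absent
        # from the original CPC narrative get case-consistent synthetic results.
        self.spent += self.prices.get(test, 0.0)
        return self.case.result_for(test)


def score_episode(final_diagnosis, true_diagnosis, gatekeeper):
    # SDBench's two axes: was the diagnosis right, and what did it cost?
    return {"correct": final_diagnosis == true_diagnosis,
            "cost": gatekeeper.spent}
```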

Study findings

The current study analyzed the performance of all diagnostic agents on SDBench. AI agents were evaluated on all 304 NEJM cases, while physicians were assessed on a held-out subset of 56 test-set cases; on this subset, the AI agents outperformed the physicians.

Physicians practicing in the USA and UK, with a median of 12 years of clinical experience, achieved 20% diagnostic accuracy at an average cost of $2,963 per case on SDBench, highlighting the benchmark's inherent difficulty. They spent an average of 11.8 minutes per case, asking 6.6 questions and ordering 7.2 tests. GPT-4o outperformed the physicians in both diagnostic accuracy and cost, while other commercially available off-the-shelf models offered varying trade-offs between accuracy and cost.

The current study also introduced the MAI Diagnostic Orchestrator (MAI-DxO), a platform co-designed with physicians that proved more diagnostically efficient than both human physicians and commercial language models. Compared with commercial LMs, MAI-DxO demonstrated higher diagnostic accuracy while cutting estimated medical costs by more than half. For instance, the off-the-shelf o3 model achieved 78.6% diagnostic accuracy at $7,850 per case, while MAI-DxO achieved 79.9% accuracy at just $2,397, or 85.5% at $7,184.
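
Taking those reported figures at face value, the cost saving at roughly matched accuracy works out as follows:

```python
# Cost comparison implied by the reported results (figures from the article).
o3_cost = 7850       # o3: 78.6% accuracy
maidxo_cost = 2397   # MAI-DxO: 79.9% accuracy
reduction = 1 - maidxo_cost / o3_cost
print(f"Cost reduction: {reduction:.0%}")  # ~69%, i.e. well over half
```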

MAI-DxO accomplished this by simulating a virtual panel of “doctor agents” with distinct roles: generating hypotheses, selecting tests, enforcing cost-consciousness, and checking for errors. Unlike baseline AI prompting, this structured orchestration allowed the system to reason iteratively and efficiently.
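
The orchestration pattern can be sketched as role-specialized prompts over one underlying model. The role names, instructions, and `llm.complete` interface below are assumptions made for illustration, paraphrasing the article's description rather than reproducing the paper's implementation:

```python
# Hedged sketch of a "virtual panel" orchestrator; everything here is illustrative.
PANEL_ROLES = {
    "hypothesis":    "Maintain a ranked differential diagnosis given the findings.",
    "test_chooser":  "Propose the single most informative next question or test.",
    "cost_steward":  "Veto any test whose cost outweighs its expected information.",
    "error_checker": "Flag contradictory evidence and premature diagnostic closure.",
}

def panel_step(llm, findings: str) -> str:
    # Each role reviews the case independently...
    opinions = {role: llm.complete(f"{instruction}\nFindings: {findings}")
                for role, instruction in PANEL_ROLES.items()}
    # ...then a final pass reconciles the panel into one next action.
    return llm.complete("Reconcile this panel discussion and choose one action "
                        "(ask a question, order a test, or diagnose):\n" + str(opinions))
```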

MAI-DxO is a model-agnostic approach that demonstrated accuracy gains across various language models, not just the o3 foundation model.

Conclusions and future outlooks

The current study's findings demonstrate that AI systems achieve higher diagnostic accuracy and cost-effectiveness when guided to think iteratively and act judiciously. SDBench and MAI-DxO provide an empirically grounded foundation for advancing AI-assisted diagnosis under realistic constraints.

In the future, MAI-DxO must be validated in real clinical environments, where diseases present at everyday prevalence rather than as rare, puzzle-like cases. Furthermore, large-scale interactive medical benchmarks extending beyond 304 cases are required. Incorporating visual and other sensory modalities, such as imaging, could also enhance diagnostic accuracy without compromising cost efficiency.

However, the authors note important limitations. NEJM CPC cases are selected for their difficulty and do not reflect everyday clinical presentations. The study did not include healthy patients or measure false positive rates. Moreover, diagnostic cost estimates are based on U.S. pricing and may vary globally.

The models were also tested on a held-out test set of recent cases (2024–2025) to assess generalization and avoid overfitting, as many of these cases were published after the training cutoff for most models.

The paper also raises a broader question: Should we compare AI systems to individual physicians or full medical teams? Since MAI-DxO mimics multi-specialist collaboration, the comparison may reflect something closer to team-based care than individual practice.

Nonetheless, the research suggests that structured AI systems like MAI-DxO may one day support or augment clinicians, particularly in settings where specialist access is limited or expensive.



Journal reference:

  • Preliminary scientific report. Nori, H., et al. (2025). Sequential Diagnosis with Language Models. arXiv. https://arxiv.org/abs/2506.22405