
Microsoft AI Diagnoses Complex Medical Cases with 85% Accuracy
Microsoft has introduced a new AI system designed to tackle some of medicine’s most challenging diagnostic puzzles, and early results suggest it outperforms human doctors by a wide margin on a benchmark of difficult published cases.
In internal research released by the Microsoft AI team, the company’s new AI Diagnostic Orchestrator, known as MAI-DxO, correctly diagnosed up to 85% of real-world medical cases published in the New England Journal of Medicine (NEJM)—compared to just 20% accuracy among experienced clinicians.
The study used 304 difficult cases from NEJM’s Case Records of the Massachusetts General Hospital series. These cases typically require input from multiple specialists and extensive testing. Microsoft turned them into interactive “stepwise” challenges that mirror how doctors reason through a diagnosis in practice: collecting symptoms, ordering tests, and revising hypotheses along the way.
“We believe that orchestrating multiple language models will be critical to managing complex clinical workflows,” Microsoft AI researchers wrote, noting that MAI-DxO can deliver better performance while using fewer resources.
Moving Beyond Multiple-Choice Benchmarks
Traditional AI benchmarks in healthcare have relied heavily on multiple-choice exams such as the United States Medical Licensing Examination (USMLE). But Microsoft argues these tests overstate AI’s abilities by rewarding memorization over clinical reasoning.
To address that, the team developed a new Sequential Diagnosis Benchmark (SDBench), which turns NEJM case studies into cost-aware diagnostic simulations. Both humans and AI models can ask questions and order virtual tests, and each action incurs a simulated cost that mirrors real healthcare spending. For example, when a patient presents with a cough and fever, a clinician might first order blood tests and a chest X-ray before confidently diagnosing pneumonia.
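The cost-aware, step-by-step format can be pictured with a small sketch. This is not Microsoft’s benchmark code: the `Episode` class, the test names, the prices, and the findings are all hypothetical, chosen only to show how questions, tests, and a running cost might be tracked for the cough-and-fever example.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD, chosen only for illustration (not from the study).
TEST_COSTS = {"blood_test": 50.0, "chest_xray": 120.0}


@dataclass
class Episode:
    """One simulated case: hidden findings are revealed on request, costs add up."""
    findings: dict           # maps a question topic or test name to its result
    truth: str               # the reference diagnosis for this case
    total_cost: float = 0.0
    log: list = field(default_factory=list)

    def ask(self, topic: str) -> str:
        # Assumption: questions are free in this sketch; only tests cost money.
        self.log.append(("ask", topic))
        return self.findings.get(topic, "no information")

    def order_test(self, name: str) -> str:
        # Ordering a test reveals its result and charges its price.
        self.total_cost += TEST_COSTS[name]
        self.log.append(("test", name))
        return self.findings.get(name, "unremarkable")

    def diagnose(self, label: str) -> bool:
        # The final diagnosis is compared with the reference, ignoring case.
        self.log.append(("diagnose", label))
        return label.strip().lower() == self.truth.lower()


# The cough-and-fever example from the text, with made-up findings.
episode = Episode(
    findings={
        "symptoms": "cough and fever for three days",
        "blood_test": "elevated white cell count",
        "chest_xray": "right lower lobe consolidation",
    },
    truth="pneumonia",
)
episode.ask("symptoms")
episode.order_test("blood_test")
episode.order_test("chest_xray")
correct = episode.diagnose("Pneumonia")
print(correct, episode.total_cost)
```

Scoring an agent on both `correct` and `total_cost` is what separates this kind of benchmark from a multiple-choice exam, where only the final answer counts.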
Microsoft tested its AI system against a wide range of top-performing language models, including models from the GPT, Llama, Claude, Gemini, Grok, and DeepSeek families. Each model was evaluated on the same Sequential Diagnosis Benchmark, with step-by-step questioning and diagnostic testing.
MAI-DxO acted as an orchestrator—essentially a virtual panel of physicians—by coordinating multiple models with diverse diagnostic strategies. This orchestration significantly improved the performance of every individual model tested.
Orchestrators like MAI-DxO can draw on a wider range of information sources than any single model. Microsoft says the approach also improves safety, transparency, and adaptability, key qualities in complex, fast-changing clinical environments. Because MAI-DxO is model-agnostic, it is not tied to a single underlying model, which supports the auditability and resilience that high-stakes medical decision-making requires.
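One simple way to picture a “virtual panel” is a majority vote over several diagnostic strategies. The sketch below is only an illustration of that general idea, not a description of how MAI-DxO works internally; the `orchestrate` function and the three rule-based strategies are hypothetical stand-ins for real models.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# A "strategy" is any callable that maps case findings to a diagnosis label.
Strategy = Callable[[dict], str]


def orchestrate(case: dict, panel: Sequence[Strategy]) -> Tuple[str, float]:
    """Return the most common diagnosis and the share of the panel backing it."""
    votes = Counter(member(case) for member in panel)
    diagnosis, count = votes.most_common(1)[0]
    return diagnosis, count / len(panel)


# Hypothetical stand-ins for models with different diagnostic strategies.
def imaging_first(case: dict) -> str:
    return "pneumonia" if "consolidation" in case.get("chest_xray", "") else "unknown"


def labs_first(case: dict) -> str:
    return "pneumonia" if "elevated" in case.get("blood_test", "") else "unknown"


def symptoms_only(case: dict) -> str:
    return "bronchitis" if "cough" in case.get("symptoms", "") else "unknown"


case = {
    "symptoms": "cough and fever",
    "blood_test": "elevated white cell count",
    "chest_xray": "right lower lobe consolidation",
}
diagnosis, agreement = orchestrate(case, [imaging_first, labs_first, symptoms_only])
print(diagnosis, round(agreement, 2))
```

Because the panel is just a list of callables, any member can be replaced without touching the voting logic, which is the practical meaning of a model-agnostic design. The agreement share also gives a reviewer a simple signal to audit.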
The strongest results came from pairing MAI-DxO with OpenAI’s o3 model, a combination that correctly diagnosed 85.5% of the 304 NEJM benchmark cases. For comparison, Microsoft also assessed 21 practicing physicians from the U.S. and U.K., each with between 5 and 20 years of clinical experience. On the same benchmark, those clinicians achieved an average diagnostic accuracy of just 20%.
Cost-Efficient Care and Broader Expertise
Beyond accuracy, MAI-DxO also outperformed physicians in cost efficiency. The system is designed to operate within defined cost limits rather than ordering every possible test, balancing diagnostic power against resource use, patient comfort, and timeliness of care. In testing, MAI-DxO delivered more accurate diagnoses at a lower overall cost than both individual AI models and human physicians.
The research highlights a key difference between human and AI diagnosis: while physicians often specialize in either broad general care or narrow areas of expertise, no single doctor can cover the full range of complex cases seen in the NEJM series. MAI-DxO isn’t limited in the same way. It blends both breadth and depth of medical knowledge, demonstrating clinical reasoning skills that, in many areas, surpass those of any individual physician.
Microsoft’s broader health AI initiatives include tools like RAD-DINO to accelerate radiology workflows and Dragon Copilot, a voice-first AI assistant for clinicians. These efforts now complement a new consumer health initiative launched in late 2024, aiming to serve the millions who turn to Microsoft’s platforms daily for health-related advice.

Comparison of AI-powered diagnostic agents by accuracy and average diagnostic test cost per case. Top-performing agents appear toward the top-left quadrant, reflecting higher accuracy and lower cost. The lower dotted line represents the performance range of the best individual foundation models. The purple line traces the performance of MAI-DxO across different configurations. The red cross indicates the average performance of 21 practicing physicians. Image Source: Microsoft
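A chart like this can be read as a trade-off frontier: an agent is worth considering only if no other agent is both at least as cheap and at least as accurate. The sketch below shows that check. The agent names and numbers are placeholders invented for illustration and are not results from Microsoft’s study.

```python
from typing import List, Tuple

# (name, average test cost per case in USD, accuracy from 0 to 1)
Agent = Tuple[str, float, float]


def pareto_front(agents: List[Agent]) -> List[str]:
    """Return the names of agents that no other agent beats on both cost and accuracy."""
    front = []
    for name, cost, acc in agents:
        dominated = any(
            # The other agent is no worse on both axes and strictly better on one.
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in agents
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values only; they do not come from the study.
agents = [
    ("agent_a", 100.0, 0.50),
    ("agent_b", 200.0, 0.70),
    ("agent_c", 250.0, 0.60),  # costs more than agent_b and is less accurate
    ("agent_d", 400.0, 0.80),
]
front = pareto_front(agents)
print(front)
```

In the figure’s terms, agents on this front are the ones closest to the top-left corner for their cost level. An agent like `agent_c` sits below and to the right of a better option.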
Looking Ahead
Microsoft’s findings suggest that AI systems like MAI-DxO could reshape how care is delivered. By blending broad and deep medical knowledge, these tools could help patients self-manage routine concerns and provide clinicians with powerful support on complex cases. The potential impact is especially important in the U.S., where health spending is approaching 20% of GDP—and up to 25% of that may be wasted on services that don’t improve outcomes.
Still, the research comes with clear limitations. While MAI-DxO performed exceptionally on complex diagnostic challenges, more work is needed to evaluate how it handles common, day-to-day medical presentations.
To fairly compare AI to human clinicians, participating physicians in the study worked without the usual support tools—no colleagues, textbooks, or AI systems—which may understate real-world doctor performance.
One of the most novel aspects of Microsoft’s work is its attention to cost. Although healthcare costs vary widely across systems and regions, the team applied a consistent methodology across all cases to better understand the balance between diagnostic accuracy and resource use.
Microsoft views this as a first step. The company is now partnering with leading health organizations to test these systems in real-world clinical settings, with a focus on safety, reliability, efficacy, and regulatory oversight. That validation will be critical before any broader deployment.
The future of healthcare, Microsoft believes, will be shaped by augmenting human skill and empathy with the power of machine intelligence.
What This Means
Microsoft frames MAI-DxO as a step toward what it calls medical superintelligence: AI that can not only replicate human expertise but extend it across a wider range of diagnostic challenges. By outperforming experienced doctors on accuracy and cost in this benchmark, the system highlights the growing role AI could play in reshaping how care is delivered.
The implications are sweeping: AI could reduce misdiagnoses, lower healthcare costs, and help patients manage conditions with less reliance on overwhelmed medical systems. With U.S. healthcare spending nearing 20% of GDP—and up to one-quarter of that considered waste—tools like MAI-DxO could play a key role in improving efficiency and outcomes.
But this is still early-stage research. Microsoft acknowledges that more testing is needed in everyday medical scenarios, and that regulatory oversight will be essential before broad deployment.
Still, the results show what’s now possible when advanced AI reasoning is paired with rigorous clinical benchmarks. If responsibly developed, systems like MAI-DxO could become a trusted partner—not a replacement—for human care.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.