OpenAI Launches HealthBench to Evaluate AI in Healthcare

Image Source: ChatGPT-4o
OpenAI has introduced HealthBench, a new evaluation benchmark aimed at improving the way AI systems are tested and validated for use in healthcare settings. Designed with input from 262 physicians practicing across 60 countries, HealthBench is built to measure how well large language models perform in realistic, high-impact health scenarios.
The company announced HealthBench as part of its broader goal to ensure AI models can safely and effectively support human health. While large language models have the potential to expand access to information, assist clinicians, and empower patients, OpenAI emphasized that robust, real-world evaluations are essential for achieving these outcomes.
Why HealthBench Matters
Existing health-related AI evaluations often fall short, according to OpenAI, either by focusing on unrealistic test scenarios, lacking validation against medical expertise, or leaving little room for model improvement. HealthBench addresses these gaps with a framework centered on three core principles:
Meaningful: Scores reflect real-world clinical impact, going beyond basic exam questions to capture how individuals and clinicians actually interact with AI systems in practice.
Trustworthy: Evaluations are grounded in expert physician judgment, providing a solid foundation for improving AI systems.
Unsaturated: Benchmarks leave space for models to grow and improve over time.
OpenAI has also published baseline results for several of its latest models, aiming to set new performance standards for AI in healthcare.
Inside HealthBench
HealthBench is built on 5,000 simulated conversations between AI models and users — including both individual patients and healthcare professionals. These interactions were crafted through synthetic generation and adversarial testing to closely mirror real-world scenarios. They feature:
Multi-turn conversations in multiple languages.
Diverse medical contexts, spanning various specialties and healthcare roles.
Challenging examples, selected to test model limitations.
Each conversation is paired with a detailed, physician-created grading rubric; across the benchmark, these rubrics contain 48,562 unique evaluation criteria. The rubrics specify what an ideal AI response should include or avoid and weight each point based on clinical importance.
Model responses are assessed using a model-based grader (GPT‑4.1), which evaluates whether each criterion is met and assigns a total score.
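To make the mechanics concrete, here is a minimal sketch of how rubric-based scoring of this kind could work, assuming each criterion carries a point value (positive for content the response should include, negative for content it should avoid) and that the grader returns a met/not-met judgment per criterion. The data structures and normalization shown are illustrative assumptions, not OpenAI's published implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "Advises seeking emergency care for chest pain"
    points: float      # positive if the response should include it, negative if it should avoid it

def score_response(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score one model response against a physician-written rubric.

    `met[i]` is the grader's judgment of whether criterion i is satisfied.
    The total is the sum of points for met criteria, normalized by the
    maximum achievable (sum of positive points), then clipped to [0, 1].
    """
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, earned / max_possible)) if max_possible else 0.0

# Toy rubric with one required criterion and one penalized one.
rubric = [
    RubricCriterion("Recommends urgent evaluation for red-flag symptoms", points=8),
    RubricCriterion("Gives a specific diagnosis without enough information", points=-6),
]
print(score_response(rubric, met=[True, False]))  # 1.0: required item met, penalty avoided
```

In this framing, the model-based grader only has to make per-criterion judgments; the weighting and aggregation stay deterministic and auditable.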
Focus Areas: Themes and Axes
HealthBench organizes its evaluations into seven major themes, each reflecting a critical aspect of healthcare interactions:
Emergency referrals: Triaging urgent situations safely and avoiding delayed care.
Expertise-tailored communication: Adapting tone, terminology, and detail based on user expertise for clarity and usefulness.
Responding under uncertainty: Expressing appropriate caution when evidence is limited.
Response depth: Providing the right level of detail so users can make informed decisions.
Health data tasks: Safe handling of documentation and clinical support tasks.
Global health: Adjusting for regional differences in resources and practice norms across multiple languages.
Context seeking: Identifying when information is missing and prompting for additional details only when necessary.
Each conversation is also graded across several axes of model behavior, including communication quality, instruction following, factual accuracy, context awareness, and completeness.
Axes of Evaluation
In addition to themes, HealthBench evaluates AI model performance across five key axes, each capturing an important dimension of behavior in health-related interactions:
Communication quality: Assesses the clarity, structure, vocabulary, and level of detail in a model's response to ensure it is appropriate for the user and situation.
Instruction following: Measures how well a model adheres to specific user directions, such as formatting a response or completing a particular task as requested.
Accuracy: Evaluates whether the model provides factually correct information, supports its claims with evidence or consensus when available, and appropriately expresses uncertainty when evidence is limited.
Context awareness: Checks if the model responds appropriately based on the user’s context — such as recognizing their role, available resources, or setting — and whether it seeks clarification only when necessary.
Completeness: Ensures the model addresses all aspects of the query, including recognizing critical follow-up actions like advising emergency care for red-flag symptoms.
Each axis helps build a comprehensive view of model strengths and weaknesses, offering a more nuanced assessment than scoring final answers alone.
A chart in OpenAI's announcement highlights how leading AI models perform across key healthcare communication skills, including instruction following, context awareness, accuracy, and completeness.
Building a Global, Inclusive Benchmark
HealthBench was developed with significant global input. The 262 participating physicians, practicing in 60 countries, are fluent in 49 languages and trained across 26 medical specialties, from emergency medicine to psychiatry to vascular surgery.
Languages represented include English, Arabic, Mandarin Chinese, French, Hindi, Spanish, Swahili, Turkish, and many others — reflecting the global diversity of real-world healthcare settings.
Medical specialties represented include internal medicine, pediatrics, dermatology, obstetrics and gynecology, general surgery, neurology, radiology, anesthesiology, and public health — covering a wide range of clinical expertise critical for comprehensive health evaluations.
Model Performance on HealthBench
OpenAI used HealthBench to evaluate the progress of several of its frontier models. Key findings include:
Rapid improvement: Newer models like o3 outperformed earlier ones, including GPT-4o (August 2024), with a 28% improvement in benchmark scores.
Cost-efficiency gains: Smaller models such as GPT-4.1 nano achieved higher performance at a fraction of the cost, coming in 25 times cheaper than GPT-4o (August 2024).
Enhanced reliability: Recent models demonstrated better "worst-of-n" performance, meaning their least accurate outputs improved significantly (a minimal sketch of this metric follows this list).
Advances in reasoning models: Comparisons across the o3, o4-mini, and o1 models, representing low, medium, and high reasoning capabilities, revealed steady improvements in performance with increased test-time compute. OpenAI noted that reasoning-focused models are expected to continue driving progress on the performance-cost frontier in the coming months.
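One way to read the "worst-of-n" metric: rather than averaging a model's scores, you repeatedly draw n of its responses per example and keep only the lowest score, so the measure is dominated by the model's worst behavior. The sketch below is illustrative and assumes per-example score samples are already available; it is not OpenAI's exact procedure.

```python
import random

def worst_of_n(example_scores: list[list[float]], n: int, trials: int = 1000, seed: int = 0) -> float:
    """Estimate worst-of-n reliability from per-example score samples.

    `example_scores[i]` holds rubric scores for several independent responses
    to example i. For each trial we draw n scores per example, keep the worst,
    and average across examples; the result is averaged over trials.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        per_example_worst = [min(rng.sample(scores, n)) for scores in example_scores]
        totals.append(sum(per_example_worst) / len(per_example_worst))
    return sum(totals) / len(totals)

# Toy data: three examples, each answered five times by the same model.
scores = [
    [0.9, 0.8, 0.85, 0.95, 0.7],
    [0.6, 0.65, 0.55, 0.7, 0.5],
    [1.0, 0.9, 0.95, 0.85, 0.9],
]
print(round(worst_of_n(scores, n=3), 3))  # lower than the plain average, by construction
```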
New HealthBench Variants
To further refine evaluation, OpenAI introduced two specialized versions:
HealthBench Consensus: A highly validated subset of 3,671 examples where rubric criteria were carefully filtered based on strong physician agreement. Each criterion is included only if a majority of the physicians who reviewed it agreed it was appropriate for the example (a minimal filtering sketch appears at the end of this section). This variant was designed to establish a baseline of nearly zero errors, offering researchers and developers a trusted foundation for measuring AI model reliability at the highest standards.
HealthBench Hard: A challenging set of 1,000 examples specifically selected because even the most advanced frontier models struggle with them. These cases highlight complex, underspecified, or high-risk medical scenarios that stress-test model reasoning and decision-making. HealthBench Hard is intended to create clear goals for the next generation of AI models to surpass.
Early results show that newer models like o3 and GPT‑4.1 substantially reduce errors on both the Consensus and Hard sets, although significant room for improvement remains.
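As a rough illustration of the Consensus filtering idea, the sketch below keeps a rubric criterion only when a strict majority of the physicians who reviewed it voted that it applies. The function name and vote format are assumptions for illustration; the real selection pipeline described by OpenAI is more involved.

```python
def consensus_filter(criteria_votes: dict[str, list[bool]]) -> list[str]:
    """Keep only rubric criteria that a strict majority of reviewing physicians endorsed.

    `criteria_votes` maps each criterion's text to individual physician votes
    (True = appropriate for this example).
    """
    kept = []
    for criterion, votes in criteria_votes.items():
        if sum(votes) > len(votes) / 2:
            kept.append(criterion)
    return kept

# Toy example: two criteria reviewed by three physicians each.
votes = {
    "Advises an in-person exam before prescribing": [True, True, False],
    "Lists every possible differential diagnosis": [False, True, False],
}
print(consensus_filter(votes))  # only the first criterion survives
```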
Comparing AI to Human Physicians
To better understand model performance in a real-world context, OpenAI compared AI model responses on HealthBench examples with those written by practicing physicians. The goal was to establish human baselines and measure how AI models are progressing toward expert-level judgment.
In these evaluations:
Expert physicians were asked to write ideal responses to HealthBench prompts without using AI tools, relying only on their medical knowledge and optional internet research.
A second group of physicians was given access to AI-generated responses from OpenAI’s September 2024 models (o1-preview and GPT-4o) and asked to improve or refine those responses as needed.
Results showed that:
Model-assisted physicians outperformed standalone September 2024 model responses, suggesting that human oversight could meaningfully enhance earlier AI outputs.
Both the September 2024 models and model-assisted physician responses outperformed physicians working without AI assistance — highlighting the potential of collaboration between AI and human experts.
When the same experiments were repeated using responses from OpenAI’s latest April 2025 models (o3 and GPT‑4.1), physicians no longer improved on the AI-generated responses, indicating that these frontier models now perform at — or above — the level of unaided physician expertise in this evaluation setting.
OpenAI emphasized that while these results are promising, especially for specific tasks like written health information delivery, substantial work remains to ensure AI systems can match human performance across the full complexity of real clinical environments.
Trustworthiness and Evaluation Quality
OpenAI also validated HealthBench’s grading process by comparing model-generated evaluations with those performed by physicians. The company found strong alignment between AI and physician grading, indicating that HealthBench is a trustworthy measure of model quality in health-related contexts.
OpenAI has made the full evaluation suite and dataset publicly available on GitHub, inviting researchers and developers across the AI and healthcare communities to build on the benchmark and contribute to shared progress.
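For readers who want to experiment, here is a minimal loading sketch, assuming the released examples ship as a JSONL file in which each record holds a conversation and its rubric criteria. The file name and field names below are placeholders rather than the official schema, so they should be checked against the actual repository.

```python
import json

def load_examples(path: str):
    """Load HealthBench-style examples from a local JSONL file.

    Each line is assumed to be a JSON object with a conversation and a list of
    rubric criteria; the field names below are illustrative, not the official
    schema, so adjust them to match the released dataset.
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples.append(
                {
                    "conversation": record.get("prompt", []),   # list of chat turns
                    "rubric": record.get("rubrics", []),        # physician-written criteria
                }
            )
    return examples

# Usage (hypothetical file name):
# examples = load_examples("healthbench_eval.jsonl")
# print(len(examples), "examples loaded")
```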
What This Means
The launch of HealthBench marks a significant step toward responsibly integrating AI into healthcare. By focusing on realistic scenarios, diverse medical contexts, and expert judgment, OpenAI is setting a higher bar for model safety, reliability, and utility in high-stakes environments.
HealthBench also highlights the growing sophistication of AI models, which are beginning to match — and in some cases exceed — human expert performance in controlled evaluations. However, OpenAI emphasizes that substantial challenges remain, particularly around worst-case reliability and the ability to seek out missing context in underspecified situations.
HealthBench not only raises the standards for evaluating AI in healthcare, but also marks a broader shift toward holding AI systems accountable for real-world impact — a crucial step as these technologies move closer to everyday clinical use.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.