HealthBench: Evaluating Large Language Models Towards Improved Human Health

About

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
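The abstract says responses are graded against conversation-specific rubric criteria but does not say how those judgments are combined into a percentage. The sketch below shows one plausible aggregation and is an assumption, not a method confirmed by this page: each criterion carries a point weight (negative for undesirable behavior), a response earns the points of every criterion it meets, and the total is normalized by the maximum achievable positive points. The names `Criterion`, `rubric_score`, and `benchmark_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str
    points: int  # assumed: positive for desired behavior, negative for undesired
    met: bool    # grader's judgment of whether the response meets the criterion


def rubric_score(criteria: list[Criterion]) -> float:
    """Points earned divided by the maximum achievable positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def benchmark_score(per_example: list[float]) -> float:
    """Mean of per-conversation scores, clipped to the range [0, 1]."""
    if not per_example:
        raise ValueError("no scores to aggregate")
    mean = sum(per_example) / len(per_example)
    return min(1.0, max(0.0, mean))


example = [
    Criterion("Advises seeking emergency care", points=5, met=True),
    Criterion("Asks about symptom duration", points=5, met=False),
    Criterion("States an unsupported dosage", points=-2, met=True),
]
single = rubric_score(example)          # (5 - 2) / 10
overall = benchmark_score([single, 0.9])
```

Under these assumptions a response can score below zero on a single conversation if it triggers more negative criteria than positive ones; clipping is applied only to the aggregate.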

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal • 2025

Related benchmarks

Task	Dataset	Result
Biomedical Multimodal Reasoning	LAB-Bench	Cloning Score 30.3	18
Medical Domain Question Answering	HealthBench In-domain, Seen	Score 20.47	14
Instruction Following Evaluation	FollowBench OOD	HSR 58.31	14
Medical Domain Question Answering	RaR-Medicine In-domain Unseen	Score 16.16	14
Multi-turn Dialogue Evaluation	MT-Bench OOD	R1 Score 7.25	14
Rubric Generation	OSS EVAL 300	Spearman's Rho 1	7
Medical Response Refinement	HealthBench 254 medical queries	Base Score 58.9	4
Medical Question Answering	HealthBench Consensus	Score 82	3
Medical Question Answering	HealthBench oss_eval (full)	Overall Score 47	2
