HealthBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Medical Reasoning | HealthBench Hard | Accuracy 40.74 | 41 |
| Medical Question Answering | HealthBench Medicine N=5,000 (overall) | Rubric Score 26.1 | 36 |
| Medical Reasoning | HealthBench | Accuracy 70.41 | 36 |
| Health-related dialogue and decision-making | HealthBench Main | Average Score 46.38 | 24 |
| Medical Question Answering | HealthBench Hard | Accuracy 39.02 | 19 |
| Long-horizon agentic task | Healthbench Hard | Performance 28.06 | 18 |
| Medical and Health Knowledge | HealthBench | Accuracy 37.2 | 17 |
| Health Dialogue | HealthBench | Accuracy 44.92 | 17 |
| Deep Research | HealthBench | Score 59.5 | 17 |
| Medical Question Answering | HealthBench Overall | Overall Score 60.1 | 16 |
| Health Multimodal Evaluation | HealthBench English (test) | Overall Score 32.7 | 15 |
| Treatment planning | HealthBench treatment-related conversations | Overall Score 48.94 | 15 |
| Medical | HealthBench-500 | Score 43.6 | 15 |
| Long-form research | HealthBench | Overall Score 59.5 | 14 |
| Medical Question Answering | HealthBench Hard 1000 | Accuracy 86 | 12 |
| Long-form deep-research answering | HealthBench | Score 54 | 11 |
| Open-ended Medical Consultation | HealthBench Hard | Total Score 46.2 | 11 |
| Clinical Reasoning | HealthBench Professional (525 cases) | Overall Score 62.72 | 11 |
| Medical Knowledge | HealthBench | Pass@1 92.82 | 11 |
| LLM Evaluation | HealthBench (test) | HealthBench Score (%) 62.6 | 11 |
| Medical Question Answering | HealthBench Medical | Score 56.36 | 10 |
| Health-related evaluation | HealthBench | HealthBench Score 43 | 9 |
| Health-domain instruction following | HealthBench 1K-example (eval) | Score 78.8 | 8 |
| Medical Question Answering | HealthBench Professional | Score 62.7 | 7 |
| Comment Prediction | HealthBench (test) | Medical Chat Annotation Score 7.31 | 6 |
Showing 25 of 44 rows