HealthBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Medical Reasoning | HealthBench Hard | Accuracy 40.74 | 41 |
| Health-related dialogue and decision-making | HealthBench Main | Average Score 46.38 | 22 |
| Medical Question Answering | HealthBench Hard | Accuracy 39.02 | 19 |
| Long-horizon agentic task | Healthbench Hard | Performance 28.06 | 18 |
| Medical Question Answering | HealthBench Overall | Overall Score 60.1 | 16 |
| Medical | HealthBench-500 | Score 43.6 | 15 |
| LLM Evaluation | HealthBench (test) | HealthBench Score (%) 62.6 | 11 |
| Health-domain instruction following | HealthBench 1K-example (eval) | Score 78.8 | 8 |
| Evaluation Criteria Generation | HealthBench | Coverage 90 | 6 |
| Evaluation Criterion Generation | HealthBench | Specificity 10 | 6 |
| Model Selection Evaluation | HealthBench | Actual (per type) 90.5 | 5 |
| Medical Knowledge | HealthBench | Score 47.45 | 5 |
| Question Answering | HealthBench 500-conversation (out-of-domain) | HealthBench Score 0.649 | 5 |
| Medical Reasoning | Healthbench Hard | Pass Rate 40.91 | 4 |
| Medical Question Answering | HealthBench (All Set) | Overall Score 58.56 | 4 |
| Medical Question Answering | HealthBench normal | Pass@1 65.2 | 4 |
| Hallucination Detection | HealthBench (test) | AUC 96.48 | 4 |
| Medical Response Refinement | HealthBench 254 medical queries | Base Score 59 | 4 |
| Hallucination Suppression | HealthBench Hallu | Refuted Rate 2.37 | 4 |
| Medical Reasoning | HealthBench | HealthBench Score 66.2 | 4 |
| Clinical Intent Alignment | HealthBench | CIA 60.12 | 3 |
| Clinical Question Answering | HealthBench Hard Set | Overall Score 0.3861 | 2 |