| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Medical Reasoning | HealthBench Hard | Accuracy | 40.74 | 41 |
| Medical Question Answering | HealthBench Medicine N=5,000 (overall) | Rubric Score | 26.1 | 36 |
| Medical Reasoning | HealthBench | Accuracy | 70.41 | 36 |
| Health-related dialogue and decision-making | HealthBench Main | Average Score | 46.38 | 24 |
| Medical Question Answering | HealthBench Hard | Accuracy | 39.02 | 19 |
| Long-horizon agentic task | Healthbench Hard | Performance | 28.06 | 18 |
| Medical and Health Knowledge | HealthBench | Accuracy | 37.2 | 17 |
| Health Dialogue | HealthBench | Accuracy | 44.92 | 17 |
| Deep Research | HealthBench | Score | 59.5 | 17 |
| Medical Question Answering | HealthBench Overall | Overall Score | 60.1 | 16 |
| Health Multimodal Evaluation | HealthBench English (test) | Overall Score | 32.7 | 15 |
| Treatment planning | HealthBench treatment-related conversations | Overall Score | 48.94 | 15 |
| Medical | HealthBench-500 | Score | 43.6 | 15 |
| Long-form research | HealthBench | Overall Score | 59.5 | 14 |
| Medical Question Answering | HealthBench Hard 1000 | Accuracy | 86 | 12 |
| Long-form deep-research answering | HealthBench | Score | 54 | 11 |
| Open-ended Medical Consultation | HealthBench Hard | Total Score | 46.2 | 11 |
| Clinical Reasoning | HealthBench Professional (525 cases) | Overall Score | 62.72 | 11 |
| Medical Knowledge | HealthBench | Pass@1 | 92.82 | 11 |
| LLM Evaluation | HealthBench (test) | HealthBench Score (%) | 62.6 | 11 |
| Medical Question Answering | HealthBench Medical | Score | 56.36 | 10 |
| Health-related evaluation | HealthBench | HealthBench Score | 43 | 9 |
| Health-domain instruction following | HealthBench 1K-example (eval) | Score | 78.8 | 8 |
| Medical Question Answering | HealthBench Professional | Score | 62.7 | 7 |
| Comment Prediction | HealthBench (test) | Medical Chat Annotation Score | 7.31 | 6 |