| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Medical Reasoning | HealthBench Hard | Accuracy | 40.74 | 41 |
| Health-related dialogue and decision-making | HealthBench Main | Average Score | 46.38 | 22 |
| Medical Question Answering | HealthBench Hard | Accuracy | 39.02 | 19 |
| Long-horizon agentic task | HealthBench Hard | Performance | 28.06 | 18 |
| Medical Question Answering | HealthBench Overall | Overall Score | 60.1 | 16 |
| Medical | HealthBench-500 | Score | 43.6 | 15 |
| LLM Evaluation | HealthBench (test) | HealthBench Score (%) | 62.6 | 11 |
| Health-domain instruction following | HealthBench 1K-example (eval) | Score | 78.8 | 8 |
| Evaluation Criteria Generation | HealthBench | Coverage | 90 | 6 |
| Evaluation Criterion Generation | HealthBench | Specificity | 10 | 6 |
| Model Selection Evaluation | HealthBench | Actual (per type) | 90.5 | 5 |
| Medical Knowledge | HealthBench | Score | 47.45 | 5 |
| Question Answering | HealthBench 500-conversation (out-of-domain) | HealthBench Score | 0.649 | 5 |
| Medical Reasoning | HealthBench Hard | Pass Rate | 40.91 | 4 |
| Medical Question Answering | HealthBench (All Set) | Overall Score | 58.56 | 4 |
| Medical Question Answering | HealthBench normal | Pass@1 | 65.2 | 4 |
| Hallucination Detection | HealthBench (test) | AUC | 96.48 | 4 |
| Medical Response Refinement | HealthBench 254 medical queries | Base Score | 59 | 4 |
| Hallucination Suppression | HealthBench Hallu | Refuted Rate | 2.37 | 4 |
| Medical Reasoning | HealthBench | HealthBench Score | 66.2 | 4 |
| Clinical Intent Alignment | HealthBench | CIA | 60.12 | 3 |
| Clinical Question Answering | HealthBench Hard Set | Overall Score | 0.3861 | 2 |