| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Medical Question Answering | HealthBench Overall | Overall Score: 60.1 | 16 | |
| Medical Question Answering | HealthBench Hard | Score: 34.7 | 16 | |
| Model Selection Evaluation | HealthBench | Actual (per type): 90.5 | 5 | |
| Medical Knowledge | HealthBench | Score: 47.45 | 5 | |
| Question Answering | HealthBench 500-conversation (out-of-domain) | HealthBench Score: 0.649 | 5 | |
| Medical Question Answering | HealthBench normal | Pass@1: 65.2 | 4 | |
| Hallucination Detection | HealthBench (test) | AUC: 96.48 | 4 | |
| Medical Response Refinement | HealthBench 254 medical queries | Base Score: 59 | 4 | |
| Hallucination Suppression | HealthBench Hallu | Refuted Rate: 2.37 | 4 | |
| Medical Reasoning | HealthBench | HealthBench Score: 66.2 | 4 | |
| Clinical Intent Alignment | HealthBench | CIA: 60.12 | 3 | |