| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Clinical Task Execution | MedAgentBench OOD v2 | Accuracy: 87.1 | 35 |
| Clinical Task Execution | MedAgentBench v2 (test) | Accuracy: 76.9 | 35 |
| Clinical Task Execution | MedAgentBench v2 (val) | Accuracy: 77 | 35 |
| Clinical Task Execution | MedAgentBench OOD | Accuracy: 80.6 | 35 |
| Clinical Task Execution | MedAgentBench (test) | Accuracy: 88.8 | 35 |
| Clinical Task Execution | MedAgentBench (val) | Accuracy: 86.2 | 35 |
| Medical Agent Task Execution | MedAgentBench | Success Rate: 79.3 | 24 |
| Multi-agent recommendation | MedAgentBench | Top-1 Acc: 100 | 4 |
| Single-agent tool selection | MedAgentBench | Top-1 Accuracy: 99 | 4 |
| Medical Agentic Reasoning | MedAgentBench | Accuracy: 87 | 3 |