| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| MIMIC-IV diagnostic evaluation set (test) | GLEAN (N=3) | Accuracy | 78.33 | 54 | 1mo ago |
| agent-CMB | Medical-CoT* | Rounds | 18.34 | 25 | 1mo ago |
| MedQA agent | MedKGI | Rounds | 9.11 | 25 | 1mo ago |
| MedEinst Robust 1.0 | ECR-Agent (Qwen3-32B) | Robust Accuracy | 24.21 | 18 | 1mo ago |
| MedEinst Baseline 1.0 | ECR-Agent (Qwen3-32B) | Baseline Accuracy | 69.49 | 18 | 1mo ago |
| COVID19-CT | SH-PEFT | F1 Score | 83 | 16 | 1mo ago |
| MAU (test) | UMed-LVLM | DL Score | 53 | 13 | 1mo ago |
| Step-CoT (test) | Ours (Teacher) | Accuracy | 78.3 | 10 | 1mo ago |
| CXR14 (external) | | Precision for Edema | 71.26 | 10 | 1mo ago |
| DiagnosisArena (test) | GoS | Match (LLM-as-a-Judge) | 31.88 | 9 | 25d ago |
| MediQ (test) | | Average Outcome Reward | 74.67 | 9 | 1mo ago |
| NEJM | DDO | Rounds | 17.91 | 9 | 1mo ago |
| DeepLesion | MedRoute | Accuracy | 45.52 | 8 | 9d ago |
| PMC-VQA | MedRoute | Accuracy | 59.28 | 8 | 9d ago |
| MD DX | GoT | Worst Case Interaction Length | 10.5 | 8 | 1mo ago |
| MD DX weighted (test) | | Worst-case Weighted Payoff | 126.4 | 8 | 1mo ago |
| VeriSim Noisy Levels 1-3 | Qwen-2.5-72B | Top-1 Accuracy | 69.2 | 7 | 4d ago |
| VeriSim Clean, Level 0 | Qwen-2.5-72B | Top-1 Accuracy | 84.5 | 7 | 4d ago |
| MedQA | MedRoute | Accuracy | 88.76 | 6 | 9d ago |
| DiagnosisArena | | Pass@1 | 45.57 | 4 | 1mo ago |