| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| O2M Benign Clinical Queries | H-R Demon | RR 30.56 | 18 | 6d ago |
| Aggregate (IFEval, GPQA, LCB, Arena-Hard, CW, MT-Bench, WildBench) | SPARD | Average Score 63.69 | 14 | 9d ago |
| Performance Bench Aggregate | DeepSeek-R1-Distill-Qwen-32B (Reasoning) | Average Score 82.49 | 9 | 1mo ago |