| Dataset Name | SOTA Method | Metric | Value | Trend | Last Updated |
|---|---|---|---|---|---|
| LongBench v2 | Qwen2.5-14B-1M-LongRLVR | Overall Accuracy | 46.5 | 41 | 3mo ago |
| Vad-Reasoning-Plus | Vad-R1-Plus | MCQ Score | 96.4 | 27 | 3mo ago |
| MMLU professional medicine | GPT-4o | Accuracy | 94 | 21 | 3mo ago |
| RefMem-Bench | REMIND | Accuracy | 59.4 | 14 | 1d ago |
| C3 | GRASP | Accuracy | 44.6 | 8 | 3mo ago |
| Weather Reasoning MCQA-L | TimeClaw | Accuracy | 66.4 | 7 | 22d ago |
| Weather Reasoning MCQA-S | TimeClaw | Accuracy | 61.6 | 7 | 22d ago |
| M3GIA | | Accuracy | 59.8 | 5 | 3mo ago |
| MMBench (test dev) | | Accuracy | 86.4 | 5 | 3mo ago |
| SEED-IMG | | Accuracy | 76.5 | 4 | 3mo ago |