| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| IFEval | Qwen3-235B-A22B-Instruct | Accuracy (0-100) | 94 | 292 | 2d ago |
| AlpacaEval 2.0 | STEPS | LC Win Rate | 3,526 | 281 | 3d ago |
| MT-Bench | GPT-4-1106-preview | MT-Bench Score | 9.32 | 189 | 3d ago |
| AlpacaEval | BFPO | Win Rate | 97.2 | 125 | 3d ago |
| DollyEval | IOA | Score | 42.39 | 106 | 2d ago |
| UnNI | MINILLM | Rouge-L | 40.2 | 94 | 3d ago |
| S-NI | Adversarial Moment-Matching Distillation | Rouge-L | 38.7 | 94 | 3d ago |
| Natural Instructions (test) | CoLoRA | Rouge-L | 97.9 | 90 | 3d ago |
| ALFWorld | M2CL | Accuracy | 89.3 | 82 | 4d ago |
| AdvancedIF | BRAID | Accuracy | 71 | 81 | 4d ago |
| VicunaEval | IOA | VicunaEval Score | 40.75 | 80 | 2d ago |
| Arena Hard | DeepSeek-V3 | Win Rate | 94.9 | 77 | 3d ago |
| VicunaEval | Goal Prioritization | Rouge-L | 35 | 72 | 3d ago |
| AlpacaEval 2.0 (test) | Muon-8L/AdamW-32 | LC Win Rate (%) | 59.93 | 71 | 3d ago |
| IFBench | Olmo 3.1 Think 32B | Pass@1 (Strict) | 68.1 | 68 | 3d ago |
| Alpaca | EAGLE3 | Speedup (x) | 4.13 | 63 | 3d ago |
| MT-Bench zh | Qwen2.5-14B-SFT-TaP | Score | 6.83 | 60 | 3d ago |
| AlignBench | Qwen2.5-14B-SFT-TaP | Reasoning Score | 7.42 | 60 | 3d ago |
| SelfInst | Adversarial Moment-Matching Distillation | Rouge-L | 21.7 | 57 | 3d ago |
| ReasonIF synthesized v1.0 | FLEx | IFS | 96.3 | 55 | 3d ago |
| MT-bench v1.0 (test) | CaR | MT-Bench Score | 61.2 | 52 | 3d ago |
| SelfInst | | R-L Score | 23.4 | 50 | 3d ago |
| IFEval (test) | Qwen3-Omni-Instruct | IFEval Score | 81.17 | 45 | 3d ago |
| AlpacaFarm (test) | | Reward Score | 387.196 | 40 | 3d ago |
| BBH | Baseline Step 0 | Accuracy | 67.1 | 40 | 4d ago |