| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| Weather | MemPrompt | Align 83.22 | 40 | 4d ago | |
| IFEval strict prompt | Nemotron Cascade-8B | pass@1 90.2 | 16 | 4d ago | |
| UltraFeedback (test) | MetaAligner-13B | IF Score 68.5 | 11 | 4d ago | |
| IFEval | Qwen2.5-VL-72B | IFEval Score 86.3 | 10 | 4d ago | |
| IFBench | Nemotron-Cascade 14B-Thinking | pass@1 41.7 | 7 | 4d ago | |
| ArenaHard | | pass@1 95.7 | 7 | 4d ago | |
| PDDLLM v1 (test) | | Planning Success Rate 100 | 6 | 4d ago | |
| Berlin2-10 real (test) | GVINS | MAE 1.66 | 5 | 4d ago | |
| Berlin2 real (test) | SDP | MAE 1.29 | 5 | 4d ago | |
| Berlin1-10 real (test) | SDP | MAE 4.53 | 5 | 4d ago | |
| Berlin real 1 (test) | SDP | MAE 2.64 | 5 | 4d ago | |
| Arena-Hard | Qwen2-72B-Instruct | Score 48.1 | 5 | 4d ago | |
| MixEval | Qwen2-72B-Instruct | Score 86.7 | 5 | 4d ago | |
| MT-Bench | Qwen2-72B-Instruct | MT-Bench Score 9.12 | 5 | 4d ago | |
| AlignBench v1 (test) | Qwen2-7B | Score 7.21 | 5 | 4d ago | |
| MT-Bench v1 (test) | Qwen2-7B | MT-Bench Score 8.41 | 5 | 4d ago | |
| Arena-Hard | SnapMLA | Hard Prompt Gemini Score 70.4 | 4 | 4d ago | |
| AlignBench | Qwen2-72B-Instruct | Score 8.27 | 4 | 4d ago | |
| MixEval v1 (test) | Qwen2-7B | Accuracy 76.5 | 4 | 4d ago | |