| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Creative Writing Evaluation Prompts | Min-p | Average Judge Score | 8.12 | 108 | 3mo ago |
| TruthfulQA | FineSteer | BLEURT Score | 70.13 | 48 | 1mo ago |
| AlpacaEval 2.0 | LLAMA-2-CHAT | Win Rate | 648 | 43 | 3mo ago |
| CNN DailyMail | ConfAdapt | ROUGE-L | 24.3 | 40 | 3mo ago |
| TruthfulQA Without Rejected Samples open-ended (full) | CoCoASIG | Truthfulness | 74.67 | 39 | 3mo ago |
| TruthfulQA With All Samples open-ended (full) | DoLa | Truthfulness | 82.75 | 39 | 3mo ago |
| TriviaQA | ActCab | ECE | 5.18 | 37 | 1d ago |
| MLLMU-Bench (Retain Set) | PO | ROUGE-L | 53.1 | 30 | 1d ago |
| MLLMU-Bench (test) | | ROUGE-L | 34.5 | 30 | 1d ago |
| WildBench | | WildBench | 0.479 | 26 | 2mo ago |
| Wikitext-103 (test) | DITTO | MAUVE | 0.96 | 26 | 3mo ago |
| AlpacaEval 1.0 | LLAMA-2-CHAT | Win Rate | 7,904 | 23 | 3mo ago |
| LLaVA-Bench | LACING | GPT-4 Score | 84.3 | 21 | 5d ago |
| CARE-pro | GRPO | Score (Seen) | 19.75 | 21 | 8d ago |
| SciQ | GrACE | ECE | 5.21 | 21 | 1mo ago |
| Vicuna | | Skywork Reward V2 Score | 99.1 | 18 | 1mo ago |
| Dolly | Distillable | Skywork Reward V2 Score | 0.961 | 18 | 1mo ago |
| HelloBench (HB) | | HB-A Score | 84 | 17 | 1mo ago |
| WildBench (test) | Qwen3 | WildBench Score | 64.4 | 17 | 2mo ago |
| TruthfulQA Open-ended | ITI | True Score | 99.6 | 16 | 3mo ago |
| MM-Vet | | MM-Vet Score | 45.55 | 14 | 14d ago |
| LLaVA-Bench In-the-Wild | | Score | 109.3 | 14 | 14d ago |
| Arena-Hard | AR-MAP | Score | 84.6 | 14 | 3mo ago |
| NQ entity-swapped (test) | HICD | Exact Match | 73.73 | 12 | 3mo ago |
| XSum 1,000 samples (test) | DoLa | ROUGE-L | 23.11 | 12 | 3mo ago |