| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Creative Writing Evaluation Prompts | Min-p | Average Judge Score | 8.12 | 108 | 1mo ago |
| AlpacaEval 2.0 | LLAMA-2-CHAT | Win Rate | 648 | 43 | 1mo ago |
| CNN DailyMail | ConfAdapt | ROUGE-L | 24.3 | 40 | 1mo ago |
| TruthfulQA Without Rejected Samples open-ended (full) | CoCoASIG | Truthfulness | 74.67 | 39 | 1mo ago |
| TruthfulQA With All Samples open-ended (full) | DoLa | Truthfulness | 82.75 | 39 | 1mo ago |
| WildBench | | WildBench | 0.479 | 26 | 1mo ago |
| Wikitext-103 (test) | DITTO | MAUVE | 0.96 | 26 | 1mo ago |
| AlpacaEval 1.0 | LLAMA-2-CHAT | Win Rate | 7,904 | 23 | 1mo ago |
| SciQ | GrACE | ECE | 5.21 | 21 | 10d ago |
| TriviaQA | ActCab | ECE | 5.18 | 21 | 10d ago |
| HelloBench (HB) | | HB-A Score | 84 | 17 | 4d ago |
| WildBench (test) | Qwen3 | WildBench Score | 64.4 | 17 | 1mo ago |
| TruthfulQA Open-ended | ITI | True Score | 99.6 | 16 | 1mo ago |
| Arena-Hard | AR-MAP | Score | 84.6 | 14 | 1mo ago |
| NQ entity-swapped (test) | HICD | Exact Match | 73.73 | 12 | 1mo ago |
| XSum 1,000 samples (test) | DoLa | ROUGE-L | 23.11 | 12 | 1mo ago |
| LLaVA-Bench In-the-Wild | PM | Ref Score | 62.46 | 11 | 8d ago |
| LLaVA-Bench COCO | PM | Reference Score | 85.76 | 11 | 8d ago |
| TruthfulQA open-ended | RAb | BLEU | 51.2 | 10 | 1mo ago |
| HumanEval+ | SGR | FDR | 0.44 | 9 | 1mo ago |
| Zebra-Logic | SGR | FDR | 0.75 | 9 | 1mo ago |
| MATH L5 | SGR | FDR | 0.35 | 9 | 1mo ago |
| MATH500 | SGR | FDR | 2.5 | 9 | 1mo ago |
| MMLU-Redux | | FDR (%) | 4.22 | 9 | 1mo ago |
| Finance | TF-TTCL | ROUGE-Lsum | 29.19 | 8 | 2d ago |