| Creative Writing Evaluation Prompts | Min-p | Average Judge Score 8.12 | | 108 | 3d ago |
| AlpacaEval 2.0 | LLAMA-2-CHAT | Win Rate 648 | | 43 | 3d ago |
| CNN DailyMail | ConfAdapt | ROUGE-L 24.3 | | 40 | 3d ago |
| TruthfulQA Without Rejected Samples open-ended (full) | CoCoASIG | Truthfulness 74.67 | | 39 | 3d ago |
| TruthfulQA With All Samples open-ended (full) | DoLa | Truthfulness 82.75 | | 39 | 3d ago |
| Wikitext-103 (test) | DITTO | MAUVE 0.96 | | 26 | 3d ago |
| AlpacaEval 1.0 | LLAMA-2-CHAT | Win Rate 7,904 | | 23 | 3d ago |
| TruthfulQA Open-ended | ITI | True Score 99.6 | | 16 | 3d ago |
| Arena-Hard | AR-MAP | Score 84.6 | | 14 | 3d ago |
| NQ entity-swapped (test) | HICD | Exact Match 73.73 | | 12 | 3d ago |
| XSum 1,000 samples (test) | DoLa | ROUGE-L 23.11 | | 12 | 3d ago |
| TruthfulQA open-ended | RAb | BLEU 51.2 | | 10 | 3d ago |
| HumanEval+ | SGR | FDR 0.44 | | 9 | 3d ago |
| Zebra-Logic | SGR | FDR 0.75 | | 9 | 3d ago |
| MATH L5 | SGR | FDR 0.35 | | 9 | 3d ago |
| MATH500 | SGR | FDR 2.5 | | 9 | 3d ago |
| MMLU-Redux | | FDR (%) 4.22 | | 9 | 3d ago |
| COCO 2014 (val) | LogicCheckGPT | Accuracy 8.58 | | 8 | 3d ago |
| Open-ended generation tasks (Human Evaluation) | STA | Quality Score 4.4 | | 7 | 3d ago |
| TruthfulQA double info 1.0 (test) | TACS-T | True Score 68.4 | | 7 | 3d ago |
| TruthfulQA single info 1.0 (test) | TACS-T | Truthfulness Score 66.6 | | 7 | 3d ago |
| C4 RealNews | | Perplexity 3.4 | | 4 | 2d ago |
| NQ-Swap | | EM 44.91 | | 4 | 3d ago |
| NQ | CoCoASIG | Exact Match (EM) 0.512 | | 4 | 3d ago |
| RealToxicityPrompts Non-toxic prompts (test) | PaLM 2 (L) | Toxicity Probability 7.38 | | 4 | 3d ago |