| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| DS-Avg 9 downstream tasks suite | | ARC-c Accuracy 63.6 | 39 | 19d ago | |
| LM Evaluation Harness MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, CommonsenseQA | Uni-DPO | MMLU 70.5 | 19 | 8d ago | |
| LM Evaluation Harness MMLU, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, GSM8K standard | | MMLU 65.8 | 16 | 3mo ago | |
| OpenLLM Leaderboard v1 (test) | SelectiveDPO | MMLU (5-shot) 63.95 | 14 | 3mo ago | |
| Multiple Downstream Datasets (LAMBADA, ARC, WinoGrande, PIQA, HellaSwag, SciQ, RACE) | | LAMBADA (OpenAI) 45 | 12 | 3mo ago | |
| Downstream Tasks Aggregate | MeSH | Accuracy 60.49 | 11 | 1mo ago | |
| 10 Downstream Tasks | MeSH | Average Accuracy 52.79 | 9 | 1mo ago | |
| 15 Downstream Tasks summary | MPP-B | Median EG2 | 7 | 3mo ago | |
| Downstream Suite (BoolQ, PIQA, HS, WG, ARC-e, ARC-c, OBQA) Zero-shot | LLaMA2 | Accuracy (BoolQ) 77.7 | 5 | 15d ago | |
| ARC Challenge, BoolQ, OpenbookQA, GSM8K (Strict), MMLU | | ARC Challenge Accuracy 66.72 | 5 | 20d ago | |
| Downstream | FairyFuse | Throughput (tokens/s) 32.43 | 4 | 1mo ago | |
| MNLI, SCIQ, LAMBADA, HellaSwag, ARC, MMLU | FusedKV | MNLI Acc 0.3852 | 2 | 3mo ago | |