| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| General Language Evaluation Suite AE, AC, SciQ, MMLU, MMLU-P, HS, OBQA, PIQA, RACE, WG, CSQA, AGI (test) | ROOT | AE Score 69.95 | 27 | 14d ago | |
| MMLU, GSM8k, HellaSwag, WinoGrande | | MMLU Accuracy 72.98 | 17 | 3mo ago | |
| LM Evaluation Harness | SEFT | Accuracy 60.35 | 16 | 1mo ago | |
| Open LLM Leaderboard | | ARC 70.22 | 14 | 3mo ago | |
| TinyStories | go-mHC | Grammar 6.63 | 5 | 2mo ago | |
| Eight benchmark LLM tasks | Heterogeneous Digital-AIMC framework | Throughput (Tokens/s) 49,781.23 | 5 | 3mo ago | |
| Bolmo 1B evaluation suite | BLT 1B | Overall Average Score 58.5 | 5 | 3mo ago | |
| ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande | BOFT | ARC Accuracy 34.64 | 4 | 3mo ago | |