| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Zero-shot NLP Evaluation Suite (WikiText2, BoolQ, PIQA, HellaSwag, WinoGrande, ARC, OBQA, MTQA) (test) | SimDiff (MSSD) | WikiText2 Perplexity | 7.43 | 27 | 1mo ago |
| Gauntlet 20 benchmarks (test) | Prior-based | Average Normalized Accuracy | 9.2 | 10 | 3mo ago |
| MMLU, ARC-C, PIQA, WinoG, GSM8K, HellaSwag, GPQA, RACE zero-shot | | Average Score | 60.94 | 9 | 3mo ago |
| DCLM Pro | PathMoE | WinoGrande | 57.93 | 2 | 2mo ago |