| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Pooled tasks Table 5 Llama-3.1 3.3 (various) | Llama-3.3 70B Instruct | Pooled Accuracy Estimate (γ̂) | 57.15 | 21 | 4d ago |
| Open LLM Leaderboard v2 (test) | | BBH | 60.84 | 20 | 4d ago |
| AdaptEval | SCALENET (Layer-wise) | ROUGE-Lsum | 0.2733 | 14 | 4d ago |
| Downstream Tasks Evaluation Suite (Math, Code, Law, Know., Reason., MMLU) | | Math Accuracy | 4.92 | 9 | 4d ago |
| Quality, Factuality, and Safety Evaluation Suite (test) | Self-Improving Pretraining | Generation Quality Score | 86.3 | 7 | 4d ago |
| NLP Evaluation Suite (WG, PIQA, BoolQ, ARC-C, ARC-E, OBQA, HS, SciQ, LM, RTE) | QK sharing | WG | 60.14 | 6 | 4d ago |
| 1.3B LLM Leaderboard | QA+C4-85B | ARC | 36.4 | 5 | 4d ago |