| Dataset(s) | SOTA Method | Metric | Best Result | Entries | Last Updated |
|---|---|---|---|---|---|
| ARC-c, ARC-e, WinoGrande, BoolQ, HellaSwag, OpenBookQA, PIQA, MMLU standard (test val) | | Average Accuracy | 0.7361 | 88 | 4d ago |
| LM-EVAL (average of HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande), zero-shot, latest | | Average Accuracy | 76 | 30 | 4d ago |
| ARC-e, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SocialIQA, SciQ, SWAG | JREG | ARC-e Accuracy | 77.9 | 12 | 4d ago |
| Downstream Tasks, zero-shot (ARC-c, ARC-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, WinoGrande) | RS | ARC-c | 49.32 | 6 | 4d ago |
| Downstream Evaluation Suite (ARC-e, PIQA, HellaSwag, OpenBookQA, WinoGrande, MMLU, BoolQ) | BHyT | ARC-e | 53.83 | 4 | 4d ago |
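The "Average Accuracy" rows above are unweighted (macro) means of per-task accuracies. As a minimal sketch, assuming the EleutherAI lm-evaluation-harness is the tool behind the "LM-EVAL" suite (the `simple_evaluate` call below is its v0.4 interface, the metric key `"acc,none"` is the v0.4 convention and may differ across versions, and the model checkpoint is a placeholder), the zero-shot average in the second row could be reproduced like this:

```python
# Sketch: zero-shot macro-average accuracy over the LM-EVAL suite listed
# above, assuming EleutherAI's lm-evaluation-harness (v0.4 API).
import lm_eval

TASKS = ["hellaswag", "piqa", "arc_easy", "arc_challenge", "winogrande"]

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder checkpoint
    tasks=TASKS,
    num_fewshot=0,                                   # zero-shot, as in the table
)

# Unweighted (macro) mean of per-task accuracy; the harness reports
# fractions in [0, 1], so scale by 100 to match the table's "76".
accs = [results["results"][t]["acc,none"] for t in TASKS]
print(f"Average Accuracy: {100 * sum(accs) / len(accs):.2f}")
```

Note that averaging this way weights each task equally regardless of its test-set size, which is the usual convention for these suite-level leaderboard numbers.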