| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| DROP | GPT-4o + QuaSAR | Accuracy | 88.9 | 33 | 1mo ago |
| GSM8K, MATH, SVAMP, ASDiv, MAWPS, CARP | | Average Score | 82.5 | 29 | 1mo ago |
| HELM | | Synth. Reason. (AS) | 54 | 16 | 1mo ago |
| BoolQ, ARC-e, ARC-c, WinoGrande (WinoG), HellaSwag (HelloS) | MoEITS | BoolQ Accuracy | 75.2 | 4 | 4d ago |
| Big-GSM | TCR | Accuracy | 54.4 | 4 | 1mo ago |