| Dataset Name | SOTA Method | Metric | Trend | |
|---|---|---|---|---|
| Reasoning Evaluation Suite (GSM8K, MATH500, AIME24, HumanEval) (test) | SafeChain | GSM8K Accuracy: 94.7 | 36 | 1mo ago |
| Reasoning and Code Generation Suite (MATH, GSM8K, MBPP, TheoremQA, BBH) (test) | FlexSwitch | MATH Accuracy: 54.38 | 6 | 1mo ago |