| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| ScienceQA | SceneAlign | Accuracy: 92.72 | 17 | 4d ago | |
| Bamboogle auto-eval (test) | Self-improvement, 2nd gen | Mean Accuracy: 76.1 | 10 | 4d ago | |
| GSM8K (test) | SLR | Pass@1: 32.2 | 9 | 4d ago | |
| CLEVR-Puzzle (test) | NeSyCoCo | Accuracy: 95 | 7 | 4d ago | |
| BamTwoogle (test) | ReST meets ReAct | Accuracy: 74 | 4 | 4d ago | |
| Bamboogle (test) | ReST meets ReAct | Accuracy: 74.4 | 4 | 4d ago | |
| multi-step reasoning tasks | EmbodiedAct | Average Score: 61.6 | 3 | 4d ago | |