| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| BBH (BIG-Bench Hard) | PREFPO | Average BBH Score: 87.5 | 20 | 20d ago |
| TruthfulQA | | Accuracy: 40.15 | 12 | 2mo ago |
| Language Reasoning Average | | Accuracy: 73.25 | 12 | 2mo ago |
| DeepAccident-CCoT (val) | C-CoT | Accuracy: 84.2 | 6 | 22d ago |
| LangR unseen tasks (test) | SGE | Pass@1: 60.8 | 3 | 3mo ago |