| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning (Coding Tools) | BB-Hard | Accuracy: 63.33 | 25 |
| Mathematical Reasoning (Coding Tools) | BB Easy | Accuracy: 95.12 | 25 |
| Abstention in Question Answering | BB Answer Unknown | Abstention F1: 97.9 | 10 |
| Agent-task matching | BB NonIID | Cumulative Alignment Cost: 410.04 | 4 |