| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reasoning | MuSR 0-shot | Reasoning Score (0-shot) | 48.82 | 46 |
| Multistep Reasoning | MUSR | Accuracy | 61.67 | 31 |
| Math & Logic | MUSR | MUSR Performance | 42.12 | 24 |
| Reasoning | MuSR | Accuracy | 71.89 | 20 |
| Reasoning | MuSR (test) | Accuracy | 73.9 | 14 |
| Multistep Soft Reasoning | MUSR | Accuracy (%) | 43.1 | 12 |
| Multi-hop Reasoning | MuSR | Accuracy | 43.12 | 10 |
| Multistep Soft Reasoning | MuSR | Accuracy | 69 | 9 |
| Reasoning | MuSR | MuSR Score | 37.14 | 9 |
| Self-doubt detection | MuSR 90-trace | AUROC (Self-doubt) | 83.66 | 7 |
| Adding Mistake | MuSR | AOC | 0.731 | 7 |
| Truncated CoT Answering | MuSR | AOC | 33.6 | 7 |
| Multistep Reasoning | MUSR-fr | Average Score | 33.79 | 6 |
| Multistep Reasoning | MuSR | Accuracy | 41.5 | 3 |
| Multi-step reasoning and knowledge retrieval | MuSR (test) | Accuracy | 0.7867 | 1 |