| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Commonsense Reasoning | StrategyQA | Accuracy 95.7 | 208 |
| Performance Estimation | StrategyQA | MAE 0 | 197 |
| Question Answering | StrategyQA | Accuracy 94.4 | 123 |
| Commonsense Reasoning | StrategyQA (test) | Accuracy 83.49 | 119 |
| Question Answering | StrategyQA (test) | Task Accuracy 96.7 | 74 |
| Logical Reasoning | StrategyQA | Accuracy 89 | 58 |
| Reasoning | StrategyQA | Accuracy 83.5 | 52 |
| Multi-hop Reasoning | StrategyQA | Accuracy 95.6 | 50 |
| Question Answering | StrategyQA | Exact Match (EM) 85.59 | 35 |
| Question Answering | StrategyQA | EM 80.1 | 35 |
| Reasoning | StrategyQA (test) | Factuality Acc 100 | 28 |
| Reasoning Question Answering | StrategyQA | Accuracy 80 | 26 |
| Question Answering | StrategyQA | Accuracy 90.4 | 26 |
| Multi-hop Question Answering | StrategyQA (test) | Accuracy 77.12 | 26 |
| Commonsense Reasoning | StrategyQA | Accuracy (%) 76.13 | 24 |
| Calibration | StrategyQA | ECE 0.285 | 24 |
| Question Answering | STRATEGYQA | Accuracy 61.8 | 24 |
| Knowledge-intensive QA | StrategyQA | ACC 66.7 | 24 |
| Question Answering | StrategyQA | EM 89.34 | 21 |
| Commonsense Reasoning | StrategyQA | Accuracy 76.84 | 20 |
| Multi-hop QA | StrategyQA (SQA) | Cover-EM 76.95 | 20 |
| Question Answering | StrategyQA | Exact Match (EM) 82 | 16 |
| Strategy-based Question Answering | StrategyQA | Verifiability 69.11 | 16 |
| Multiple Choice Classification | StrategyQA | Accuracy 83.4 | 16 |
| Reasoning quality evaluation | STRATEGYQA | Somers' D 0.2735 | 15 |