| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Commonsense Reasoning | StrategyQA | Accuracy | 90.4 | 125 |
| Question Answering | StrategyQA | Accuracy | 94.4 | 114 |
| Commonsense Reasoning | StrategyQA (test) | Accuracy | 83.49 | 81 |
| Logical Reasoning | StrategyQA | Accuracy | 89 | 58 |
| Question Answering | StrategyQA | EM | 80.1 | 35 |
| Multi-hop Reasoning | StrategyQA | Accuracy | 95.6 | 32 |
| Reasoning | StrategyQA (test) | Factuality Acc | 100 | 28 |
| Question Answering | StrategyQA (test) | Task Accuracy | 83 | 28 |
| Multi-hop Question Answering | StrategyQA (test) | Accuracy | 77.12 | 26 |
| Calibration | StrategyQA | ECE | 0.285 | 24 |
| Question Answering | STRATEGYQA | Accuracy | 61.8 | 24 |
| Knowledge-intensive QA | StrategyQA | ACC | 66.7 | 24 |
| Question Answering | StrategyQA | EM | 89.34 | 21 |
| Multi-hop QA | StrategyQA (SQA) | Cover-EM | 76.95 | 20 |
| Strategy-based Question Answering | StrategyQA | Verifiability | 69.11 | 16 |
| Multiple Choice Classification | StrategyQA | Accuracy | 83.4 | 16 |
| Question Answering | StrategyQA | Accuracy | 84 | 14 |
| Strategic Question Answering | StrategyQA | Reusability Score | 51.57 | 12 |
| Question Answering | StrategyQA | Prefilling Speedup Ratio | 3.39 | 12 |
| Follow-up Questioning Consistency | StrategyQA (unseen) | Average Success Count (M.) | 42.65 | 12 |
| Reasoning | StrategyQA | Accuracy | 83.5 | 10 |
| Retrieval-Augmented Generation Evaluation | StrategyQA 100-query benchmark | Mean Score | 69.8 | 10 |
| Question Answering | StrategyQA | Precision | 65.9 | 9 |
| Question Answering | StrategyQA | ECE | 0.217 | 8 |
| Question Answering | StrategyQA | Factuality | 59.6 | 8 |