| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| MuSiQue | GPT-4o-0806 | EM 53 | 41 | 4d ago | |
| StrategyQA | OpenMath2-Llama3.1-70B* | Accuracy 95.6 | 32 | 2d ago | |
| 2WikiMQA IRCoT 500 samples (test) | ActiShade | ACC 52.8 | 27 | 4d ago | |
| HotpotQA IRCoT (500 samples) (test) | ActiShade | ACC 54.6 | 27 | 4d ago | |
| MuSiQue IRCoT 500 samples (test) | ActiShade | ACC 25.59 | 27 | 4d ago | |
| 2WikiMHQA | CoT-UQ | AUROC 0.7002 | 26 | 4d ago | |
| HotpotQA | CoT-UQ | AUROC 67.19 | 26 | 4d ago | |
| CommaQA-E compositional | ChatGPT (SKiC) | Exact Match 80.8 | 15 | 2d ago | |
| CommaQA-E (test) | ChatGPT (SKiC) | Exact Match 70 | 15 | 2d ago | |
| MultiHopRAG | Qwen2.5-OpAmp-72B | EM 89.6 | 11 | 4d ago | |
| MuSR | | Accuracy 43.12 | 10 | 4d ago | |
| LongBench MuSiQue and WikiMultiHopQA | MGRS | F1 Score 69.9 | 7 | 4d ago | |
| Multi-hop reasoning tasks T2 L ≈ 9 steps | ITR | API Success Rate 79 | 4 | 4d ago | |
| MuSiQue (test) | ETGPO | Mean Accuracy 77.3 | 4 | 4d ago | |
| MuSiQue, HotpotQA, 2Wiki, and Bamboogle 3-hop and above | SIGHT | EM 31.98 | 3 | 4d ago | |
| MuSiQue (val test) | ETGPO | Token Usage (Optimization Phase) 2,849 | 3 | 4d ago | |