| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultiHopQA | EM 82.1 | 559 | |
| Multi-hop Question Answering | 2WikiMultiHopQA (test) | EM 73.9 | 226 | |
| Question Answering | 2WikiMultihopQA (test) | F1 78.9 | 113 | |
| Question Answering | 2WikiMultihopQA | EM 47.7 | 107 | |
| Multi-hop Question Answering | 2WikiMultiHopQA Out-Of-Distribution (OOD) | Accuracy 74.2 | 72 | |
| Open-domain Question Answering | 2WikiMultiHopQA in-domain | F1 Score 62.6 | 57 | |
| Long-context Question Answering | 2WikiMultiHopQA (Out-Of-Distribution) | Accuracy 63.9 | 54 | |
| Question Answering | 2WikiMultiHopQA | Exact Match 43 | 50 | |
| Knowledge Retrieval | 2WikiMultihopQA | F1 Score 56.46 | 45 | |
| Multi-hop Question Answering | 2WikiMultiHopQA | String Accuracy 70.3 | 44 | |
| Multi-hop Question Answering | 2WikiMultiHopQA (val) | Exact Match (EM) 69.3 | 44 | |
| Multi-hop QA Retrieval | 2WikiMultihopQA (test) | R@5 97.2 | 33 | |
| Question Answering | 2WikiMultihopQA LongBench | F1 Score 59.73 | 32 | |
| Multi-hop Question Answering | 2WikiMultiHopQA | Token F1 Score 65.9 | 30 | |
| Reasoning | 2WikiMultiHopQA (OOD) | Degeneration Count 0 | 27 | |
| Question Answering | 2WikiMultiHopQA (OOD) | Exact Match (EM) 2.21 | 27 | |
| Question Answering | 2WikiMultihopQA | EM 36.8 | 27 | |
| Question Answering | 2WikiMultihopQA | Accuracy 62.5 | 25 | |
| Multi-hop Question Answering | 2WikiMultiHopQA online Google Search API (test val) | Exact Match 63.5 | 24 | |
| Multi-hop Question Answering | 2WikiMultiHopQA offline Wiki-18 (test val) | Exact Match 43.6 | 24 | |
| Multi-hop Question Answering | 2WikiMultiHopQA N=200 | Judge EM 77 | 24 | |
| Knowledge composition selection | 2WikiMultihopQA | Precision @ K=2 100 | 23 | |
| Latent multi-hop reasoning | 2WikiMultiHopQA | Precision 96.86 | 22 | |
| Multi-hop Question Answering | 2WikiMultiHopQA Full | Accuracy (C) 87.5 | 22 | |
| Retrieval | 2WikiMultiHopQA v1 (test) | R@2E 85 | 21 |