| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultiHopQA | EM 71.7 | 278 |
| Multi-hop Question Answering | 2WikiMultiHopQA (test) | EM 64.6 | 143 |
| Question Answering | 2WikiMultihopQA | EM 47.7 | 73 |
| Multi-hop Question Answering | 2WikiMultiHopQA Out-Of-Distribution (OOD) | Accuracy 74.2 | 72 |
| Question Answering | 2WikiMultihopQA (test) | F1 78.9 | 69 |
| Long-context Question Answering | 2WikiMultiHopQA (Out-Of-Distribution) | Accuracy 63.9 | 54 |
| Multi-hop QA Retrieval | 2WikiMultihopQA (test) | R@2 80.8 | 28 |
| Question Answering | 2WikiMultihopQA | Accuracy 62.5 | 25 |
| Multi-hop Question Answering | 2WikiMultiHopQA N=200 | Judge EM 77 | 24 |
| Latent multi-hop reasoning | 2WikiMultiHopQA | Precision 96.86 | 22 |
| Multi-hop Question Answering | 2WikiMultiHopQA Full | Accuracy (C) 87.5 | 22 |
| Question Answering | 2WikiMultihopQA | LLM-Acc 89.7 | 20 |
| End-to-end Question Answering | 2WikiMultiHopQA (test val) | EM 35.44 | 20 |
| Knowledge-Intensive Reasoning | 2wikiMultiHopQA | F1 Score 76.1 | 18 |
| Knowledge-Intensive Reasoning | 2WikiMultiHopQA | Accuracy 48.8 | 18 |
| Multi-hop Question Answering | 2WikiMultiHopQA (2WikiMQA) (official evaluation) | Exact Match (EM) 31.8 | 17 |
| Multi-Hop Question Answering | 2WikiMultiHopQA out-of-domain (val test) | Exact Match (EM) 51.7 | 15 |
| Multi-hop Question Answering | 2WikiMultiHopQA in-domain (test) | Accuracy (Response) 69.8 | 14 |
| Question Answering | 2WikiMultiHopQA December 2018 Wikipedia dump (test) | EM 28.6 | 14 |
| Question Answering | 2WikiMultiHopQA 1,000 queries (test) | EM 71.1 | 13 |
| Question Answering | 2WikiMultihopQA | Prefilling Speedup Ratio 3.57 | 12 |
| Multi-step Retrieval | 2WikiMultihopQA (val) | F1 Score 68.02 | 11 |
| Question Answering | 2WikiMultiHopQA out-domain (test) | LasJ 56.32 | 11 |
| Multi-hop Question Answering | 2WikiMultiHopQA (dev) | Exact Match Accuracy 68.6 | 11 |
| Multi-hop Question Answering | 2WikiMultiHopQA v1.0 (test) | Task Latency (s) 2.29 | 9 |