| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-hop Question Answering | 2WikiMQA | F1 Score | 76.4 | 161 |
| Question Answering | 2WikiMQA | F1 | 74.9 | 44 |
| Long-context Question Answering | 2WikiMQA | SubEM | 79.5 | 36 |
| Multi-hop Reasoning | 2WikiMQA IRCoT 500 samples (test) | ACC | 52.8 | 27 |
| Multimodal Question Answering | 2WikiMQA | F1-Recall | 55.47 | 22 |
| Long-context Question Answering | 2WikiMQA (Passage Split) | Score | 52.53 | 18 |
| Long-context Question Answering | 2WikiMQA Fixed Chunk 2048 | QA Score | 52.53 | 18 |
| Question Answering | 2WikiMQA (test) | EM | 35.9 | 18 |
| Retrieval | 2WikiMQA (test) | Recall@K | 69.7 | 8 |
| Multi-hop Question Answering | 2WikiMQA (test) | Exact Match | 48.6 | 7 |
| Question Answering | 2WikiMQA (sampled) | Accuracy | 0.63 | 4 |