| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | 2WIKI | F1 72.37 | 75 |
| Multi-hop Question Answering | 2Wiki | F1 Score 70.58 | 41 |
| Multi-hop QA | 2Wiki | EM 62 | 26 |
| Multi-hop Question Answering | 2Wiki (test) | F1 Score 69.7 | 20 |
| Question Answering | 2Wiki 30K context | Accuracy 73.7 | 19 |
| Question Answering | 2Wiki 10K context | Accuracy 72.2 | 19 |
| Question Answering | 2Wiki 100K context | Accuracy 65.5 | 18 |
| Multi-Hop QA Verification | 2wiki | P@1 81.21 | 16 |
| Question Answering | 2WIKI (val) | EM 27.2 | 14 |
| Question Answering | 2WIKI (out-of-domain) | EM 40 | 14 |
| Question Answering | 2Wiki 500 samples (val) | EM 39.6 | 14 |
| Query Rewriting & QA | 2Wiki BM25 | F1 33.6 | 12 |
| Open-domain Question Answering | 2WIKI | Accuracy 48.9 | 11 |
| Multi-Hop Question Answering | 2Wiki (out-of-domain) | Accuracy 42 | 10 |
| Expected Calibration Error | 2Wiki | ECE 17.43 | 10 |
| Multi-Hop QA | 2Wiki (test) | EM 57.5 | 10 |
| Question Answering | 2Wiki Normal | F1 Score 23.63 | 8 |
| Deep Research | 2WIKI (test) | Mean Correct Rate 0.92 | 8 |
| Question Answering | 2Wiki Extreme | F1 Score 26.99 | 7 |
| Question Answering | 2Wiki Noisy | F1 Score 24.17 | 7 |
| Multi-hop Question Answering | 2Wiki | FCR 67.6 | 7 |
| Retrieval | 2Wiki | Recall@5 87 | 7 |
| Multi-Hop Question Answering | 2Wiki Platinum (test) | Answer Rate 84.8 | 6 |
| Short-form Question Answering | 2wiki (test) | EM 26.5 | 5 |
| Question Answering | 2Wiki 1,000 samples (test) | F1 Score 0.388 | 3 |