| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Multi-hop Question Answering | MuSiQue (test) | F1 50.9 | 111 | |
| Multi-hop Question Answering | Musique | EM 40.5 | 106 | |
| Question Answering | MuSiQue | EM 39.6 | 84 | |
| Uncertainty Quantification | Musique 500 randomly sampled queries (test) | AUROC 0.8322 | 70 | |
| Question Answering | MuSiQue | F1 Score 52.27 | 60 | |
| Question Answering | MuSiQue (test) | F1 Score 59.8 | 43 | |
| Multi-hop QA | MuSiQue | EM 52.8 | 42 | |
| Multi-hop Reasoning | MuSiQue | EM 53 | 41 | |
| Error Detection | MuSiQue (val) | Precision 1 | 36 | |
| Error Detection | MuSiQue | F1 Score 0.93 | 36 | |
| Question Answering | MuSiQue | Accuracy (ACC) 79.9 | 36 | |
| Knowledge-intensive Reasoning | Musique | Accuracy 87 | 31 | |
| Multi-hop QA Retrieval | MuSiQue (test) | R@2 59.6 | 28 | |
| Multi-hop Reasoning | MuSiQue IRCoT 500 samples (test) | ACC 25.59 | 27 | |
| Multi-Hop Question Answering | MuSiQue | Exact Match (EM) 12.5 | 27 | |
| Multi-hop Question Answering | MuSiQue | Acc 43.6 | 26 | |
| Multi-hop Question Answering | MuSiQue | ACCE 28.4 | 24 | |
| Question Answering | MuSiQue entity-level knowledge conflict (test) | Mean Rank 7.7 | 24 | |
| Multi-hop Question Answering | MuSiQue Full | C Score 80.1 | 22 | |
| Multi-hop Question Answering | MusiQue answerable setting | Conciseness 38.98 | 21 | |
| RAG Question Answering | Musique | F1 Score 20.5 | 20 | |
| Multi-hop Question Answering | MusiQue | EM 17.6 | 20 | |
| Question Answering | MuSiQue | LLM Accuracy 74.1 | 20 | |
| End-to-end Question Answering | MuSiQue (test val) | EM 10.79 | 20 | |
| Knowledge-Intensive Reasoning | MuSiQue | F1 Score 34.8 | 18 | |