| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-hop Question Answering | MuSiQue | EM 46 | 185 |
| Multi-hop Question Answering | MuSiQue (test) | F1 50.9 | 111 |
| Question Answering | MuSiQue | EM 39.6 | 84 |
| Uncertainty Quantification | MuSiQue 500 randomly sampled queries (test) | AUROC 0.8322 | 70 |
| Question Answering | MuSiQue | F1 Score 52.27 | 70 |
| Multi-hop QA | MuSiQue | EM 77.2 | 65 |
| Multi-Hop Question Answering | MuSiQue | Exact Match (EM) 25.3 | 58 |
| Question Answering | MuSiQue | EM 26 | 50 |
| Question Answering | MuSiQue (test) | F1 Score 59.8 | 43 |
| Multi-hop Reasoning | MuSiQue | EM 53 | 41 |
| Multi-hop Question Answering | MuSiQue | F1 46.1 | 38 |
| Multi-hop Question Answering | MuSiQue (test) | Token Cost 4,987 | 36 |
| Multi-hop QA Retrieval | MuSiQue | R@2 54.8 | 36 |
| Error Detection | MuSiQue (val) | Precision 1 | 36 |
| Error Detection | MuSiQue | F1 Score 0.93 | 36 |
| Question Answering | MuSiQue | Accuracy (ACC) 79.9 | 36 |
| Question Answering | MuSiQue | LLM Accuracy 74.1 | 34 |
| Multi-hop QA Retrieval | MuSiQue (test) | R@5 81.46 | 33 |
| Knowledge-intensive Reasoning | MuSiQue | Accuracy 87 | 31 |
| Poisoning Attack | MuSiQue | Attack Success Rate (ASR) 87.9 | 30 |
| Long-context understanding | MuSiQue | SubEM 51 | 27 |
| Multi-hop Reasoning | MuSiQue | Accuracy 39.6 | 27 |
| Multi-hop Reasoning | MuSiQue IRCoT 500 samples (test) | ACC 25.59 | 27 |
| Question Answering | MuSiQue | F1 Score 81.3 | 25 |
| Multi-hop Question Answering | MuSiQue | Accuracy 47.55 | 24 |