| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-hop Question Answering | Musique | EM 46 | 209 |
| Multi-hop Question Answering | MuSiQue (test) | F1 55.68 | 128 |
| Multi-hop QA | MuSiQue | EM 77.2 | 95 |
| Question Answering | MuSiQue | EM 39.6 | 84 |
| Question Answering | MuSiQue | F1 Score 50 | 80 |
| Question Answering | MuSiQue | F1 Score 52.27 | 79 |
| Question Answering | MuSiQue (test) | EM 48 | 76 |
| Question Answering | Musique | EM 22.92 | 71 |
| Uncertainty Quantification | Musique 500 randomly sampled queries (test) | AUROC 0.8322 | 70 |
| Question Answering | Musique | EM 26 | 62 |
| Multi-Hop Question Answering | MuSiQue | Exact Match (EM) 25.3 | 58 |
| Open-domain Question Answering | MusiQue out-of-domain | F1 35.8 | 57 |
| Question Answering | MuSiQue | F1 Score 81.3 | 54 |
| Multi-Hop Question Answering | MuSiQue | Exact Match (EM) 22.6 | 51 |
| Multi-hop Question Answering | MuSiQue | EM 40 | 50 |
| Multi-hop Question Answering | MusiQue | EM 36.8 | 50 |
| Multi-hop Reasoning | MuSiQue | Accuracy 51 | 48 |
| Retrieval | Musique | F1 Score 28.91 | 45 |
| Multi-hop Question Answering | MuSiQue | String Accuracy 48.4 | 44 |
| Knowledge-Intensive Reasoning | MuSiQue | F1 Score 34.8 | 43 |
| Question Answering | MuSiQue (held-out) | F1 Score 57.7 | 42 |
| Multi-hop Reasoning | MuSiQue | EM 53 | 41 |
| Question Answering | MuSiQue | EM 38.7 | 38 |
| Multi-hop Question Answering | MuSiQue | F1 46.1 | 38 |
| Text Question Answering | MuSiQue | Accuracy 69.6 | 37 |