| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | F1 Score 75.39 | 221 |
| Multi-hop Question Answering | HotpotQA (test) | F1 80.79 | 198 |
| Hallucination Detection | HotpotQA | AUROC 0.928 | 118 |
| Question Answering | HotpotQA | F1 84.98 | 114 |
| Multi-hop Question Answering | HotpotQA | F1 74.9 | 79 |
| Question Answering | HotpotQA | EM 77.2 | 79 |
| Long-context Question Answering | HotpotQA In-Distribution | Accuracy 85.2 | 72 |
| Uncertainty Quantification | HotpotQA 500 randomly sampled queries (test) | AUROC 83.25 | 70 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (test) | Answer F1 75.9 | 64 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) 47.68 | 56 |
| Retrieval-Augmented Generation | HotpotQA | Reliability Score (RS) 51.8 | 52 |
| Multi-hop Question Answering | HotpotQA | F1 75.97 | 48 |
| Answer extraction and supporting sentence prediction | HotpotQA fullwiki (test) | Answer EM 67.5 | 48 |
| Question Answering | HotpotQA distractor (dev) | Answer F1 84.2 | 45 |
| Question Answering | HotpotQA (dev) | Answer F1 81 | 43 |
| Multi-hop Question Answering | HotpotQA (dev) | Answer F1 81.62 | 43 |
| Indirect Prompt Injection | HotpotQA | ASR 100 | 42 |
| Multi-Hop Question Answering | HotpotQA | SubEM 39.12 | 40 |
| Question Answering | HotpotQA (test) | EM 57.2 | 39 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (dev) | Answer F1 81.5 | 38 |
| Question Answering | HotpotQA (test) | Ans F1 82.2 | 37 |
| Question Answering | HotpotQA | F1 Score 69.5 | 36 |
| Error Detection | HotpotQA (val) | Precision 100 | 36 |
| Error Detection | HotpotQA | F1 Score 91 | 36 |
| Question Answering | HotpotQA distractor setting (test) | Answer F1 82.2 | 34 |