| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | F1 Score 77.4 | 294 |
| Multi-hop Question Answering | HotpotQA (test) | F1 80.79 | 255 |
| Hallucination Detection | HotpotQA | AUROC 0.928 | 163 |
| Question Answering | HotpotQA | F1 84.98 | 128 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) 50.2 | 117 |
| Question Answering | HotpotQA | EM 77.2 | 109 |
| RAG Performance Prediction | HotpotQA | QE 50.78 | 80 |
| Multi-hop Question Answering | HotpotQA | F1 74.9 | 79 |
| Multi-Hop QA | HotPotQA | Exact Match 65.6 | 76 |
| Long-context Question Answering | HotpotQA In-Distribution | Accuracy 85.2 | 72 |
| Uncertainty Quantification | HotpotQA 500 randomly sampled queries (test) | AUROC 83.25 | 70 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (test) | Answer F1 75.9 | 64 |
| Multi-hop Question Answering | HotPotQA | CoT Match Rate 100 | 54 |
| Retrieval-Augmented Generation | HotpotQA | Reliability Score (RS) 51.8 | 52 |
| Multi-hop Question Answering | HotpotQA | F1 75.97 | 48 |
| Answer extraction and supporting sentence prediction | HotpotQA fullwiki (test) | Answer EM 67.5 | 48 |
| Question Answering | HotpotQA distractor (dev) | Answer F1 84.2 | 45 |
| Question Answering | HotpotQA (dev) | Answer F1 81 | 43 |
| Multi-hop Question Answering | HotpotQA (dev) | Answer F1 81.62 | 43 |
| Indirect Prompt Injection | HotpotQA | ASR 100 | 42 |
| Question Answering | HotpotQA | Recall 89.5 | 42 |
| RAG Attack | HotpotQA | Attack Success Rate (ASR) 96.1 | 41 |
| Multi-Hop Question Answering | HotpotQA | SubEM 39.12 | 40 |
| Question Answering | HotpotQA (test) | EM 57.2 | 39 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (dev) | Answer F1 81.5 | 38 |