| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-hop Question Answering | HotpotQA (test) | F1 80.79 | 311 |
| Multi-hop Question Answering | HotpotQA | F1 Score 77.4 | 294 |
| Hallucination Detection | HotpotQA | AUROC 0.928 | 249 |
| Question Answering | HotpotQA | EM 77.2 | 173 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) 50.2 | 150 |
| Multi-Hop QA | HotPotQA | Exact Match 65.6 | 143 |
| Question Answering | HotpotQA | F1 84.98 | 132 |
| RAG Performance Prediction | HotpotQA | QE 50.78 | 80 |
| Multi-hop Question Answering | HotpotQA | F1 74.9 | 79 |
| Open-domain Question Answering | HotpotQA | Accuracy 83.8 | 73 |
| Multi-hop Question Answering | HotpotQA | LLM Judge Score 80 | 72 |
| Long-context Question Answering | HotpotQA In-Distribution | Accuracy 85.2 | 72 |
| Uncertainty Quantification | HotpotQA 500 randomly sampled queries (test) | AUROC 83.25 | 70 |
| End-to-End Defense in RAG | HotpotQA | Attack Success Rate (ASR) 0 | 69 |
| Retrieval | HotpotQA | R@5 96.9 | 68 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) 47.1 | 66 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (test) | Answer F1 75.9 | 64 |
| Question Answering | HotpotQA PIA (test) | ASR 90.2 | 62 |
| Open-domain Question Answering | HotpotQA in-domain | F1 Score 72.4 | 57 |
| Error Detection | HotpotQA | AUROC 81 | 57 |
| Multi-hop Question Answering | HotPotQA | CoT Match Rate 100 | 54 |
| Multi-Hop Question Answering | HotpotQA | F1 58.9 | 54 |
| Retrieval-Augmented Generation | HotpotQA | Reliability Score (RS) 51.8 | 52 |
| General Text Question Answering | HotpotQA | Accuracy 86.7 | 51 |
| Multi-hop Question Answering | HotpotQA | Exact Match (EM) 55 | 50 |