| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-Hop Question Answering | Bamboogle | Exact Match | 48 | 97 |
| Question Answering | Bamboogle | EM | 60 | 62 |
| Multi-hop Question Answering | Bamboogle | Accuracy | 75.2 | 52 |
| Multi-hop Question Answering | Bamboogle (test) | EM | 57.6 | 46 |
| Multi-Hop Question Answering | Bamboogle | EM | 32.23 | 37 |
| Error Detection | Bamboogle Full | Precision | 100 | 36 |
| Error Detection | Bamboogle | F1 Score | 0.94 | 36 |
| Multi-Hop Question Answering | Bamboogle | F1 | 61.6 | 25 |
| Confidence Calibration in Retrieval-Augmented Generation | Bamboogle k=5 OOD (test) | ECE | 0.065 | 24 |
| Calibration | Bamboogle | ECE | 0.113 | 24 |
| Question Answering | Bamboogle (test) | EM (%) | 35.5 | 18 |
| Knowledge-Intensive Reasoning | Bamboogle | F1 | 73.8 | 18 |
| Multi-Hop Question Answering | Bamboogle | EM | 42.4 | 18 |
| Question Answering | Bamboogle | Cover Exact Match | 62.4 | 18 |
| Multi-Hop Question Answering | Bamboogle out-of-domain (val test) | Exact Match (EM) | 56.4 | 15 |
| Multi-hop Question Answering | Bamboogle out-of-domain (test) | Accuracy (R) | 68.8 | 14 |
| Question Answering | Bamboogle 500 samples (val) | EM | 34.6 | 14 |
| Agentic Search | Bamboogle | LJFT Score | 64.8 | 12 |
| Compositional multi-hop QA | Bamboogle | Success Rate | 77.6 | 12 |
| Multi-Hop Question Answering | Bamboogle (out-of-domain) | Accuracy | 73.8 | 10 |
| Question Answering | Bamboogle multi-hop (test) | Avg@16 | 40.1 | 10 |
| Multi-step Reasoning | Bamboogle auto-eval (test) | Mean Accuracy | 76.1 | 10 |
| Multi-hop QA | Bamboogle | EM | 56 | 9 |
| Question Answering | Bamboogle | ECE | 0.521 | 8 |
| Multi-hop Open-domain Question Answering | Bamboogle | Accuracy | 77.3 | 6 |