| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-Hop Question Answering | Bamboogle | Exact Match | 56 | 128 |
| Question Answering | Bamboogle | EM | 60 | 120 |
| Multi-hop Question Answering | Bamboogle (test) | EM | 57.6 | 84 |
| Multi-hop Question Answering | Bamboogle | Accuracy | 75.2 | 62 |
| Multi-Hop Question Answering | Bamboogle | EM | 78.4 | 51 |
| Reasoning | Bamboogle | Accuracy | 73 | 46 |
| Question Answering | Bamboogle | EM Accuracy (%) | 48 | 45 |
| Error Detection | Bamboogle Full | Precision | 100 | 36 |
| Error Detection | Bamboogle | F1 Score | 0.94 | 36 |
| Multi-Hop Question Answering | Bamboogle (test) | Exact Match (EM) | 74.2 | 33 |
| Multi-hop QA | Bamboogle | EM | 56 | 27 |
| Multi-Hop QA | Bamboogle | Accuracy (%) | 74.9 | 25 |
| Multi-Hop Question Answering | Bamboogle | F1 | 61.6 | 25 |
| Confidence Calibration in Retrieval-Augmented Generation | Bamboogle k=5 OOD (test) | ECE | 0.065 | 24 |
| Calibration | Bamboogle | ECE | 0.113 | 24 |
| Question Answering | Bamboogle (test) | EM (%) | 53.6 | 21 |
| Multi-hop Question Answering | Bamboogle standard (val) | Exact Match (EM) | 40 | 20 |
| Multi-Hop Question Answering | Bamboogle (out-of-domain) | Accuracy | 73.8 | 19 |
| Knowledge-Intensive Reasoning | Bamboogle | F1 | 73.8 | 18 |
| Multi-Hop Question Answering | Bamboogle | EM | 42.4 | 18 |
| Question Answering | Bamboogle | Cover Exact Match | 62.4 | 18 |
| Multi-hop Question Answering | Bamboogle (dev test) | F1 Score | 68.2 | 17 |
| Multi-Hop Question Answering | Bamboogle | Score | 50 | 16 |
| Multi-Hop Question Answering | Bamboogle out-of-domain (val test) | Exact Match (EM) | 56.4 | 15 |
| Agentic Search | Bamboogle | String-F1 Score | 73.1 | 14 |