| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | Bamboogle | EM 60 | 227 |
| Multi-Hop Question Answering | Bamboogle | Exact Match 56 | 128 |
| Multi-hop Question Answering | Bamboogle (test) | EM 57.6 | 98 |
| Question Answering | Bamboogle | EM Accuracy (%) 48 | 68 |
| Multi-hop Question Answering | Bamboogle | Accuracy 75.2 | 62 |
| Question Answering | Bamboogle | EM 64.1 | 61 |
| Multi-Hop Question Answering | Bamboogle | Exact Match (EM) 54.4 | 55 |
| Multi-Hop Question Answering | Bamboogle | EM 78.4 | 51 |
| Question Answering | Bamboogle (test) | EM (%) 53.6 | 47 |
| Multi-Hop QA | Bamboogle | Exact Match (EM) 57.8 | 46 |
| Reasoning | Bamboogle | Accuracy 73 | 46 |
| Multi-Hop Question Answering | Bamboogle | Accuracy 47.6 | 44 |
| Error Detection | Bamboogle Full | Precision 100 | 36 |
| Error Detection | Bamboogle | F1 Score 0.94 | 36 |
| Multi-Hop Question Answering | Bamboogle (test) | Exact Match (EM) 74.2 | 33 |
| Multi-hop QA | Bamboogle | EM 56 | 27 |
| Multi-Hop QA | Bamboogle | Accuracy (%) 74.9 | 25 |
| Multi-Hop Question Answering | Bamboogle | F1 61.6 | 25 |
| Open-domain Question Answering | Bamboogle out-of-domain | F1 71.7 | 24 |
| Multi-hop Question Answering | Bamboogle online Google Search API (test val) | Exact Match 68.7 | 24 |
| Multi-hop Question Answering | Bamboogle offline Wiki-18 (test val) | Exact Match (EM) 53.4 | 24 |
| Confidence Calibration in Retrieval-Augmented Generation | Bamboogle k=5 OOD (test) | ECE 0.065 | 24 |
| Calibration | Bamboogle | ECE 0.113 | 24 |
| Knowledge-Intensive Reasoning | Bamboogle | F1 73.8 | 23 |
| Multi-hop Question Answering | Bamboogle | EM 57.6 | 21 |