| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Single-Hop Question Answering | PopQA | EM 73.6 | 186 |
| Question Answering | PopQA | Accuracy 68.4 | 186 |
| Question Answering | PopQA | Exact Match 64.2 | 133 |
| Question Answering | PopQA (test) | Accuracy 77.2 | 111 |
| Question Answering | PopQA | Accuracy 87.12 | 103 |
| Question Answering | PopQA | EM 51.6 | 98 |
| Hallucination Detection | PopQA | AUC 96.18 | 97 |
| Uncertainty Quantification | PopQA 500 randomly sampled queries (test) | AUROC 0.8709 | 70 |
| General QA | PopQA | Exact Match (EM) 52 | 58 |
| Factual Knowledge Evaluation | PopQA | Accuracy 35.3 | 56 |
| General Question Answering | PopQA | EM 45.2 | 51 |
| Question Answering | PopQA | Score 43.93 | 50 |
| Knowledge Retrieval | PopQA | F1 Score 67.85 | 45 |
| Simple Question Answering | PopQA (test) | RAG F1 65.24 | 36 |
| Single-hop Question Answering | PopQA (test) | Accuracy 51.5 | 33 |
| Question Answering | PopQA | F1 Score 59.9 | 30 |
| Question Answering | PopQA | EM (%) 47.82 | 27 |
| Question Answering | PopQA | EM 46.1 | 27 |
| Question Answering | PopQA | Accuracy (Acc) 70.7 | 26 |
| Abstention | PopQA (test) | AUARC 66.06 | 25 |
| Abstention | PopQA | Abstain Accuracy 81.6 | 25 |
| Hallucination Detection | PopQA n=1000 (test) | AUROC 0.895 | 24 |
| Question Answering | PopQA longtail | EM 45.96 | 23 |
| Hallucination Detection | PopQA | AUPRC 67.08 | 20 |
| Question Answering | PopQA | FAR (Overall) 52.3 | 19 |