| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | PopQA | Accuracy 68.4 | 186 |
| Hallucination Detection | PopQA | AUC 96.18 | 88 |
| Question Answering | PopQA | EM 51.6 | 80 |
| Uncertainty Quantification | PopQA 500 randomly sampled queries (test) | AUROC 0.8709 | 70 |
| Single-Hop Question Answering | PopQA | EM 61.6 | 55 |
| Question Answering | PopQA | Score 43.93 | 50 |
| Question Answering | PopQA (test) | Accuracy 65.4 | 39 |
| General Question Answering | PopQA | EM 44.8 | 36 |
| Factual Knowledge Evaluation | PopQA | Accuracy 18 | 32 |
| Abstention | PopQA (test) | AUARC 66.06 | 25 |
| Abstention | PopQA | Abstain Accuracy 81.6 | 25 |
| Question Answering | PopQA longtail | EM 45.96 | 23 |
| Single-hop Question Answering | PopQA (test) | Accuracy 44.2 | 21 |
| General Question Answering | PopQA | Accuracy 48.8 | 18 |
| Question Answering | PopQA | EM 34.2 | 17 |
| Question Answering | PopQA (Frequent) | Exact Match (EM) 52.7 | 16 |
| Question Answering | PopQA Infrequent | Exact Match Accuracy 42.9 | 16 |
| Question Answering | PopQA | Accuracy 41.3 | 16 |
| General QA Verification | PopQA | P@1 90.14 | 16 |
| Uncertainty Estimation | PopQA | A1 83.09 | 16 |
| Question Answering | PopQA v1.0 (test) | A1 83.07 | 16 |
| General Question Answering | PopQA out-of-domain (val test) | Exact Match (EM) 50.1 | 15 |
| Retrieval | POPQA Long-tail | Recall@10 74.5 | 14 |
| Question Answering | PopQA | F1 Score 59.9 | 14 |
| Open-domain Question Answering | PopQA | Accuracy 65.7 | 11 |