| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Question Answering | PopQA | Accuracy 68.4 | 186 | |
| Single-Hop Question Answering | PopQA | EM 61.6 | 104 | |
| Hallucination Detection | PopQA | AUC 96.18 | 88 | |
| Question Answering | PopQA | EM 51.6 | 88 | |
| Question Answering | PopQA (test) | Accuracy 68.3 | 72 | |
| Uncertainty Quantification | PopQA 500 randomly sampled queries (test) | AUROC 0.8709 | 70 | |
| Question Answering | PopQA | Accuracy 74.29 | 52 | |
| General Question Answering | PopQA | EM 45.2 | 51 | |
| Question Answering | PopQA | Score 43.93 | 50 | |
| Factual Knowledge Evaluation | PopQA | Accuracy 18 | 32 | |
| Question Answering | PopQA | F1 Score 59.9 | 30 | |
| General QA | PopQA | Exact Match (EM) 52 | 28 | |
| Question Answering | PopQA | Accuracy (Acc) 70.7 | 26 | |
| Question Answering | PopQA | Exact Match 47 | 25 | |
| Abstention | PopQA (test) | AUARC 66.06 | 25 | |
| Abstention | PopQA | Abstain Accuracy 81.6 | 25 | |
| Question Answering | PopQA longtail | EM 45.96 | 23 | |
| Single-hop Question Answering | PopQA (test) | Accuracy 44.2 | 21 | |
| Hallucination Detection | PopQA | AUPRC 67.08 | 20 | |
| Question Answering | PopQA | FAR (Overall) 52.3 | 19 | |
| Retrieval | PopQA | R@5 65.5 | 19 | |
| General Question Answering | PopQA | Accuracy 48.8 | 18 | |
| Question Answering | PopQA | EM 41.6 | 17 | |
| Question Answering | PopQA | EM 34.2 | 17 | |
| Calibration | PopQA | ECE 0.018 | 16 | |