| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | TriviaQA | AUROC 0.95 | 265 |
| Question Answering | TriviaQA | Accuracy 86.68 | 210 |
| Hallucination Detection | TriviaQA (test) | AUC-ROC 92.23 | 169 |
| Question Answering | TriviaQA (test) | Accuracy 85.18 | 121 |
| Question Answering | TriviaQA | EM 86.1 | 116 |
| Question Answering | TriviaQA | Accuracy 94.5 | 85 |
| Open-Domain Question Answering | TriviaQA (test) | Exact Match 72.6 | 80 |
| Uncertainty Estimation | TriviaQA (test) | AUROC 82.12 | 78 |
| Passage retrieval | TriviaQA (test) | Top-100 Acc 90.1 | 67 |
| Question Answering | TriviaQA | ACC 75 | 62 |
| Open-domain Question Answering | TriviaQA | EM 76.1 | 62 |
| Single-hop Question Answering | TriviaQA | EM 72 | 62 |
| Open-domain Question Answering | TriviaQA open (test) | EM 73.3 | 59 |
| Question Answering | TriviaQA (TQA) | EM 71.1 | 56 |
| Retrieval-Augmented Generation (RAG) | TriviaQA | Reliability Score 80.67 | 52 |
| Question Answering | TriviaQA | C 79.9 | 48 |
| Question Answering | TriviaQA Wiki (val) | Exact Match (EM) 87.6 | 48 |
| Question Answering | TriviaQA | F1 89.02 | 46 |
| Question Answering | TriviaQA (TQA) (test) | Robust Accuracy 75.4 | 45 |
| Open-Domain Question Answering | TriviaQA | SubEM 74.01 | 40 |
| Question Answering | TriviaQA | C 79.9 | 40 |
| End-to-end Open-Domain Question Answering | TriviaQA (test) | Exact Match (EM) 71.5 | 40 |
| Calibration | TriviaQA | Brier Score 0.0845 | 39 |
| General Question Answering | TriviaQA | Exact Match 69.02 | 39 |
| Uncertainty Estimation | TriviaQA | AUROC 83.63 | 37 |