| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | TriviaQA | AUROC 0.95 | 438 |
| Question Answering | TriviaQA | Accuracy 86.68 | 238 |
| Hallucination Detection | TriviaQA (test) | AUC-ROC 92.9 | 183 |
| Question Answering | TriviaQA | EM 86.1 | 182 |
| Question Answering | TriviaQA (test) | Accuracy 85.18 | 121 |
| Question Answering | TriviaQA | Accuracy 94.5 | 112 |
| Uncertainty Estimation | TriviaQA (test) | AUROC 87.91 | 104 |
| Single-hop Question Answering | TriviaQA | EM 72 | 81 |
| RAG Performance Prediction | TriviaQA | QE5 Score 0.889 | 80 |
| Open-Domain Question Answering | TriviaQA (test) | Exact Match 72.6 | 80 |
| Uncertainty Estimation | TriviaQA | AUROC 85.56 | 77 |
| Passage retrieval | TriviaQA (test) | Top-100 Acc 90.1 | 67 |
| Question Answering | TriviaQA | ACC 75 | 62 |
| Open-domain Question Answering | TriviaQA | EM 76.1 | 62 |
| Open-domain Question Answering | TriviaQA open (test) | EM 73.3 | 59 |
| Question Answering | TriviaQA (test) | EM 92.1 | 58 |
| Question Answering | TriviaQA (TQA) | EM 71.1 | 56 |
| Question Answering | TriviaQA | BLEU 36.84 | 54 |
| General Question Answering | TriviaQA | Exact Match 69.02 | 54 |
| Retrieval-Augmented Generation (RAG) | TriviaQA | Reliability Score 80.67 | 52 |
| Question Answering | TriviaQA Wiki (val) | Exact Match (EM) 87.6 | 52 |
| Question Answering | TriviaQA | C 79.9 | 48 |
| Question Answering | TriviaQA | F1 89.02 | 46 |
| Correctness Prediction | TriviaQA | AUROC 0.852 | 45 |
| Question Answering | TriviaQA (TQA) (test) | Robust Accuracy 75.4 | 45 |