| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination Detection | CoQA | AUROC | 84.92 | 108 |
| Hallucination Detection | CoQA | Mean AUROC | 0.8584 | 107 |
| Selective Generation | CoQA | ROC-AUC | 74.7 | 66 |
| Question Answering | CoQA | CACC | 76.31 | 64 |
| Uncertainty Estimation | CoQA | AUROC | 0.857 | 58 |
| Question Answering | CoQA | PRR | 0.423 | 44 |
| Hallucination Detection | CoQA | AUCs | 77.5 | 42 |
| Uncertainty Estimation | CoQA (test) | AUROC | 77.3 | 42 |
| Question Answering | CoQA alpha = 0.25 (test) | Empirical Error Rate (EER) | 0.2347 | 40 |
| Question Answering | CoQA alpha = 0.25 (filtering stage) | EER | 23.47 | 40 |
| Hallucination Detection | CoQA | AUROC | 91.74 | 39 |
| Language Generation | CoQA | Accuracy | 65.5 | 35 |
| Conversational Question Answering | CoQA zero-shot (test) | Exact Match (EM) | 70.85 | 32 |
| Conversational Question Answering | CoQA | Accuracy | 75.9 | 29 |
| Question Answering | CoQA | F1 Score | 76 | 28 |
| Conversational Question Answering | CoQA | PRR | 40.7 | 22 |
| Free-form Text Generation | CoQA | Accuracy | 94.61 | 22 |
| Question Answering | CoQA | Factual Accuracy | 28.27 | 21 |
| Hallucination Detection | CoQA | AUROC | 0.98 | 20 |
| Selective Prediction | CoQA | PRR | 80.6 | 20 |
| Hallucination Detection | CoQA | AUPRC | 89.01 | 20 |
| Conversational Question Answering | CoQA official (test) | Overall F1 | 88.8 | 17 |
| Poisoned Sample Detection | CoQA (IID) | Recall | 100 | 16 |
| Poisoned Sample Detection | CoQA (NIID-1) | Recall | 100 | 16 |
| Question Answering | CoQA | PR-AUC | 60 | 16 |