| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination Detection | CoQA | Mean AUROC | 0.8584 | 48 |
| Hallucination Detection | CoQA | AUCs | 77.5 | 42 |
| Uncertainty estimation | CoQA (test) | AUROC | 77.3 | 42 |
| Question Answering | CoQA alpha = 0.25 (test) | Empirical Error Rate (EER) | 0.2347 | 40 |
| Question Answering | CoQA alpha = 0.25 (filtering stage) | EER | 23.47 | 40 |
| Language Generation | CoQA | Accuracy | 65.5 | 35 |
| Conversational Question Answering | COQA zero-shot (test) | Exact Match (EM) | 70.85 | 32 |
| Conversational Question Answering | CoQA | Accuracy | 75.9 | 29 |
| Question Answering | COQA | Factual Accuracy | 28.27 | 21 |
| Conversational Question Answering | CoQA official (test) | Overall F1 | 88.8 | 17 |
| Question Answering | CoQA | PR-AUC | 60 | 16 |
| Conversational Question Answering | CoQA (dev) | Overall F1 | 0.849 | 14 |
| Conversational Question Answering | COQA | AIBC | 86.5 | 12 |
| Noisy-RAG Question Answering | CoQA | Exact Match (EM) | 92.4 | 11 |
| Conversational Question Answering | CoQA | F1 Score | 62.65 | 10 |
| Answer span extraction | CoQA (val) | EM | 63.65 | 9 |
| Question Generation | CoQA (val) | Distinct-1 | 68.35 | 9 |
| Answer-unaware Conversational Question Generation | CoQA (dev) | Distinct-1 | 84.09 | 9 |
| Conversational Question Answering | CoQA | EM | 60.3 | 8 |
| Question Answering | CoQA zero-shot (test) | F1 Score | 73 | 6 |
| Question Answering | CoQA (val test) | F1 | 73 | 6 |
| Reading Comprehension | CoQA (dev) | F1 Score | 85 | 6 |
| Conversational Question Answering | CoQA without human rewrites v1.0 (test) | Overall F1 | 83.4 | 6 |
| Dialogue Generation | CoQA CNN | BLEU | 15.11 | 5 |
| Dialogue Generation | CoQA (MCTest) | BLEU | 26.3 | 5 |