
CoQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | CoQA | Mean AUROC 0.8584 | 100 |
| Hallucination Detection | CoQA | AUCs 77.5 | 42 |
| Uncertainty estimation | CoQA (test) | AUROC 77.3 | 42 |
| Question Answering | CoQA | CACC 76.31 | 40 |
| Question Answering | CoQA alpha = 0.25 (test) | Empirical Error Rate (EER) 0.2347 | 40 |
| Question Answering | CoQA alpha = 0.25 (filtering stage) | EER 23.47 | 40 |
| Language Generation | CoQA | Accuracy 65.5 | 35 |
| Conversational Question Answering | COQA zero-shot (test) | Exact Match (EM) 70.85 | 32 |
| Conversational Question Answering | CoQA | Accuracy 75.9 | 29 |
| Question Answering | COQA | Factual Accuracy 28.27 | 21 |
| Selective Prediction | CoQA | PRR 80.6 | 20 |
| Hallucination Detection | CoQA | AUPRC 89.01 | 20 |
| Conversational Question Answering | CoQA official (test) | Overall F1 88.8 | 17 |
| Poisoned Sample Detection | CoQA (IID) | Recall 100 | 16 |
| Poisoned sample detection | CoQA (NIID-1) | Recall 100 | 16 |
| Question Answering | CoQA | PR-AUC 60 | 16 |
| Conversational Question Answering | CoQA (dev) | Overall F1 0.849 | 14 |
| Conversational Question Answering | COQA | AIBC 86.5 | 12 |
| Noisy-RAG Question Answering | CoQA | Exact Match (EM) 92.4 | 11 |
| Conversational Question Answering | CoQA | F1 Score 62.65 | 10 |
| Answer span extraction | CoQA (val) | EM 63.65 | 9 |
| Question Generation | CoQA (val) | Distinct-1 68.35 | 9 |
| Answer-unaware Conversational Question Generation | CoQA (dev) | Distinct-1 84.09 | 9 |
| Conversational Question Answering | CoQA | EM 60.3 | 8 |
| Question Answering | CoQA zero-shot (test) | F1 Score 73 | 6 |

Showing 25 of 40 rows