PopQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | PopQA | Accuracy 68.4 | 186 |
| Single-Hop Question Answering | PopQA | EM 61.6 | 104 |
| Hallucination Detection | PopQA | AUC 96.18 | 88 |
| Question Answering | PopQA | EM 51.6 | 88 |
| Question Answering | PopQA (test) | Accuracy 68.3 | 72 |
| Uncertainty Quantification | PopQA 500 randomly sampled queries (test) | AUROC 0.8709 | 70 |
| Question Answering | PopQA | Accuracy 74.29 | 52 |
| General Question Answering | PopQA | EM 45.2 | 51 |
| Question Answering | PopQA | Score 43.93 | 50 |
| Factual Knowledge Evaluation | PopQA | Accuracy 18 | 32 |
| Question Answering | PopQA | F1 Score 59.9 | 30 |
| General QA | PopQA | Exact Match (EM) 52 | 28 |
| Question Answering | PopQA | Accuracy (Acc) 70.7 | 26 |
| Question Answering | PopQA | Exact Match 47 | 25 |
| Abstention | PopQA (test) | AUARC 66.06 | 25 |
| Abstention | PopQA | Abstain Accuracy 81.6 | 25 |
| Question Answering | PopQA longtail | EM 45.96 | 23 |
| Single-hop Question Answering | PopQA (test) | Accuracy 44.2 | 21 |
| Hallucination Detection | PopQA | AUPRC 67.08 | 20 |
| Question Answering | PopQA | FAR (Overall) 52.3 | 19 |
| Retrieval | PopQA | R@5 65.5 | 19 |
| General Question Answering | PopQA | Accuracy 48.8 | 18 |
| Question Answering | PopQA | EM 41.6 | 17 |
| Question Answering | PopQA | EM 34.2 | 17 |
| Calibration | PopQA | ECE 0.018 | 16 |

Showing 25 of 75 rows