
PopQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Single-Hop Question Answering | PopQA | EM 73.6 | 186 |
| Question Answering | PopQA | Accuracy 68.4 | 186 |
| Question Answering | PopQA | Exact Match 64.2 | 133 |
| Question Answering | PopQA (test) | Accuracy 77.2 | 111 |
| Question Answering | PopQA | Accuracy 87.12 | 103 |
| Question Answering | PopQA | EM 51.6 | 98 |
| Hallucination Detection | PopQA | AUC 96.18 | 97 |
| Uncertainty Quantification | PopQA 500 randomly sampled queries (test) | AUROC 0.8709 | 70 |
| General QA | PopQA | Exact Match (EM) 52 | 58 |
| Factual Knowledge Evaluation | PopQA | Accuracy 35.3 | 56 |
| General Question Answering | PopQA | EM 45.2 | 51 |
| Question Answering | PopQA | Score 43.93 | 50 |
| Knowledge Retrieval | PopQA | F1 Score 67.85 | 45 |
| Simple Question Answering | PopQA (test) | RAG F1 65.24 | 36 |
| Single-hop Question Answering | PopQA (test) | Accuracy 51.5 | 33 |
| Question Answering | PopQA | F1 Score 59.9 | 30 |
| Question Answering | PopQA | EM (%) 47.82 | 27 |
| Question Answering | PopQA | EM 46.1 | 27 |
| Question Answering | PopQA | Accuracy (Acc) 70.7 | 26 |
| Abstention | PopQA (test) | AUARC 66.06 | 25 |
| Abstention | PopQA | Abstain Accuracy 81.6 | 25 |
| Hallucination Detection | PopQA n=1000 (test) | AUROC 0.895 | 24 |
| Question Answering | PopQA longtail | EM 45.96 | 23 |
| Hallucination Detection | PopQA | AUPRC 67.08 | 20 |
| Question Answering | PopQA | FAR (Overall) 52.3 | 19 |
Showing 25 of 105 rows