
HotpotQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-hop Question Answering | HotpotQA | F1 Score 75.39 | 221 |
| Multi-hop Question Answering | HotpotQA (test) | F1 80.79 | 198 |
| Hallucination Detection | HotpotQA | AUROC 0.928 | 118 |
| Question Answering | HotpotQA | F1 84.98 | 114 |
| Multi-hop Question Answering | HotpotQA | F1 74.9 | 79 |
| Question Answering | HotpotQA | EM 77.2 | 79 |
| Long-context Question Answering | HotpotQA In-Distribution | Accuracy 85.2 | 72 |
| Uncertainty Quantification | HotpotQA 500 randomly sampled queries (test) | AUROC 83.25 | 70 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (test) | Answer F1 75.9 | 64 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) 47.68 | 56 |
| Retrieval-Augmented Generation | HotpotQA | Reliability Score (RS) 51.8 | 52 |
| Multi-hop Question Answering | HotpotQA | F1 75.97 | 48 |
| Answer extraction and supporting sentence prediction | HotpotQA fullwiki (test) | Answer EM 67.5 | 48 |
| Question Answering | HotpotQA distractor (dev) | Answer F1 84.2 | 45 |
| Question Answering | HotpotQA (dev) | Answer F1 81 | 43 |
| Multi-hop Question Answering | HotpotQA (dev) | Answer F1 81.62 | 43 |
| Indirect Prompt Injection | HotpotQA | ASR 100 | 42 |
| Multi-Hop Question Answering | HotpotQA | SubEM 39.12 | 40 |
| Question Answering | HotpotQA (test) | EM 57.2 | 39 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (dev) | Answer F1 81.5 | 38 |
| Question Answering | HotpotQA (test) | Ans F1 82.2 | 37 |
| Question Answering | HotpotQA | F1 Score 69.5 | 36 |
| Error Detection | HotpotQA (val) | Precision 100 | 36 |
| Error Detection | HotpotQA | F1 Score 91 | 36 |
| Question Answering | HotpotQA distractor setting (test) | Answer F1 82.2 | 34 |
Showing 25 of 266 rows