
HotpotQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-hop Question Answering | HotpotQA | F1 Score 77.4 | 294 |
| Multi-hop Question Answering | HotpotQA (test) | F1 80.79 | 255 |
| Hallucination Detection | HotpotQA | AUROC 0.928 | 163 |
| Question Answering | HotpotQA | F1 84.98 | 128 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) 50.2 | 117 |
| Question Answering | HotpotQA | EM 77.2 | 109 |
| RAG Performance Prediction | HotpotQA | QE 50.78 | 80 |
| Multi-hop Question Answering | HotpotQA | F1 74.9 | 79 |
| Multi-Hop QA | HotPotQA | Exact Match 65.6 | 76 |
| Long-context Question Answering | HotpotQA In-Distribution | Accuracy 85.2 | 72 |
| Uncertainty Quantification | HotpotQA 500 randomly sampled queries (test) | AUROC 83.25 | 70 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (test) | Answer F1 75.9 | 64 |
| Multi-hop Question Answering | HotPotQA | CoT Match Rate 100 | 54 |
| Retrieval-Augmented Generation | HotpotQA | Reliability Score (RS) 51.8 | 52 |
| Multi-hop Question Answering | HotpotQA | F1 75.97 | 48 |
| Answer extraction and supporting sentence prediction | HotpotQA fullwiki (test) | Answer EM 67.5 | 48 |
| Question Answering | HotpotQA distractor (dev) | Answer F1 84.2 | 45 |
| Question Answering | HotpotQA (dev) | Answer F1 81 | 43 |
| Multi-hop Question Answering | HotpotQA (dev) | Answer F1 81.62 | 43 |
| Indirect Prompt Injection | HotpotQA | ASR 100 | 42 |
| Question Answering | HotpotQA | Recall 89.5 | 42 |
| RAG Attack | HotpotQA | Attack Success Rate (ASR) 96.1 | 41 |
| Multi-Hop Question Answering | HotpotQA | SubEM 39.12 | 40 |
| Question Answering | HotpotQA (test) | EM 57.2 | 39 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (dev) | Answer F1 81.5 | 38 |
Showing 25 of 391 rows
...