HotpotQA

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA (test) | F1 | 80.79 | 311 |
| Multi-hop Question Answering | HotpotQA | F1 Score | 77.4 | 294 |
| Hallucination Detection | HotpotQA | AUROC | 0.928 | 249 |
| Question Answering | HotpotQA | EM | 77.2 | 173 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) | 50.2 | 150 |
| Multi-Hop QA | HotPotQA | Exact Match | 65.6 | 143 |
| Question Answering | HotpotQA | F1 | 84.98 | 132 |
| RAG Performance Prediction | HotpotQA | QE | 50.78 | 80 |
| Multi-hop Question Answering | HotpotQA | F1 | 74.9 | 79 |
| Open-domain Question Answering | HotpotQA | Accuracy | 83.8 | 73 |
| Multi-hop Question Answering | HotpotQA | LLM Judge Score | 80 | 72 |
| Long-context Question Answering | HotpotQA In-Distribution | Accuracy | 85.2 | 72 |
| Uncertainty Quantification | HotpotQA 500 randomly sampled queries (test) | AUROC | 83.25 | 70 |
| End-to-End Defense in RAG | HotpotQA | Attack Success Rate (ASR) | 0 | 69 |
| Retrieval | HotpotQA | R@5 | 96.9 | 68 |
| Multi-Hop Question Answering | HotpotQA | Exact Match (EM) | 47.1 | 66 |
| Multi-hop Question Answering | HotpotQA fullwiki setting (test) | Answer F1 | 75.9 | 64 |
| Question Answering | HotpotQA PIA (test) | ASR | 90.2 | 62 |
| Open-domain Question Answering | HotpotQA in-domain | F1 Score | 72.4 | 57 |
| Error Detection | HotpotQA | AUROC | 81 | 57 |
| Multi-hop Question Answering | HotPotQA | CoT Match Rate | 100 | 54 |
| Multi-Hop Question Answering | HotpotQA | F1 | 58.9 | 54 |
| Retrieval-Augmented Generation | HotpotQA | Reliability Score (RS) | 51.8 | 52 |
| General Text Question Answering | HotpotQA | Accuracy | 86.7 | 51 |
| Multi-hop Question Answering | HotpotQA | Exact Match (EM) | 55 | 50 |
Showing 25 of 563 rows
...