2WikiMultiHopQA

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Multi-hop Question Answering | 2WikiMultiHopQA | EM 82.1 | 387
Multi-hop Question Answering | 2WikiMultiHopQA (test) | EM 73.9 | 195
Question Answering | 2WikiMultihopQA | EM 47.7 | 107
Question Answering | 2WikiMultihopQA (test) | F1 78.9 | 81
Multi-hop Question Answering | 2WikiMultiHopQA Out-Of-Distribution (OOD) | Accuracy 74.2 | 72
Long-context Question Answering | 2WikiMultiHopQA (Out-Of-Distribution) | Accuracy 63.9 | 54
Multi-hop QA Retrieval | 2WikiMultihopQA (test) | R@5 97.2 | 33
Question Answering | 2WikiMultihopQA LongBench | F1 Score 53.61 | 28
Question Answering | 2WikiMultihopQA | Accuracy 62.5 | 25
Multi-hop Question Answering | 2WikiMultiHopQA (val) | ASR 95.4 | 24
Multi-hop Question Answering | 2WikiMultiHopQA N=200 | Judge EM 77 | 24
Knowledge composition selection | 2WikiMultihopQA | Precision @ K=2 100 | 23
Latent multi-hop reasoning | 2WikiMultiHopQA | Precision 96.86 | 22
Multi-hop Question Answering | 2WikiMultiHopQA Full | Accuracy (C) 87.5 | 22
Retrieval | 2WikiMultiHopQA v1 (test) | R@2E 85 | 21
Question Answering | 2WikiMultihopQA | LLM-Acc 89.7 | 20
End-to-end Question Answering | 2WikiMultiHopQA (test val) | EM 35.44 | 20
Knowledge-Intensive Reasoning | 2wikiMultiHopQA | F1 Score 76.1 | 18
Knowledge-Intensive Reasoning | 2WikiMultiHopQA | Accuracy 48.8 | 18
Multi-hop Question Answering | 2WikiMultiHopQA (dev test) | F1 Score 81.5 | 17
Multi-hop Question Answering | 2WikiMultiHopQA (2WikiMQA) (official evaluation) | Exact Match (EM) 31.8 | 17
Multi-Hop Question Answering | 2WikiMultiHopQA out-of-domain (val test) | Exact Match (EM) 51.7 | 15
Agentic Search | 2WikiMultiHopQA | String-F1 69.9 | 14
Multi-hop Question Answering | 2WikiMultiHopQA in-domain (test) | Accuracy (Response) 69.8 | 14
Question Answering | 2WikiMultiHopQA December 2018 Wikipedia dump (test) | EM 28.6 | 14
Showing 25 of 55 rows