Multi-Hop Question Answering on 2WikiMultiHopQA out-of-domain (val test)
Top result: 51.7 Exact Match (EM), Search-R2 (Feb 3, 2026)
Updated 4d ago
Evaluation Results
Method              Backbone      Date     Exact Match (EM)
Search-R2           Qwen2.5-32B   2026.02  51.7
Search-R1           Qwen2.5-32B   2026.02  46.2
Search-R2           Qwen3-8B      2026.02  40.5
Search-R2           Qwen2.5-7B    2026.02  35.8
Search-R1           Qwen3-8B      2026.02  35.5
Search-R1           Qwen2.5-7B    2026.02  29.7
Rejection Sampling  Qwen2.5-7B    2026.02  29.6
R1-instruct         Qwen2.5-7B    2026.02  29.2
R1-base             Qwen2.5-7B    2026.02  27.3
SFT                 Qwen2.5-7B    2026.02  25.9
Direct Inference    Qwen2.5-7B    2026.02  25
RAG                 Qwen2.5-7B    2026.02  23.5
Search-o1           Qwen2.5-7B    2026.02  17.6
IRCoT               Qwen2.5-7B    2026.02  14.9
CoT                 Qwen2.5-7B    2026.02  11.1