Multi-Hop Question Answering on HotpotQA in-domain (val test)
[Chart: Exact Match (EM) over time. Top result: Search-R2, 49.9 EM, Feb 3, 2026.]
Evaluation Results
Method              Backbone      Date     Exact Match (EM)
Search-R2           Qwen2.5-32B   2026.02  49.9
Search-R1           Qwen2.5-32B   2026.02  43.3
Search-R2           Qwen3-8B      2026.02  41.2
Search-R2           Qwen2.5-7B    2026.02  39
Search-R1           Qwen3-8B      2026.02  37.2
Rejection Sampling  Qwen2.5-7B    2026.02  33.1
Search-R1           Qwen2.5-7B    2026.02  32.6
RAG                 Qwen2.5-7B    2026.02  29.9
R1-base             Qwen2.5-7B    2026.02  24.2
R1-instruct         Qwen2.5-7B    2026.02  23.7
SFT                 Qwen2.5-7B    2026.02  21.7
Search-o1           Qwen2.5-7B    2026.02  18.7
Direct Inference    Qwen2.5-7B    2026.02  18.3
IRCoT               Qwen2.5-7B    2026.02  13.3
CoT                 Qwen2.5-7B    2026.02  9.2
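
This page does not say how the Exact Match (EM) scores above were computed. As a rough illustration only, the sketch below assumes the normalization commonly used for HotpotQA-style evaluation (lowercasing, stripping punctuation and English articles, collapsing whitespace) and reports EM as a percentage. The function names `normalize_answer`, `exact_match`, and `em_score` are hypothetical, and the listed methods may have used a different procedure.

```python
import re
import string

# Assumption: SQuAD/HotpotQA-style answer normalization; the page itself
# does not specify the exact procedure behind the reported scores.
_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(gold)


def em_score(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    # Made-up toy answers for illustration, not benchmark data.
    preds = ["The Eiffel Tower.", "paris", "1990"]
    golds = ["eiffel tower", "London", "1990"]
    print(f"EM = {em_score(preds, golds):.1f}")
```

Under this assumption, a score such as 49.9 would mean that roughly half of the evaluated questions received an answer identical to the reference after normalization.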