Multi-Hop Question Answering on Musique out-of-domain (val test)
[Chart: Exact Match (EM) over time. Best result: Search-R2, 25.4 EM, Feb 3, 2026.]
Evaluation Results
| Method | Backbone | Date | Exact Match (EM) |
|---|---|---|---|
| Search-R2 | Qwen2.5-32B | 2026.02 | 25.4 |
| Search-R1 | Qwen2.5-32B | 2026.02 | 22.1 |
| Search-R2 | Qwen3-8B | 2026.02 | 17.2 |
| Search-R1 | Qwen3-8B | 2026.02 | 15.7 |
| Search-R2 | Qwen2.5-7B | 2026.02 | 15.1 |
| Search-R1 | Qwen2.5-7B | 2026.02 | 12.5 |
| Rejection Sampling | Qwen2.5-7B | 2026.02 | 12.3 |
| R1-base | Qwen2.5-7B | 2026.02 | 8.3 |
| IRCoT | Qwen2.5-7B | 2026.02 | 7.2 |
| R1-instruct | Qwen2.5-7B | 2026.02 | 7.2 |
| SFT | Qwen2.5-7B | 2026.02 | 6.6 |
| Search-o1 | Qwen2.5-7B | 2026.02 | 5.8 |
| RAG | Qwen2.5-7B | 2026.02 | 5.8 |
| Direct Inference | Qwen2.5-7B | 2026.02 | 3.1 |
| CoT | Qwen2.5-7B | 2026.02 | 2.2 |