Multi-hop Question Answering on HotpotQA (official evaluation)
[Chart: EM Accuracy; best result 33.2 (M3PO), Dec 1, 2025]
Evaluation Results
Method      Setting                      Date      EM Accuracy
M3PO        Backbone=Qwen2.5-3B-In...    2025.12   33.2
HRPO        Backbone=Qwen2.5-3B-In...    2025.12   31.6
GRPO        Backbone=Qwen2.5-3B-In...    2025.12   30.8
PPO         Backbone=Qwen2.5-3B-In...    2025.12   30.4
RAG         Backbone=Qwen2.5-7B-In...    2025.12   29.9
M3PO        Backbone=Qwen2.5-1.5B-...    2025.12   28.7
HRPO        Backbone=Qwen2.5-1.5B-...    2025.12   27.3
PPO         Backbone=Qwen2.5-1.5B-...    2025.12   25.6
RAG         Backbone=Qwen2.5-3B-In...    2025.12   25.5
RAG         Backbone=Qwen2.5-1.5B-...    2025.12   22.8
GRPO        Backbone=Qwen2.5-1.5B-...    2025.12   20.2
Search-o1   Backbone=Qwen2.5-7B-In...    2025.12   18.7
SFT         Backbone=Qwen2.5-3B-In...    2025.12   18.6
QA          Backbone=Qwen2.5-7B-In...    2025.12   18.3
IRCoT       Backbone=Qwen2.5-7B-In...    2025.12   13.3
SFT         Backbone=Qwen2.5-1.5B-...    2025.12   12.9
CoT         Backbone=Qwen2.5-7B-In...    2025.12   9.2