Multi-hop Question Answering on HotpotQA (official evaluation)
[Chart: EM Accuracy. Top entry: M3PO, 33.2 (Dec 1, 2025).]
Evaluation Results
Method      Backbone             Date      EM Accuracy
M3PO        Qwen2.5-3B-In...     2025.12   33.2
HRPO        Qwen2.5-3B-In...     2025.12   31.6
GRPO        Qwen2.5-3B-In...     2025.12   30.8
PPO         Qwen2.5-3B-In...     2025.12   30.4
RAG         Qwen2.5-7B-In...     2025.12   29.9
M3PO        Qwen2.5-1.5B-...     2025.12   28.7
HRPO        Qwen2.5-1.5B-...     2025.12   27.3
PPO         Qwen2.5-1.5B-...     2025.12   25.6
RAG         Qwen2.5-3B-In...     2025.12   25.5
RAG         Qwen2.5-1.5B-...     2025.12   22.8
GRPO        Qwen2.5-1.5B-...     2025.12   20.2
Search-o1   Qwen2.5-7B-In...     2025.12   18.7
SFT         Qwen2.5-3B-In...     2025.12   18.6
QA          Qwen2.5-7B-In...     2025.12   18.3
IRCoT       Qwen2.5-7B-In...     2025.12   13.3
SFT         Qwen2.5-1.5B-...     2025.12   12.9
CoT         Qwen2.5-7B-In...     2025.12   9.2