Multi-hop Question Answering on HotpotQA 200 held-out questions
[Chart: Accuracy over time. Top result: MAGE, 91.5 accuracy, May 11, 2026. Metrics shown: Accuracy, Delta to Strongest Baseline.]
Evaluation Results
Method       Configuration               Date      Accuracy   Delta to Strongest Baseline
MAGE         Backbone=Qwen3-8B, Jud...   2026.05   91.5       7
MAGE         Backbone=Qwen3-8B, Jud...   2026.05   88.5       7
0-shot CoT   Backbone=Qwen3-8B, Jud...   2026.05   84.5       -
SC10         Backbone=Qwen3-8B, Jud...   2026.05   80.5       -
8-shot       Backbone=Qwen3-8B, Jud...   2026.05   78.5       -
Reflexion    Backbone=Qwen3-8B, Jud...   2026.05   78.5       -
ReAct        Backbone=Qwen3-8B, Jud...   2026.05   70.5       -