Multi-Hop Reasoning on MuSiQue (Accuracy and Search Efficiency)
[Chart: Accuracy over time. Highlighted point: BES, Accuracy 10.4, May 27, 2026. Selectable metrics: Accuracy, Valid Search Count, Valid Action Count, Finish Ratio.]
Evaluation Results
| Method | Backbone | Date | Accuracy | Valid Search Count | Valid Action Count | Finish Ratio |
|---|---|---|---|---|---|---|
| BES | Llama-3.1-8B-... | 2026.05 | 10.4 | 2.11 | 3.05 | 94 |
| Tree-GRPO | Llama-3.1-8B-... | 2026.05 | 7.4 | 0.65 | 1.36 | 71 |
| BES | Llama-3.2-3B-... | 2026.05 | 7 | 2.31 | 3.29 | 97 |
| Base model | Llama-3.1-8B-... | 2026.05 | 6.6 | - | - | - |
| GRPO | Llama-3.1-8B-... | 2026.05 | 5.6 | 1.46 | 1.83 | 37 |
| Base model | Llama-3.2-3B-... | 2026.05 | 4 | - | - | - |
| Tree-GRPO | Llama-3.2-3B-... | 2026.05 | 3.9 | 1.5 | 2.14 | 64 |
| GRPO | Llama-3.2-3B-... | 2026.05 | 2.1 | 0.84 | 0.2 | 64 |