Explorative Reasoning on Game of 24 (test)
[Chart: Accuracy (labeled point: RTR, 80, Mar 6, 2026). Metric options: Accuracy, Average Output Tokens.]
Updated 1mo ago
Evaluation Results
Method     Backbone            Date      Accuracy   Average Output Tokens
RTR        {Qwen3-4B, 8B...    2026.03   80         4,623
RouteGoT   {Qwen3-4B, 8B...    2026.03   80         3,648
RouteLLM   {Qwen3-4B, 8B...    2026.03   79         4,849
AGoT       Qwen3-30B           2026.03   74         18,406
GoT*       Qwen3-30B           2026.03   72         17,396
KNN        {Qwen3-4B, 8B...    2026.03   71         4,542
Random     {Qwen3-4B, 8B...    2026.03   62         7,899
CoT        Qwen3-30B           2026.03   58         1,948
EmbedLLM   {Qwen3-4B, 8B...    2026.03   58         9,668
IO         Qwen3-30B           2026.03   27         380
ToT        Qwen3-30B           2026.03   20         2,855