Multi-step Reasoning on GSM-Hard
[Chart: Accuracy by date (x-axis: Jun 1, 2026). Leading method: eMoT, 71.5 Accuracy.]
Evaluation Results
Method              Model           Date      Accuracy
eMoT                Qwen-32B        2026.06   71.5
BoT                 GPT-4           2026.06   62
PaL                 Codex (175B)    2026.06   61.2
ToT                 GPT-4           2026.06   24.4
Qwen-32B (Direct)   Qwen-32B        2026.06   15.9