Mathematical Reasoning on GSM8K (test) (Acc, Tok, Lat)
[Chart: Accuracy (GSM8K). Top result: Think, 95.3. Date shown: Feb 7, 2026. Available metrics: Accuracy (GSM8K), Tokens Used (GSM8K), Latency (GSM8K).]
Updated 25d ago
Evaluation Results
| Method | Base Model | Date | Accuracy (GSM8K) | Tokens Used (GSM8K) | Latency (GSM8K) |
|---|---|---|---|---|---|
| Think | Qwen3-4B-Th... | 2026.02 | 95.3 | 1,414 | 155.6 |
| NoThink* | Qwen3-4B-Th... | 2026.02 | 95.2 | 1,631 | 204.2 |
| EAGLE3 | Qwen3-4B-Th... | 2026.02 | 94.8 | 1,408 | 140.3 |
| DEER | Qwen3-4B-Th... | 2026.02 | 94.3 | 960 | 230.3 |
| SpecExit | Qwen3-4B-Th... | 2026.02 | 93.8 | 649 | 75.8 |
| EAGLE3 | DeepSeek-R1... | 2026.02 | 79.3 | 976 | 276.9 |
| Vanilla | DeepSeek-R1... | 2026.02 | 76.4 | 1,008 | 629.4 |
| SpecExit | DeepSeek-R1... | 2026.02 | 75.3 | 333 | 112.6 |
| DEER | DeepSeek-R1... | 2026.02 | 74.7 | 710 | 484.8 |
| NoThink | DeepSeek-R1... | 2026.02 | 54.6 | 233 | 22.2 |