Mathematical Reasoning on AIME 2024 (Accuracy, Length, and Length-Accuracy Metrics)
[Chart: Accuracy by date. Highlighted point: Effi. Reasoning, 36.7 accuracy, Jun 9, 2025. Selectable metrics: Accuracy, Average Output Length, Length-Weighted Accuracy.]
Updated 12d ago
Evaluation Results
| Method | Configuration | Date | Accuracy | Average Output Length | Length-Weighted Accuracy |
|---|---|---|---|---|---|
| Effi. Reasoning | Base Model=DeepSeek-R1... | 2025.06 | 36.7 | 5,771 | 19.9 |
| DAST | Base Model=DeepSeek-R1... | 2025.06 | 36.7 | 5,400 | 21.4 |
| kimi-k1.5 | Base Model=DeepSeek-R1... | 2025.06 | 33.3 | 5,159 | 20.3 |
| Bingo-A | Base Model=DeepSeek-R1... | 2025.06 | 33.3 | 2,943 | 26.7 |
| Bingo-E | Base Model=DeepSeek-R1... | 2025.06 | 33.3 | 2,943 | 26.7 |
| Demystifying | Base Model=DeepSeek-R1... | 2025.06 | 30 | 6,183 | 14.9 |
| Vanilla PPO | Base Model=DeepSeek-R1... | 2025.06 | 26.7 | 6,961 | 10.3 |
| O1-Pruner | Base Model=DeepSeek-R1... | 2025.06 | 26.7 | 5,958 | 13.9 |
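This page does not define Length-Weighted Accuracy. As a sanity check on how the three reported columns relate, the sketch below tests one candidate formula, accuracy × sqrt(1 − length / L_max), against the reported values. Both the formula and the cap `L_MAX = 8192` are assumptions inferred from the numbers on this page, not definitions taken from the source, and the function names are hypothetical. The candidate happens to reproduce every reported value to within rounding, but that agreement does not confirm it is the metric's actual definition.

```python
import math

# ASSUMPTION: maximum length cap inferred from the reported numbers,
# not stated anywhere on the page.
L_MAX = 8192

# (method, accuracy, average output length, reported length-weighted accuracy)
# Values copied from the Evaluation Results on this page.
ROWS = [
    ("Effi. Reasoning", 36.7, 5771, 19.9),
    ("DAST", 36.7, 5400, 21.4),
    ("kimi-k1.5", 33.3, 5159, 20.3),
    ("Bingo-A", 33.3, 2943, 26.7),
    ("Bingo-E", 33.3, 2943, 26.7),
    ("Demystifying", 30.0, 6183, 14.9),
    ("Vanilla PPO", 26.7, 6961, 10.3),
    ("O1-Pruner", 26.7, 5958, 13.9),
]


def length_weighted_accuracy(accuracy, avg_length, l_max=L_MAX):
    """Candidate (assumed) formula: accuracy * sqrt(1 - avg_length / l_max)."""
    remaining = max(0.0, 1.0 - avg_length / l_max)  # clamp if length exceeds cap
    return accuracy * math.sqrt(remaining)


def max_abs_error(rows=ROWS):
    """Largest gap between the candidate formula and the reported column."""
    return max(
        abs(length_weighted_accuracy(acc, length) - reported)
        for _, acc, length, reported in rows
    )


if __name__ == "__main__":
    for name, acc, length, reported in ROWS:
        estimate = length_weighted_accuracy(acc, length)
        print(f"{name:16s} reported={reported:5.1f} candidate={estimate:6.2f}")
    print(f"max abs error: {max_abs_error():.3f}")
```

Under this assumed formula, shorter outputs are rewarded at equal accuracy, which is consistent with Bingo-A and Bingo-E leading the Length-Weighted Accuracy column despite trailing Effi. Reasoning and DAST on raw Accuracy.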