
Math Reasoning

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | Math Reasoning Overall | Mean@16: 63.8 | 24 |
| Mathematical Reasoning | Math Reasoning AIME24, AIME25, HMMT25 | AIME24 Score: 78.4 | 24 |
| Preference Modeling | Math Reasoning | Accuracy: 87.6 | 20 |
| Math Reasoning | Math Reasoning Long Q, Long A (test) | Pass@1: 0.65 | 15 |
| Mathematical Reasoning | Math Reasoning Out-domain (SVAMP, Mathematics, SimulEq) (test) | SVAMP Accuracy: 79.6 | 8 |
| Mathematical Reasoning | Math Reasoning In-domain (GSM8K, MATH, NumGLUE) (test) | GSM8K Accuracy: 69.1 | 8 |
| Math Reasoning | Overall Average Math Reasoning | Pass@1: 54.54 | 6 |
| Math Reasoning | Math Reasoning Aggregate | Avg@32: 40.08 | 6 |
| Preference Classification | Math Reasoning (test) | Classification Accuracy: 85.4 | 4 |
| Math Reasoning | Math Reasoning 1.5B model (val) | Validation Accuracy: 69.4 | 3 |