
Mathematical Reasoning

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | Mathematical Reasoning Aggregate | Average Score 46.63 | 18 |
| Offline OOD Detection | Mathematical Reasoning Far-shift OOD | AUROC 96.54 | 14 |
| Offline OOD Detection | Mathematical Reasoning Near-shift OOD | AUROC 98.76 | 14 |
| Mathematical Reasoning | Mathematical Reasoning benchmarks (AIME2024, AMC, MATH-500, Minerva, Olympiad) | AIME 2024 Score 33.5 | 12 |
| Mathematical reasoning | Mathematical Reasoning (MATH500, AIME25, OlympiadBench, AMC23) 2023/2025 (test) | MATH Score 86.2 | 12 |
| Mathematical Reasoning | Mathematical Reasoning In-Distribution various (test) | AIME 24 Score 33.4 | 12 |
| OOD Quality Estimation | Mathematical Reasoning OOD (Near-shift) | Kendall Tau 0.159 | 12 |
| OOD Quality Estimation | Mathematical Reasoning Far-shift OOD | Kendall's Tau 0.161 | 12 |
| Mathematical Reasoning | Mathematical Reasoning Average (AMC23, Minerva, Olympiad, Math500, AIME24, AIME25) | Acc@16 38.9 | 6 |
| Mathematical Reasoning | Mathematical Reasoning (OOD) | Algebra222 Accuracy 89.9 | 4 |
| Mathematical Reasoning | Mathematical Reasoning (in-distribution) | GSM8K Score 81.7 | 4 |
| Mathematical Reasoning | Mathematical Reasoning | CA 47.19 | 2 |