Reasoning Evaluation Suite

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Reasoning | Reasoning Evaluation Suite Math, Symbolic, and Commonsense (test) | Math Accuracy: 80.8 | 33
Reasoning | Reasoning Evaluation Suite AIME 2024, GSM8k, MATH 500, GPQA | AIME 2024 Score: 60 | 32
Reasoning | Reasoning Evaluation Suite (MATH, GSM8K, AQUA, GSM-H, MMLU, MMLU-P, AIME) (test) | MATH Score: 52.4 | 8