
Reasoning Evaluation Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reasoning | Reasoning Evaluation Suite Math, Symbolic, and Commonsense (test) | Math Accuracy: 80.8 | 33 |
| Reasoning | Reasoning Evaluation Suite AIME 2024, GSM8k, MATH 500, GPQA | AIME 2024 Score: 60 | 32 |