Reasoning Benchmark Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | Reasoning Benchmark Suite Aggregate | Average Score: 59.44 | 36 |
| General Reasoning | Reasoning Benchmark Suite (GSM8K, MATH500, GPQA, CSQA, AQuA, MMLU) | Average Accuracy: 86.61 | 7 |
| Mathematical and Science Reasoning | Reasoning Benchmark Suite (MATH500, GSM8K, AMC23, Minerva, MMLU, MMLU-Pro, GPQA) | MATH500 Score: 81.15 | 2 |