
Reasoning Benchmarks


Task Name | Dataset Name | SOTA Result | Trend
Reasoning | Reasoning Benchmarks: GSM8K, MATH-500, AIME24, AIME25, GPQA-D | GSM8K Accuracy: 95.15 | 33
Common Sense Reasoning and Question Answering | Reasoning Benchmarks: Zero-shot (PIQA, ARC, HellaSwag, WinoGrande) | PIQA Accuracy: 82.75 | 31
General Reasoning Evaluation | Reasoning Benchmarks: Aggregate | Average Score: 70.63 | 24
Reasoning | 17 Reasoning Benchmarks: Aggregate (test) | Accuracy: 90.71 | 21
Zero-shot evaluation | Reasoning Benchmarks: Zero-shot (BoolQ, PIQA, HellaSwag, WinoGrande, ARC) | BoolQ Accuracy (Zero-shot): 71.1 | 20
Reasoning | Reasoning Benchmarks: Zero-shot | PIQA Accuracy: 80.79 | 16
Mathematical Reasoning | Reasoning Benchmarks: Overall | Delta Accuracy: 5.81 | 16
Reasoning | Reasoning Benchmarks: MATH, GSM8K, AQUA, GSM-H, MMLU, MMLU-P, GPQA, AIME | MATH Accuracy: 84.4 | 14
Reasoning | Reasoning Benchmarks: Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBenchv2 Qwen3 | Accuracy: 74.8 | 12
Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) | GSM8K Accuracy: 85.12 | 9
Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) (test) | GSM8K Accuracy: 87.23 | 9
Mathematical Reasoning | Reasoning Benchmarks: Average | Average Accuracy: 44.7 | 2
Multi-agent Reasoning | Reasoning Benchmarks: Cooperative AutoGen framework (test) | Overall Accuracy: 83.58 | 2
Multi-agent Reasoning | Reasoning Benchmarks: Competitive MAD framework (test) | Average Score: 0.8509 | 2