
Reasoning Benchmarks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reasoning | Reasoning Benchmarks BBH, MMLU, ARC-C, ThmQA (test) | BBH 64.66 | 66 |
| Reasoning | Reasoning Benchmarks ARC-e, ARC-c, BoolQ, PIQA, SIQA, HellaS., OBQA, Wino. | ARC-e Accuracy 72.6 | 38 |
| Reasoning | 15 reasoning benchmarks weighted mean (test) | Accuracy 81.35 | 36 |
| Reasoning | Reasoning Benchmarks GSM8K, MATH-500, AIME24, AIME25, GPQA-D | GSM8K Accuracy 95.15 | 33 |
| Common Sense Reasoning and Question Answering | Reasoning Benchmarks Zero-shot (PIQA, ARC, HellaSwag, WinoGrande) | PIQA Accuracy 82.75 | 31 |
| Reasoning | Reasoning Benchmarks Zero-shot | Overall Zero-Shot Accuracy 69.99 | 26 |
| General Reasoning Evaluation | Reasoning Benchmarks Aggregate | Average Score 70.63 | 24 |
| Reasoning | Reasoning Benchmarks GPQA-Diamond AIME2024 MATH500 HumanEval | Average Score 85.77 | 21 |
| Reasoning | 17 Reasoning Benchmarks Aggregate (test) | Accuracy 90.71 | 21 |
| Zero-shot evaluation | Reasoning Benchmarks Zero-shot (BoolQ, PIQA, HellaSwag, WinoGrande, ARC) | BoolQ Accuracy (Zero-shot) 71.1 | 20 |
| Mathematical Reasoning | Reasoning Benchmarks Overall | Delta Accuracy 5.81 | 16 |
| Reasoning | Reasoning Benchmarks MATH, GSM8K, AQUA, GSM-H, MMLU, MMLU-P, GPQA, AIME | MATH Accuracy 84.4 | 14 |
| Mathematical Reasoning | Reasoning Benchmarks Average | Average Accuracy 44.7 | 12 |
| Reasoning | Reasoning Benchmarks Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBenchv2 Qwen3 | Accuracy 74.8 | 12 |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) | GSM8K Accuracy 85.12 | 9 |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) (test) | GSM8K Accuracy 87.23 | 9 |
| Zero-shot Common-sense Reasoning | Reasoning Benchmarks Zero-shot (ARC-e, ARC-c, BoolQ, PIQA, SIQA, HellaSwag, OBQA, WinoGrande) | ARC-e Accuracy 74.2 | 8 |
| Zero-shot Reasoning | Reasoning Benchmarks Zero-shot (ARC-C, ARC-E, HellaSwag, LAMBADA, OpenBookQA, PIQA, WinoGrande) | ARC-C Accuracy 36.9 | 6 |
| Mathematical Reasoning | 8 reasoning benchmarks (including GSM8K, MATH500, AIME 24, AIME 25, and OlympiadBench) (test) | Token Savings 53.3 | 5 |
| Mathematical Reasoning | Reasoning Benchmarks (AIME24, AIME25, AMC, MATH, Minerva, Olympiad) (test) | AIME24 Accuracy 28.33 | 4 |
| Multi-agent Reasoning | Reasoning Benchmarks Cooperative AutoGen framework (test) | Overall Accuracy 83.58 | 2 |
| Multi-agent Reasoning | Reasoning Benchmarks Competitive MAD framework (test) | Average Score 0.8509 | 2 |