Reasoning Benchmarks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| General Reasoning Evaluation | Reasoning Benchmarks Aggregate | Average Score: 70.63 | 24 |
| Reasoning | 17 Reasoning Benchmarks Aggregate (test) | Accuracy: 90.71 | 21 |
| Mathematical Reasoning | Reasoning Benchmarks Overall | Delta Accuracy: 5.81 | 16 |
| Reasoning | Reasoning Benchmarks Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBenchv2 Qwen3 | Accuracy: 74.8 | 12 |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) | GSM8K Accuracy: 85.12 | 9 |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) (test) | GSM8K Accuracy: 87.23 | 9 |
| Multi-agent Reasoning | Reasoning Benchmarks Cooperative AutoGen framework (test) | Overall Accuracy: 83.58 | 2 |
| Multi-agent Reasoning | Reasoning Benchmarks Competitive MAD framework (test) | Average Score: 0.8509 | 2 |