
Reasoning Tasks

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Zero-shot Reasoning | Reasoning Tasks (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) Zero-shot | BoolQ Accuracy (Zero-shot): 82.813 | 55
Zero-shot Reasoning | Zero-Shot Reasoning Tasks (ARC-C, ARC-E, BoolQ, Hella, OBQA, PIQA, SIQA, Wino) | ARC-C Accuracy: 65.53 | 54
Reasoning | Reasoning Tasks Average | Average Score: 68.6 | 32
Single-turn Reasoning | Reasoning Tasks AIME24, AIME25, GPQA | AIME 2024 Accuracy: 92.2 | 18
Zero-shot Evaluation | Reasoning tasks | Reasoning Accuracy: 70.7 | 7
Model Ranking Prediction | Reasoning Tasks Aggregate | Spearman Rho: 0.81 | 6
Reasoning Chain Optimization | Reasoning Tasks | Query Count: 47 | 3