
Reasoning Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) (val test) | PIQA: 81.77 | 119 |
| Commonsense Reasoning | Reasoning Suite Zero-shot Aggregate | Aggregate Score: 73.2 | 45 |
| Reasoning | Reasoning Suite Average | Accuracy: 72.8 | 36 |
| Zero-shot Accuracy | Reasoning Suite Zero-shot (PIQA, Hella Swag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU) | PIQA: 80.7 | 21 |
| Reasoning | Reasoning Suite | GSM8K: 85.12 | 9 |
| Zero-shot Reasoning | Reasoning Suite (ARC-e, ARC-c, HellaSwag, PIQA, Winogrande) zero-shot | ARC-e Accuracy: 0.7559 | 8 |
| Reasoning | Reasoning Suite BBH, GPQA, MuSR | BBH: 83.4 | 7 |
| Logical and Commonsense Reasoning | Reasoning Suite | BIG-Bench Hard: 89.36 | 4 |