Reasoning Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) (val test) | Average Accuracy: 76.55 | 297 |
| Zero-shot Reasoning | Reasoning Suite (ARC-e, ARC-c, HellaSwag, PIQA, Winogrande) zero-shot | Average Reasoning Score: 6,540 | 107 |
| Commonsense Reasoning | Reasoning Suite Zero-shot Aggregate | Aggregate Score: 73.2 | 50 |
| Reasoning | Reasoning Suite Average | Accuracy: 74.8 | 45 |
| Reasoning and Language Modeling | Reasoning Suite (ARC, HellaSwag, PIQA, WinoGrande, MMLU, OpenBookQA, Real-world QA) Zero-shot Llama-3.1-8B-Instruct with Alpaca calibration | PPL: 9.63 | 32 |
| Zero-shot Language Understanding | Reasoning Suite Zero-shot (BoolQ, WinoG., PIQA, OBQA, HellaS., ARC-e, ARC-c) | BoolQ Accuracy: 82.63 | 24 |
| Zero-shot Accuracy | Reasoning Suite Zero-shot (PIQA, Hella Swag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU) | PIQA: 80.7 | 21 |
| Zero-shot Reasoning | Reasoning Suite PiQA, LAMBDA, ARC, HellaSwag | PiQA Score: 62.69 | 20 |
| Question Answering | Reasoning Suite Zero-shot (ArcC, ArcE, PiQA, Wino) | Arc Challenge (C) Accuracy: 50.43 | 16 |
| Zero-shot Learning | Reasoning Suite Zero-shot (ARC-e, ARC-c, WG, BQ, PIQA, HS, OBQA, HQA) | ARC-e Accuracy: 49.7 | 9 |
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, ARC, HS, WG, BoolQ, MMLU) | PIQA Accuracy: 80.2 | 9 |
| Zero-shot Commonsense Reasoning | Reasoning Suite Zero-shot (ARC-E, BoolQ, HSwag, LAMBADA, OBQA, PIQA, SocIQA, WinoGr.) | ARC-E Accuracy: 45.5 | 9 |
| Reasoning | Reasoning Suite | GSM8K: 85.12 | 9 |
| Reasoning | Reasoning Suite BBH, GPQA, MuSR | BBH: 83.4 | 7 |
| Reasoning | Reasoning Suite (MMLU-Pro, GPQA Diamond, AIME-24, AIME-25) zero-shot | MMLU-Pro Accuracy: 74.8 | 6 |
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, ARCe, ARCc, BoolQ, Hella., Wino.) | PIQA Accuracy: 79.05 | 4 |
| Logical and Commonsense Reasoning | Reasoning Suite | BIG-Bench Hard: 89.36 | 4 |