
Task Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot General Evaluation | Zero-shot Task Suite (HellaSwag, MathQA, MMLU, OpenBookQA, WinoGrande, GSM8K, HumanEval) | HellaSwag Accuracy: 82.72 | 31 |
| Common Sense Reasoning and Question Answering | Task Suite Zero-shot (ARC-e, ARC-c, HellaSwag, OBQA, WinoGrande, MathQA, PIQA) | ARC-e: 83.54 | 17 |