
LM Eval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot downstream task evaluation | LM-EVAL (average of HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande), zero-shot, latest | Average Accuracy: 76 | 30 |
| Question Answering and Commonsense Reasoning | LM Eval ARCC, ARCE, HellaSwag, PIQA, 0.4.4 standard (test) | ARCC: 61.6 | 18 |
| Commonsense Reasoning and Knowledge | lm-eval ARC-C, BoolQ, Lambada, PIQA, Winogrande | ARC-C Accuracy: 53.58 | 8 |
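The "Average Accuracy" figure in the first row is presumably the unweighted (macro) mean of the zero-shot accuracies on the five listed lm-eval tasks. The sketch below illustrates that aggregation; the per-task scores are hypothetical placeholders chosen only so the mean lands on 76, not reported results.

```python
# Hypothetical per-task zero-shot accuracies (percent) for the five
# tasks averaged in the first table row. These values are placeholders,
# not actual leaderboard numbers.
task_accuracy = {
    "hellaswag": 80.0,      # hypothetical
    "piqa": 82.0,           # hypothetical
    "arc_easy": 85.0,       # hypothetical
    "arc_challenge": 60.0,  # hypothetical
    "winogrande": 73.0,     # hypothetical
}

# Unweighted macro average across tasks.
average_accuracy = sum(task_accuracy.values()) / len(task_accuracy)
print(f"Average Accuracy: {average_accuracy:.1f}")  # -> 76.0
```

In the EleutherAI lm-evaluation-harness, these five tasks would typically be run with task names like `hellaswag,piqa,arc_easy,arc_challenge,winogrande` at `num_fewshot=0`, though exact task identifiers vary by harness version.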