
lm-eval-harness

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot performance evaluation | LM Eval Harness (HellaSwag, BoolQ, WinoGrande, PiQA, ARC-easy, ARC-challenge) zero-shot | Mean Accuracy: 75.46 | 60 |
| Zero-shot Question Answering and Reasoning | LM-Eval-Harness Suite (PIQA, HellaSwag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU) zero-shot | PIQA: 80.7 | 32 |
| Zero-shot Evaluation | LM-Eval-Harness | LAMBADA Perplexity (PPL): 17.8 | 10 |
| Multiple Choice Question Answering | LM-Eval-Harness MMLU, ARC-Easy, HellaSwag, PIQA, OpenBookQA, WinoGrande | MMLU Accuracy: 24.3 | 10 |
| Question Answering and Commonsense Reasoning | lm-eval-harness PIQA, COPA, OpenBookQA, Winogrande, SciQA, ARC-E, ARC-C | PIQA Accuracy: 78.8 | 10 |
| Language Model Evaluation | lm-eval-harness (test) | MMLU: 74.22 | 9 |
| General Language Model Reasoning | LM-Eval-Harness Hungarian | Arc (hu) Acc: 38.6 | 4 |
| Language Modeling Utility | LM Eval Harness | HellaSwag Accuracy: 0.48 | 3 |