
lm-eval-harness

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot performance evaluation | LM Eval Harness (HellaSwag, BoolQ, WinoGrande, PiQA, ARC-easy, ARC-challenge), zero-shot | Mean Accuracy: 75.46 | 60 |
| Zero-shot Question Answering and Reasoning | LM-Eval-Harness Suite (PIQA, HellaSwag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU), zero-shot | PIQA: 80.7 | 32 |
| Question Answering and Commonsense Reasoning | lm-eval-harness (PIQA, COPA, OpenBookQA, Winogrande, SciQA, ARC-E, ARC-C) | PIQA Accuracy: 78.8 | 10 |
| General Language Model Reasoning | LM-Eval-Harness Hungarian | ARC (hu) Acc: 38.6 | 4 |
| Language Modeling Utility | LM Eval Harness | HellaSwag Accuracy: 0.48 | 3 |
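The "Mean Accuracy" metric in the first row is the unweighted average of the per-task accuracies across the six zero-shot tasks. A minimal sketch of that aggregation, using placeholder per-task scores (not the leaderboard's actual values):

```python
# Unweighted mean accuracy across lm-eval-harness zero-shot tasks.
# The per-task scores below are illustrative placeholders, not real results.
scores = {
    "hellaswag": 80.0,
    "boolq": 85.0,
    "winogrande": 72.0,
    "piqa": 81.0,
    "arc_easy": 78.0,
    "arc_challenge": 57.0,
}

mean_accuracy = sum(scores.values()) / len(scores)
print(f"Mean Accuracy: {mean_accuracy:.2f}")  # prints "Mean Accuracy: 75.50"
```

Note that each task contributes equally to the mean regardless of its dataset size, which is the usual convention when reporting a single aggregate score over an lm-eval-harness suite.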