
LM Evaluation Harness

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Language Model Downstream Evaluation | LM Evaluation Harness zero-shot and five-shot | HellaSwag Acc: 72.52 | 30 |
| Zero-shot Language Modeling | LM Evaluation Harness 0-shot | WG: 80.66 | 30 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc: 44.28 | 24 |
| Language Modeling | LM Evaluation Harness (LM Eval) (test) | WG (Winograd Schema): 74.11 | 22 |
| Downstream task evaluation | LM Evaluation Harness MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, CommonsenseQA | MMLU: 70.5 | 19 |
| Natural Language Understanding | LM Evaluation Harness | MMLU (CoT): 72.76 | 19 |
| Zero-shot Evaluation | lm-evaluation-harness (SciQ, ARC-E, ARC-C, LogiQA, OBQA, BoolQ, HellaSwag, PIQA, WinoGrande) zero-shot | SciQ Accuracy: 69.4 | 19 |
| Zero-shot Language Understanding | LM Evaluation Harness Downstream Suite (HellaSwag, PIQA, WinoGrande, OpenBookQA, SIQA, BoolQ, TriviaQA, MMLU, ARC-Challenge, ARC-Easy, MathQA, SciQ) | HellaSwag Accuracy: 72.52 | 18 |
| Zero-shot Evaluation | LM Evaluation Harness PIQA, HellaSwag, COPA, RTE, OpenBookQA, LAMBADA-OpenAI | Average Score: 75.97 | 16 |
| Language Modeling Evaluation | LM Evaluation Harness | Accuracy: 60.35 | 16 |
| Downstream Task Evaluation | LM Evaluation Harness MMLU, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, GSM8K standard | MMLU: 65.8 | 16 |
| World Knowledge and Reading Comprehension | LM Evaluation Harness NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ | NQ Accuracy: 29.81 | 15 |
| Commonsense Reasoning | LM-Evaluation-Harness Commonsense Reasoning: LAMBADA, WikiText, ARC, HellaSwag, PIQA, WinoGrande, BoolQ, SciQ | LAMBADA Perplexity (PPL): 11.86 | 13 |
| Zero-shot Natural Language Understanding | LM-Evaluation-Harness ARC, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SciQ, Record, OBQA | ARC Challenge: 46.8 | 13 |
| Natural Language Understanding | lm-evaluation-harness suite (HellaSwag, RACE, PIQA, WinoGrande, ARC-e, ARC-c, OBQA) | HellaSwag: 57.18 | 12 |
| Language Understanding and Reasoning | LM-Evaluation-Harness ARC-c, ARC-e, BoolQ, HellaS., MMLU, OBQA, PIQA, WG | ARC-c Accuracy: 58.4 | 12 |
| Language Model Evaluation Suite | LM Evaluation Harness | Avg Accuracy: 66.6 | 8 |
| Natural Language Understanding | LM Evaluation Harness | WG Score: 57.2 | 5 |
| Downstream Evaluation | lm-evaluation-harness ARC-E, BoolQ, HellaSwag, OBQA, SciQ | ARC-E Accuracy: 0.381 | 4 |
| Zero-shot downstream evaluation | LM Evaluation Harness 0-shot v1.0.0 | HellaSwag Accuracy: 50.1 | 4 |