LM Evaluation Harness

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Zero-shot Language Modeling | LM Evaluation Harness 0-shot | WG | 80.66 | 30 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc | 44.28 | 24 |
| Language Modeling | LM Evaluation Harness (LM Eval) (test) | WG (Winograd Schema) | 74.11 | 22 |
| Natural Language Understanding | LM Evaluation Harness | MMLU (CoT) | 72.76 | 19 |
| Downstream Task Evaluation | LM Evaluation Harness MMLU, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, GSM8K standard | MMLU | 65.8 | 16 |
| World Knowledge and Reading Comprehension | LM Evaluation Harness NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ | NQ Accuracy | 29.81 | 15 |
| Zero-shot Evaluation | lm-evaluation-harness (SciQ, ARC-E, ARC-C, LogiQA, OBQA, BoolQ, HellaSwag, PIQA, WinoGrande) zero-shot | SciQ Accuracy | 68.2 | 13 |
| Zero-shot Natural Language Understanding | LM-Evaluation-Harness ARC, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SciQ, ReCoRD, OBQA | ARC Challenge | 46.8 | 13 |
| Language Understanding and Reasoning | LM-Evaluation-Harness ARC-c, ARC-e, BoolQ, HellaSwag, MMLU, OBQA, PIQA, WG | ARC-c Accuracy | 58.4 | 12 |
| Language Modeling Evaluation | LM Evaluation Harness | ARC | 53.77 | 11 |
| Language Model Evaluation Suite | LM Evaluation Harness | Avg Accuracy | 66.6 | 8 |
| Natural Language Understanding | LM Evaluation Harness | WG Score | 57.2 | 5 |
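
The leaderboards above are all built on EleutherAI's lm-evaluation-harness. As a rough illustration of how a few of the listed tasks (e.g. ARC-Challenge, HellaSwag, Winogrande) can be run, here is a minimal sketch using the harness's Python API. The checkpoint, task names, and result keys are assumptions that may vary between harness versions, so this should be read as a sketch rather than the exact configuration behind any result in the table.

```python
# Minimal sketch: evaluating a model on a few of the tasks listed above
# with EleutherAI's lm-evaluation-harness (v0.4.x API assumed).
# The checkpoint and result keys are illustrative, not tied to any row above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # hypothetical checkpoint
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,                                   # zero-shot, as in several rows above
    batch_size=8,
)

# results["results"] maps each task name to its reported metrics (e.g. accuracy).
for task, metrics in results["results"].items():
    print(task, metrics)
```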