
LM Evaluation Harness

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Language Modeling | LM Evaluation Harness (0-shot) | WG: 80.66 | 30 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc: 44.28 | 24 |
| Language Modeling | LM Evaluation Harness (LM Eval) (test) | WG (Winograd Schema): 74.11 | 22 |
| Natural Language Understanding | LM Evaluation Harness | MMLU (CoT): 72.76 | 19 |
| Downstream Task Evaluation | LM Evaluation Harness (MMLU, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, GSM8K; standard) | MMLU: 65.8 | 16 |
| World Knowledge and Reading Comprehension | LM Evaluation Harness (NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ) | NQ Accuracy: 29.81 | 15 |
| Zero-shot Evaluation | lm-evaluation-harness (SciQ, ARC-E, ARC-C, LogiQA, OBQA, BoolQ, HellaSwag, PIQA, WinoGrande; zero-shot) | SciQ Accuracy: 68.2 | 13 |
| Zero-shot Natural Language Understanding | LM-Evaluation-Harness (ARC, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SciQ, ReCoRD, OBQA) | ARC Challenge: 46.8 | 13 |
| Language Understanding and Reasoning | LM-Evaluation-Harness (ARC-c, ARC-e, BoolQ, HellaS., MMLU, OBQA, PIQA, WG) | ARC-c Accuracy: 58.4 | 12 |
| Language Model Evaluation Suite | LM Evaluation Harness | Avg Accuracy: 66.6 | 8 |
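Several rows above report a single aggregate score across many tasks (e.g. "Avg Accuracy: 66.6"). A minimal sketch of how such an aggregate is commonly computed, assuming an unweighted macro-average over per-task accuracies (the task names and scores below are hypothetical, not taken from the table):

```python
# Hypothetical per-task accuracies (%) from an lm-evaluation-harness run.
scores = {
    "arc_challenge": 58.4,
    "boolq": 74.1,
    "hellaswag": 66.2,
    "winogrande": 72.8,
}

# Assumption: the suite-level "Avg Accuracy" is a simple unweighted mean
# across tasks, with no per-task sample-size weighting.
avg_accuracy = sum(scores.values()) / len(scores)
print(round(avg_accuracy, 2))  # → 67.88
```

Note that an unweighted mean treats every task equally regardless of dataset size, so a small task like SciQ moves the aggregate as much as a large one like MMLU.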