NLP Benchmark Suite

Benchmarks

Task Name	Dataset Name	SOTA Result
Question Answering and Commonsense Reasoning	NLP Benchmark Suite Zero-shot (HellaSwag, RACE, PIQA, WinoGrande, ARC, OBQA) (test)	HellaSwag Accuracy: 63.36	28
Language Modeling	NLP Benchmark Suite Aggregate	Average Delta: -9.2	16
Aggregate NLP Evaluation	NLP Benchmark Suite Average	Average Accuracy: 64	9
