
NLP Evaluation Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot Language Evaluation | Zero-shot NLP Evaluation Suite (WikiText2, BoolQ, PIQA, HellaSwag, WinoGrande, ARC, OBQA, MTQA) (test) | WikiText2 Perplexity: 7.43 | 27 |
| General Language Understanding | NLP Evaluation Suite (SciQ, PIQA, WG, ARC, HellaSwag, LogiQA, BoolQ, LAMBADA) | SciQ Accuracy: 58.3 | 14 |
| Language Model Evaluation | NLP Evaluation Suite (WG, PIQA, BoolQ, ARC-C, ARC-E, OBQA, HS, SciQ, LM, RTE) | WG: 60.14 | 6 |