
NLP Evaluation Suite

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
General Language Understanding | NLP Evaluation Suite (SciQ, PIQA, WG, ARC, HellaSwag, LogiQA, BoolQ, LAMBADA) | SciQ Accuracy: 58.3 | 14
Language Model Evaluation | NLP Evaluation Suite (WG, PIQA, BoolQ, ARC-C, ARC-E, OBQA, HS, SciQ, LM, RTE) | WG: 60.14 | 6