
NLP Benchmark Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Question Answering and Commonsense Reasoning | NLP Benchmark Suite Zero-shot (HellaSwag, RACE, PIQA, WinoGrande, ARC, OBQA) (test) | HellaSwag Accuracy: 63.36 | 28 |
| Language Modeling | NLP Benchmark Suite Aggregate | Average Delta: -9.2 | 16 |
| Aggregate NLP Evaluation | NLP Benchmark Suite Average | Average Accuracy: 64 | 9 |