
Downstream Evaluation Suite

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Language Modeling and Zero-shot Multiple-Choice Reasoning | Downstream Evaluation Suite Zero-shot (FW-Edu, Wiki., LAMBADA, PIQA, HellaSwag, WinoGrande, ARC, SIQA, SciQ) (val) | FW-Edu Perplexity: 10.52 | 9
Language Modeling | Downstream Evaluation Suite (ARC-C, Hellaswag, PIQA, SciQ, Winogrande, SocialIQA, RACE) zero-shot (test) | ARC-C Accuracy: 48.7 | 9
Zero-shot Downstream Task Evaluation | Downstream Evaluation Suite (Arc-e, PIQA, Hellaswag, OpenBookQA, Winogrande, MMLU, BoolQ) | Arc-e: 53.83 | 4