
LLM Tasks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Large Language Model Reasoning | 3 LLM Tasks (CMMLU, GSM8K, HumanEval) (test) | Average Accuracy: 40.4 | 7 |
| Large Language Modeling | 3 LLM Tasks Aggregate, LLaMa2 (average) | Accuracy: 0.405 | 6 |
| Language Modeling Evaluation | Eight benchmark LLM tasks | Throughput (Tokens/s): 49,781.23 | 5 |