| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Large Language Model Reasoning | 3 LLM Tasks (CMMLU, GSM8K, HumanEval) (test) | Average Accuracy: 40.4 | 7 |
| Large Language Modeling | 3 LLM Tasks Aggregate LLaMa2 (average) | Accuracy: 0.405 | 6 |
| Language Modeling Evaluation | Eight benchmark LLM tasks | Throughput (Tokens/s): 49,781.23 | 5 |