| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Large Language Model Evaluation | HuggingFace Open LLM Leaderboard | GSM8K: 55.37 | 49 |
| Large Language Model Evaluation | HuggingFace Open LLM Leaderboard lm-eval-harness default (various) | HellaSwag: 84.34 | 36 |
| General language understanding and reasoning | Huggingface Open LLM Leaderboard | HellaSwag Accuracy: 85.32 | 30 |
| LLM Evaluation | HuggingFace Open LLM Leaderboard Old (test) | GSM8K Score: 92.08 | 14 |
| General Language Understanding | HuggingFace Open LLM Leaderboard New | BBH: 68.84 | 7 |