| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Language Understanding and Reasoning | General Benchmarks (MMLU, HellaSwag, OBQA, WinoGrande, ARC-C, PiQA, SciQ, LogiQA) | MMLU Accuracy: 35.68 | 70 |
| General Multimodal Understanding | General Benchmarks | Average Score: 74 | 12 |
| General Language Modeling | General Benchmarks (Llama 3.1 8B) | Generation Quality Score: 66.5 | 11 |
| General Multimodal Reasoning | General Benchmarks | Top-1 Accuracy: 57.8 | 6 |
| Natural Language Understanding and Reasoning | General Benchmarks (Italian) | ARC-C-it: 37.47 | 6 |
| General Language Understanding | General Benchmarks (MMLU, AlpacaEval, Arena-Hard) | MMLU Accuracy: 73.41 | 4 |
| General Language Evaluation | 12 general benchmarks (Avg) | General Average Score: 68.24 | 3 |