| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Performance | Aggregated Benchmarks | Overall Average: 49.76 | 22 |
| Quantization Performance Summary | Aggregated Benchmarks (HellaSwag, MMLU, Arc-C, MATH-500) | Average Score: 1.014 | 22 |
| Meta-Evaluation | Aggregated Benchmarks (AIME, ARC, GSM8K, HE, MMLU, IT, RU, BFCL) | Overall Average Rank: 2.4 | 16 |
| General Multimodal Understanding | Aggregated Benchmarks | Average Score: 71 | 13 |
| Summary Evaluation | Aggregated Benchmarks | Average Score (3-item avg): 42.2 | 12 |
| Reward Modeling | Aggregated Benchmarks (Macro) | Average Score (excl. MM-RB, VL-RB): 74.74 | 12 |
| General Language Evaluation | Aggregated Benchmarks | Average Score: 0.7449 | 10 |
| Overall Language Model Evaluation | Aggregated Benchmarks (STEM, Code, IF, General) | Average Score: 61.7 | 7 |
| General Multitask Evaluation | Aggregated Benchmarks (Math500, GPQA, HumanEval, MBPP, AE2 LC) | Average Score: 40.7 | 5 |