| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Model Merging | Average of 8 benchmarks | Average Accuracy: 52.79 | 72 |
| Best-of-N Reranking | Average of 7 benchmarks (including AIME24, LeetCode) (test) | Average Accuracy: 52 | 42 |
| Knowledge Assessment and Commonsense Reasoning | Average of 8 Benchmarks (ARC-C, ARC-E, BoolQ, HellaS, LamOp, Piqa, WinoG, MMLU) | Accuracy: 72.99 | 10 |