| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Image Captioning Evaluation | Composite | Kendall-c Tau_c | 66 | 92 |
| Property Prediction | Composite | RMSE (Yield) | 139.532 | 24 |
| Caption-level correlation with human judgment | Composite (test) | Kendall's Tau | 0.6 | 21 |
| Correlation with human judgments | Composite (test) | Kendall's Tau-c | 57.6 | 18 |
| Image Captioning Evaluation | COMPOSITE (COM) (test) | Kendall's tau-c | 64.2 | 17 |
| Correlation with human judgment | Composite 1 (test) | Kendall Tau-c | 57.3 | 15 |
| Agent & Alignment | Composite IFEval-strict-prompt, BFCL v3, CodeIF-Bench, Nexus FC | IFEval Strict Prompt Score | 86.9 | 4 |
| Math | Composite (GSM8K, MATH, OlympiadBench, AIME 2025, HARDMath2, Omni-MATH, GSM-Plus, CMATH) | GSM8K | 94.62 | 4 |
| Coding | Composite CRUXEval-O, MBPP, MBPP+, MultiPL-E, HumanEval, HumanEval+, HumanEvalFix, HumanEval-cn, BigCodeBench-Full, LiveCodeBench, Aider, BIRD-SQL, Spider | CRUXEval-O Score | 76.12 | 4 |
| Reasoning | Composite (BIG-Bench Hard, BIG-Bench Extra Hard, bbh-zh, MuSR, ZebraLogic, PrOntoQA, PIQA, OCNLI, HellaSwag, KOR-Bench, DROP, SQuAD 2.0) | BBH | 83.7 | 4 |
| Knowledge Evaluation | Composite (MMLU, MMLU-Pro, CMMLU, C-EVAL, GAOKAO-Bench, ARC-c, GPQA, SciBench, PHYBench, TriviaQA) | Overall Average Score | 65.77 | 4 |