| Dataset Name | SOTA Method | Metric | Value | Count | Last Updated |
|---|---|---|---|---|---|
| Aggregate Benchmarks | Gemini 2.5 Pro+ASP | Average Score | 93.9 | 37 | 21d ago |
| AGIEval | | Accuracy | 70.22 | 29 | 2mo ago |
| Average Across all benchmarks | BASTION | Speedup | 6.9 | 28 | 1d ago |
| LiveBench | phi-balancing | Accuracy | 46.83 | 15 | 16d ago |
| All Benchmarks | RESMERGE | Overall Average Score | 53.74 | 12 | 1d ago |
| MM-VET | ECSO | REC | 39.5 | 12 | 3mo ago |
| Aggregate Suite PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c | | Average Score | 69 | 10 | 3mo ago |
| Reasoning, Knowledge, and Biomedicine combined datasets (test) | Reasoning | Average Score | 60.47 | 9 | 3mo ago |
| RWQA | CoVT-7B | Score | 71.8 | 8 | 1d ago |
| Downstream Suite | DCDM (MoE) | Average Score | 39.38 | 8 | 16d ago |
| ChartQA | DeepLatent-RL-7B | Score | 86.4 | 7 | 1d ago |
| Visual Probe-H | DeepLatent-RL-7B | Score | 38.7 | 7 | 1d ago |
| Average Downstream Benchmark Suite | DoGraph | Average Accuracy | 37.9 | 7 | 1mo ago |
| LiveBench 1125 | General Teacher | Score | 52.1 | 6 | 7d ago |
| Instruction Tuning Suite (BIG-bench Hard, MMLU, TyDi QA, MGSM) | Flan-PaLM 2 (L) | Average Score | 74.1 | 4 | 3mo ago |
| ExpSuite-Static Overall | ExpGraph | Average Score | 78.75 | 2 | 2d ago |