| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Primitives-based MAS | Average Accuracy | 77.6 | 21 | 1mo ago |
| Aggregate All tasks (summary) | EvoRoute | Score | 74.6 | 20 | 1mo ago |
| Fairness and Utility Suite | Self-Debias Iter2 + Self-Correction | Average Score | 82.1 | 16 | 8d ago |
| Average GSM8K, HumanEval, ARC-c | ReMix | Accuracy | 60.77 | 13 | 1mo ago |
| Aggregated Clinical Tasks | | Average Score | 74.6 | 12 | 9d ago |
| Aggregate (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande) (various) | Mistral (Full-Attention) | Avg Accuracy | 51.9 | 10 | 1mo ago |
| Average (test) | | Hit Score | 52.45 | 6 | 1mo ago |
| VLM Tasks Average | | Accuracy | 86.7 | 3 | 25d ago |
| InternVL2-26B Task Suite Zeroing | TaLo | Persuasive Strategies Score | 58.3 | 2 | 1mo ago |