| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Primitives-based MAS | Average Accuracy | 77.6 | 21 | 4d ago |
| Aggregate All tasks (summary) | EvoRoute | Score | 74.6 | 20 | 4d ago |
| Aggregate (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande) (various) | Mistral (Full-Attention) | Avg Accuracy | 51.9 | 10 | 4d ago |
| Average (test) | | Hit Score | 52.45 | 6 | 4d ago |
| InternVL2-26B Task Suite Zeroing | TaLo | Persuasive Strategies Score | 58.3 | 2 | 4d ago |