| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| BBH | DECENTMEM | Accuracy | 90.5 | 85 | 12d ago |
| SCALE MultiChallenge | BRAID | Accuracy | 65.1 | 81 | 3mo ago |
| LLaVA Bench (val) | MARS | Perplexity | 2.1875 | 44 | 3mo ago |
| BBH (val) | G2IS | Accuracy | 65.81 | 42 | 3mo ago |
| SciFact (test) | LLM annotation | Macro-F1 | 76.17 | 37 | 1d ago |
| VitaminC (test) | EvoPool | Macro-F1 | 78.4 | 37 | 1d ago |
| FEVER (test) | EvoPool | Macro F1 | 85.18 | 37 | 1d ago |
| GAIA Text | | Accuracy | 76.4 | 19 | 1mo ago |
| Video-TT | OmniJigsaw (CMM) | Accuracy | 46.5 | 19 | 1mo ago |
| BBH | | Acc | 83.03 | 16 | 1mo ago |
| Frames | Tongyi DeepResearch 30B | Accuracy | 90.6 | 13 | 1mo ago |
| Humanity's Last Exam (HLE) | gemini-2.5-pro | Pass@1 Score | 18.4 | 13 | 3mo ago |
| Arena-Hard 2.0 (test) | TCR | Overall Accuracy | 52.9 | 12 | 8d ago |
| AQuA | | Accuracy | 28.35 | 12 | 22d ago |
| TOMATO | Qwen3-VL-8B + SynRL | Accuracy | 38.1 | 9 | 2mo ago |
| G-Bench Medical (val) | MemGraphRAG | Recall | 90.42 | 8 | 1d ago |
| Seal-0 | | Accuracy (Seal-0) | 53.4 | 8 | 1mo ago |
| BIG-bench Hard | FLAN-T5 | Orig Score | 39.3 | 7 | 3mo ago |
| Big-Bench Hard & others | InfiGFusion | Abstract Algebra Score | 88 | 6 | 9d ago |
| BBH | | BBH Solution Rate | 67.4 | 6 | 3mo ago |
| SCoRE (test) | SelfBudgeter | Accuracy | 16.26 | 5 | 1mo ago |
| SciFact (val) | EvoPool | Macro-F1 | 71.15 | 4 | 1d ago |
| VitaminC (val) | EvoPool | Macro F1 Score | 69.99 | 4 | 1d ago |
| FEVER (val) | EvoPool | Macro-F1 | 84.67 | 4 | 1d ago |
| TVbench | Qwen3-VL-8B + SynRL | Accuracy | 54.7 | 4 | 2mo ago |