| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| SCALE MultiChallenge | BRAID | Accuracy 65.1 | 81 | 1mo ago | |
| LLaVA Bench (val) | MARS | Perplexity 2.1875 | 44 | 1mo ago | |
| BBH (val) | G2IS | Accuracy 65.81 | 42 | 1mo ago | |
| BBH | | Accuracy 85.93 | 40 | 1mo ago | |
| Video-TT | OmniJigsaw (CMM) | Accuracy 46.5 | 19 | 8d ago | |
| BBH | | Acc 83.03 | 16 | 11d ago | |
| Humanity's Last Exam (HLE) | gemini-2.5-pro | Pass@1 Score 18.4 | 13 | 1mo ago | |
| TOMATO | Qwen3-VL-8B + SynRL | Accuracy 38.1 | 9 | 1mo ago | |
| BIG-bench Hard | FLAN-T5 | Orig Score 39.3 | 7 | 1mo ago | |
| BBH | | BBH Solution Rate 67.4 | 6 | 1mo ago | |
| TVbench | Qwen3-VL-8B + SynRL | Accuracy 54.7 | 4 | 1mo ago | |
| cvbench | Qwen3-VL-4B + SynRL | Accuracy 54.9 | 4 | 1mo ago | |
| vcrbench | Qwen3-VL-8B + SynRL | Accuracy 35.6 | 4 | 1mo ago | |
| 3 Complex Reasoning (test) | | LLM Calls 3 | 2 | 23d ago | |
| Natural Scenes Dataset (NSD) (test) | Neuro-Vision to Language | BLEU-1 65.41 | 2 | 1mo ago | |
| BIG-bench Hard Orig QA | - | Original Metric Value - | 0 | 1mo ago | |