| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Visual World Modelling | WhatsUp | GPT-4o Score: 8.2 | 18 |
| Visual Question Answering | WhatsUp | Accuracy: 99.2 | 10 |
| Spatial Reasoning | WhatsUp-B | Binary Robust Accuracy: 99.9 | 9 |
| Forward-dynamics Prediction | WhatsUp AURORA-BENCH | GPT-4o Score: 3.3 | 9 |
| Action-centric Editing | WhatsUp (test) | Human Evaluation Score: 0.25 | 4 |
| Vision-Language Spatial Reasoning | WhatsUp B-FB 2x2 directional variants | GroupScore: 66.67 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp B-LR 2x2 directional variants | GroupScore: 82.84 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp A-OU 2x2 directional variants | GroupScore: 99.03 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp A-LR 2x2 directional variants | GroupScore: 95.87 | 3 |
| Vision-Language Compositionality | WhatsUp B (1 × 4 groups) | GroupScore: 49.94 | 2 |
| Vision-Language Compositionality | WhatsUp A 1 × 4 groups | GroupScore: 56.8 | 2 |