| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Visual Question Answering | RealworldQA | Accuracy | 80.2 | 179 |
| Real-world Visual Question Answering | RealWorldQA | Accuracy | 77.8 | 140 |
| Real-world Question Answering | RealWorldQA | Overall Score | 78.7 | 58 |
| Real-world Multimodal Reasoning | RealWorldQA | Accuracy | 75.4 | 57 |
| Real-world Visual Understanding | RealWorldQA | Accuracy | 81.4 | 47 |
| Spatial Reasoning | RealWorldQA | Accuracy | 69.67 | 45 |
| Vision-centric Reasoning | RealWorldQA | Accuracy | 75.4 | 38 |
| Visual Question Answering | RealWorldQA (test) | Accuracy | 79 | 36 |
| Real-world QA | RealworldQA | Accuracy | 73.1 | 33 |
| Spatial Understanding | RealWorldQA | RWQA Score | 66.01 | 30 |
| Multimodal Understanding | RealWorldQA | RWQA Score | 78 | 30 |
| Real-world Visual Understanding | RealWorldQA | Score | 72.29 | 29 |
| General Visual Understanding | RealWorldQA | Accuracy | 67.58 | 28 |
| General Reasoning & Understanding | RealWorldQA | Accuracy (RealWorldQA) | 72.6 | 21 |
| General Visual Question Answering | RealWorldQA | Score | 73.1 | 20 |
| Real-world Multimodal Interaction | RealWorldQA (test) | Accuracy | 77.8 | 18 |
| Vision Understanding | RealworldQA | Overall Score | 75.4 | 17 |
| Visual Question Answering | RealWorldQA (RWQA) | Score | 68.5 | 16 |
| Real-world Multimodal Interaction | RealWorldQA | RealWorldQA Score | 76.5 | 15 |
| Visual Question Answering | RealWorldQA 1.0 (test) | Accuracy | 0.6353 | 15 |
| Vision-Centric Understanding | RealworldQA | Accuracy | 75.4 | 10 |
| Short-answer Visual Question Answering | RealWorldQA | Accuracy | 65.1 | 9 |
| Real-world understanding | RealWorldQA | Score | 70.07 | 9 |
| Real-world Image QA | RealworldQA | Score | 60.52 | 7 |
| Real-world QA | RealworldQA v1.0 (test) | Score | 75.5 | 7 |