| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Real-world Question Answering | RWQA | RWQA Accuracy 76.99 | 62 |
| Visual Question Answering | RWQA | Accuracy 80.8 | 47 |
| Visual Question Answering | RWQA 158 (val) | Score 80.8 | 23 |
| Visual Grounding | RWQA | Accuracy 72.29 | 22 |
| Multimodal Understanding | RWQA | RWQA Score 60.2 | 14 |
| Multimodal Multi-choice | RWQA | Accuracy 70.5 | 14 |
| Real-world Spatial Understanding | RWQA | Top-1 Accuracy 67.84 | 10 |
| General Evaluation | RWQA | Score 71.8 | 8 |
| Robustness Evaluation | RWQA | Accuracy 72.9 | 6 |
| Real-world Multi-modal Question Answering | RWQA | Accuracy 70.46 | 4 |