| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Visual Question Answering | RWQA | Accuracy | 80.8 | 30 |
| Visual Question Answering | RWQA 158 (val) | Score | 80.8 | 23 |
| Multimodal Multi-choice | RWQA | Accuracy | 70.5 | 14 |
| Real-world Spatial Understanding | RWQA | Top-1 Accuracy | 67.84 | 10 |
| Real-world Multi-modal Question Answering | RWQA | Accuracy | 70.46 | 4 |