| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Overall Vision-Language Performance | Vision-Language Tasks Aggregate | Targeted ASR 86.8 | 18 |
| Image Captioning | Vision-Language Tasks Captioning | Targeted ASR 68.9 | 18 |
| Image Classification | Vision-Language Tasks Classification | Targeted ASR 91 | 18 |
| VQA (General) | Vision-Language Tasks General VQA | Targeted ASR 98.4 | 18 |