| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multimodal Understanding and Question Answering | Multimodal Benchmarks (MME, OCRBench, DocVQA, RealWorldQA, VLMBlind) | MME Score: 2,386 | 33 |
| Multimodal Question Answering | 9 Multimodal Benchmarks (VQAv2, GQA, VizWiz, SQA-IMG, TextVQA, POPE, MME, MMB, MMB-CN) (test val) | VQAv2 Accuracy: 80 | 15 |
| Multimodal In-context Learning | Multimodal Benchmarks Average | Accuracy: 67.2 | 9 |