| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multimodal Understanding | VLM Evaluation Suite (Hall, MME, AI2D, RWQA, SQA, POPE, MBen, MBzh, CCB, VSR, V7W) | Hall 63.76 | 40 |
| Video Generation | VLM Evaluation Suite | Aesthetic Appeal 8.25 | 8 |
| Multimodal Understanding | VLM Evaluation Suite (GQA, MMB, MMBCN, MME, POPE, SQA, VQAv2, VQAText) LLaVA-NEXT-7B (test) | GQA Accuracy 64.2 | 7 |