| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multimodal Understanding | MMB | Accuracy 90.6 | 53 |
| Multimodal Benchmarking | MMB | Average Performance 100 | 40 |
| Multimodal Evaluation | MMB | Score 85.31 | 27 |
| General Vision-Language Understanding | MMB | Score 84.6 | 25 |
| Visual Grounding | MMB v1.1 | Accuracy 85.76 | 22 |
| Knowledge | MMB | Accuracy 61.98 | 21 |
| Multi-modal Understanding | MMB | Score 67 | 10 |
| Multi-modality Evaluation | MMB-en (test) | Relative Performance 100 | 10 |
| Multimodal Understanding | MMB (dev) | Accuracy 76 | 8 |
| Visual Question Answering | MMB | Score 83.2 | 8 |
| Image Captioning | MMB | Prism 81.34 | 7 |
| Multimodal Benchmarking | MMB 1.1 | Accuracy 82.2 | 6 |
| MLLM Evaluation | MMB | Overall Score 63.14 | 4 |
| Multi-modal Understanding | MMB EN | Performance Score 83.9 | 3 |
| Multimodal Reasoning | MMB-CN | Accuracy 54 | 3 |
| Multimodal Reasoning | MMB | Accuracy 62.8 | 3 |
| Image Understanding | MMB | Accuracy 76.4 | 2 |