| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Compositional Vision-Language Reasoning | Winoground | Text Score | 89.5 | 61 |
| Compositional Scene Understanding | Winoground | Text Alignment Score | 64 | 44 |
| Compositional Reasoning | Winoground | Group Score | 41.25 | 30 |
| Image-Text Matching | Winoground | Text Agreement Score | 89.5 | 26 |
| Vision-Language Compositional Reasoning | Winoground 1.0 (test) | Text Score | 89.5 | 23 |
| Compositional Evaluation | Winoground (test) | Text Score | 74 | 15 |
| Visual Question Answering | WinogroundVQA v1.0 (test) | Accuracy | 46.5 | 14 |
| Fine-grained retrieval | Winoground (test) | Text Agreement (%) | 40 | 12 |
| Image-text alignment | Winoground (test) | Text Score | 89.5 | 12 |
| Fine-grained Image-Text Matching | Winoground | Group Agreement | 25.8 | 11 |
| Image-Text Retrieval | Winoground (test) | Text Score | 74 | 10 |
| Vision-Language Reasoning | Winoground | Simple Acc | 59.88 | 9 |
| Text-to-image retrieval | Winoground | R@1 (T2I) | 0.133 | 8 |
| Vision-Language Compositional Reasoning | Winoground standard (test) | Text Score | 75.5 | 7 |
| Text Selection | Winoground | Text Score | 34 | 7 |
| Image Selection | Winoground | Image Score | 14 | 7 |
| Vision-Language Compositional Reasoning | Winoground (test) | Text Score | 61.3 | 7 |
| Vision-Language Understanding | Winoground | Text Accuracy | 61.5 | 5 |
| Image-Text Matching | Winoground 1.0 (full) | Text Agreement Score | 89.5 | 5 |
| Vision-Language Reasoning | Winoground | Text Score | 30.5 | 4 |
| Compositional Evaluation | Winoground Txt2Img | Txt2Img Score | 14 | 4 |
| Image-Text Matching | Winoground clean | Text Agreement Score | 52.63 | 4 |
| Vision-Language Alignment | Winoground | Accuracy | 63.38 | 3 |
| Image-Text Matching | Winoground (full) | Accuracy | 52.7 | 3 |
| Compositional Reasoning | Winoground (test) | Image Accuracy | 27 | 3 |