| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Object Recognition | CC3M (test) | Recall 0.738 | 21 |
| Multi-Tag Selection | CC3M (test) | Precision 92.5 | 9 |
| Text-to-image generation | CC3M | FID 6.06 | 7 |
| Multi-Tag Selection | CC3M | Precision 0.883 | 6 |
| Vision-Language Compositional Evaluation | CC3M 50,000 random subset TripletData | Text Score 92.25 | 4 |
| Text-level Semantic Segmentation | CC3M (subset) | Caption IoU 65.5 | 4 |
| Object Recognition | CC3M | Recall 86.8 | 3 |