| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 60.9 | 308 |
| Image-to-text retrieval | MSCOCO | R@1 81.9 | 129 |
| Text-to-image retrieval | MSCOCO | R@1 64.3 | 123 |
| Text-to-Image Retrieval | MSCOCO 1K (test) | R@1 6,390 | 118 |
| Visual Hallucination Evaluation | MSCOCO | CHAIR_i 18.2 | 104 |
| Sentence Retrieval | MSCOCO 5k (test) | R@1 80.9 | 67 |
| Image-to-text retrieval | MSCOCO 5K (test) | R@1 84.8 | 64 |
| Object Hallucination Evaluation | MSCOCO 2014 (val) | CHAIRs 54.6 | 55 |
| Text-to-Image Generation | MSCOCO 30K | FID 6.61 | 54 |
| Text-to-Image Retrieval | MSCOCO (val) | R@1 38.97 | 51 |
| Image-to-Text Retrieval | MSCOCO (val) | R@1 58.14 | 51 |
| Object Hallucination | MSCOCO 500 images 2014 (val) | Consistency Score (CS) 60.6 | 50 |
| Object Hallucination Evaluation | MSCOCO POPE | Random Accuracy 91.63 | 47 |
| Text-to-Image Generation | MSCOCO 2014 | FID (30k) 9.29 | 44 |
| Object Detection | MSCOCO (val) | AP 61.3 | 43 |
| Text-to-image Retrieval | MSCOCO (5K) | R@1 53.98 | 42 |
| Pointing game | MSCOCO 2014 (val) | Mean Accuracy (All) 69.9 | 42 |
| Object Hallucination Evaluation | MSCOCO | Accuracy 88.97 | 41 |
| Image Retrieval | MSCOCO @5000 (test) | mAP 87.27 | 39 |
| Object Hallucination Assessment | MSCOCO | CHAIR Instance Score 30.2 | 38 |
| Hallucination Evaluation | MSCOCO (val) | CHAIR_i 23.04 | 36 |
| Object Hallucination Evaluation | MSCOCO | CHAIR Scene Score 56.4 | 35 |
| Text-to-image generation | MSCOCO 30k samples 2014 (val) | FID 21.96 | 35 |
| Text Retrieval | MSCOCO | ASR@R1 100 | 33 |
| Object Detection | MSCOCO 2017 (val) | APb 47.5 | 33 |