| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 149.1 | 682 |
| Object Detection | MS COCO (test-dev) | mAP@.5 | 78.5 | 677 |
| Object Detection | MS-COCO 2017 (val) | mAP | 56.74 | 237 |
| Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 68.3 | 223 |
| Image Retrieval | MS-COCO 5K (test) | R@1 | 67.2 | 217 |
| Text Retrieval | MS-COCO 5K (test) | R@1 | 84.8 | 182 |
| Object Detection | MS COCO (val) | mAP | 0.603 | 138 |
| Object Detection | MS COCO novel classes | nAP | 2,450 | 132 |
| Text-to-Image Generation | MS-COCO 2014 (val) | FID | 2.47 | 128 |
| Image Retrieval | MS-COCO 1K (test) | R@1 | 80.1 | 128 |
| Object Detection | MS COCO novel classes 2017 (val) | AP | 22.73 | 123 |
| Image-to-Text Retrieval | MS-COCO 1K (test) | R@1 | 82 | 121 |
| Image Captioning | MS COCO (test) | CIDEr | 140.4 | 117 |
| Text-to-Image Generation | MS-COCO (val) | FID | 3.22 | 112 |
| Image Retrieval | MS-COCO (test) | MAP | 84.08 | 98 |
| Object Detection | MS-COCO 2017 (test) | AP | 53.9 | 82 |
| Multi-label Classification | MS-COCO 2014 (test) | mAP | 91.3 | 81 |
| Object Hallucination Evaluation | MS-COCO (POPE Adversarial) | Accuracy | 87.62 | 80 |
| Text-to-Image Generation | MS-COCO 2017 (val) | FID | 20.51 | 80 |
| Text-to-Image Retrieval | MS-COCO | R@5 | 90.8 | 79 |
| Object Hallucination Evaluation | MS-COCO POPE (Popular) | Accuracy | 90.76 | 76 |
| Text-to-Image Generation | MS-COCO | FID | 7.2 | 75 |
| Retrieval | MS-COCO | Ave Aes | 5.109 | 72 |
| Multi-label Recognition | MS-COCO | Overall F1 Score (OF1) | 78.6 | 66 |
| Image-to-Text Retrieval | MS-COCO | R@5 | 97.2 | 65 |