| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 93.7 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 94.33 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 92.2 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 92.7 | 300 |
| Referring Image Segmentation | RefCOCO (val) | mIoU 85.72 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU 87.1 | 257 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 86.9 | 244 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 3,081 | 230 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU 80.2 | 223 |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU 83.5 | 213 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU 85 | 212 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy 91.46 | 205 |
| Visual Grounding | RefCOCO+ (testB) | Accuracy 87.9 | 180 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 81.28 | 179 |
| Referring Image Segmentation | RefCOCO (test B) | mIoU 84.52 | 171 |
| Referring Expression Comprehension | RefCOCO (test-B) | Accuracy 92.5 | 160 |
| Visual Grounding | RefCOCO (val) | Accuracy 95.2 | 147 |
| Visual Grounding | RefCOCO (TestB) | Accuracy 92.6 | 138 |
| Referring Expression Segmentation | RefCOCOg (val) | cIoU 82.1 | 129 |
| Visual Grounding | RefCOCO (TestA) | Accuracy 96.5 | 123 |
| Visual Grounding | RefCOCOg (test) | Accuracy 93.3 | 119 |
| Referring Expression Segmentation | RefCOCOg (test-u) | cIoU 78.9 | 78 |
| Referring Segmentation | refCOCO (val) | cIoU 84.8 | 51 |
| Referring Expression Segmentation | refCOCO UMD (val) | cIoU 85.1 | 50 |
| Referring Expression Comprehension | RefCOCO v1 (val) | Top-1 Accuracy 92.83 | 49 |