| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 93.7 | 348 | |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 94.33 | 346 | |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU 91.2 | 315 | |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 92.2 | 300 | |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 92.7 | 300 | |
| Referring Image Segmentation | RefCOCO (val) | mIoU 85.72 | 274 | |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU 91.2 | 272 | |
| Referring Expression Segmentation | RefCOCO (val) | cIoU 91.2 | 261 | |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU 92 | 259 | |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 3,081 | 245 | |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 86.9 | 244 | |
| Visual Grounding | RefCOCO+ (testB) | Accuracy 87.9 | 219 | |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy 91.46 | 213 | |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 81.28 | 194 | |
| Referring Image Segmentation | RefCOCO (test B) | mIoU 84.52 | 186 | |
| Visual Grounding | RefCOCO (val) | Accuracy 95.2 | 172 | |
| Referring Expression Segmentation | RefCOCOg (val) | cIoU 86.5 | 172 | |
| Visual Grounding | RefCOCO (TestA) | Accuracy 96.5 | 162 | |
| Referring Expression Comprehension | RefCOCO (test-B) | Accuracy 92.5 | 160 | |
| Visual Grounding | RefCOCO (TestB) | Accuracy 92.6 | 159 | |
| Visual Grounding | RefCOCOg (test) | Accuracy 93.3 | 155 | |
| Referring Segmentation | refCOCO (val) | cIoU 86.3 | 84 | |
| Referring Segmentation | refCOCO (testA) | cIoU 87.5 | 83 | |
| Referring Expression Segmentation | RefCOCOg (test-u) | cIoU 78.9 | 78 | |
| Referring Segmentation | refCOCOg (val) | cIoU 84 | 72 | |