| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 93.7 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 94.33 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 92.2 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 92.7 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 86.9 | 235 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 87.1 | 217 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU | 80.2 | 201 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 85.72 | 197 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 91.46 | 196 |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU | 83.4 | 191 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU | 84.8 | 190 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 3,081 | 178 |
| Visual Grounding | RefCOCO+ (testB) | Accuracy | 87.9 | 169 |
| Referring Expression Comprehension | RefCOCO (test-B) | Accuracy | 92.5 | 160 |
| Visual Grounding | RefCOCO (TestB) | Accuracy | 92.6 | 125 |
| Visual Grounding | RefCOCO (val) | Accuracy | 95.2 | 119 |
| Referring Image Segmentation | RefCOCO (test B) | mIoU | 84.52 | 119 |
| Visual Grounding | RefCOCO (TestA) | Accuracy | 96.5 | 117 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU | 81.28 | 117 |
| Referring Expression Segmentation | RefCOCOg (val) | cIoU | 81.3 | 107 |
| Visual Grounding | RefCOCOg (test) | Accuracy | 93.3 | 96 |
| Referring Expression Segmentation | RefCOCOg (test-u) | cIoU | 78.9 | 78 |
| Referring Segmentation | refCOCO (val) | cIoU | 84.8 | 51 |
| Referring Expression Segmentation | refCOCO UMD (val) | cIoU | 85.1 | 50 |
| Referring Expression Comprehension | RefCOCO v1 (val) | Top-1 Accuracy | 92.83 | 49 |