| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 90.4 | 354 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 91.2 | 288 |
| Referring Image Segmentation | RefCOCO+ (test B) | mIoU | 78.59 | 267 |
| Referring Expression Segmentation | RefCOCO+ (testB) | cIoU | 92.1 | 256 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 91.4 | 253 |
| Visual Grounding | RefCOCO+ (testA) | Accuracy | 94.7 | 245 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 91.81 | 216 |
| Referring Expression Comprehension | RefCOCO+ (test-A) | Accuracy | 94.7 | 172 |
| Referring Expression Comprehension | RefCOCO+ (test-B) | Accuracy | 85.6 | 167 |
| Referring Image Segmentation | RefCOCO+ (testA) | mIoU | 2,982 | 112 |
| Referring Image Segmentation | RefCOCO+ (test A) | oIoU | 78.7 | 89 |
| Referring Segmentation | refCOCO+ (testA) | cIoU | 0.842 | 60 |
| Referring Segmentation | refCOCO+ (val) | cIoU | 80.1 | 49 |
| Referring Expression Segmentation | RefCOCO+ UMD (testB) | Overall IoU | 50.43 | 34 |
| Referring Expression Segmentation | RefCOCO+ UMD (testA) | Overall IoU | 63.74 | 34 |
| Referring Expression Segmentation | RefCOCO+ UMD (val) | Overall IoU | 57.95 | 34 |
| Localization | RefCOCO+ (val) | Accuracy | 85.05 | 32 |
| Localization | RefCOCO+ (testB) | Accuracy | 78.77 | 26 |
| Localization | RefCOCO+ (testA) | Accuracy | 91.56 | 26 |
| Referring Expression Segmentation | RefCOCO+ UMD partition (test B) | oIoU | 68.1 | 23 |
| Referring Expression Segmentation | RefCOCO+ UMD partition (test A) | oIoU | 78.7 | 23 |
| Referring Expression Grounding | RefCOCO+ (testB) | Accuracy | 83.5 | 23 |
| Referring Expression Grounding | RefCOCO+ (testA) | Accuracy | 92.8 | 23 |
| Referring Expression Segmentation | RefCOCO+ UNC (val) | cIoU | 70.3 | 18 |
| Referring Expression Comprehension | RefCOCO+ 80 (val) | Accuracy | 87.43 | 17 |