| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Visual Grounding | RefCOCOg (val) | Accuracy | 93.2 | 93 |
| Referring Expression Segmentation | RefCOCOg (val-u) | cIoU | 79.4 | 89 |
| Referring Expression Segmentation | RefCOCOg (test) | cIoU | 81.8 | 78 |
| Referring Expression Comprehension | RefCOCOg (test-u) | Precision | 89.37 | 71 |
| Referring Expression Comprehension | RefCOCOg (val-u) | Accuracy | 90.58 | 57 |
| Referring Image Segmentation | RefCOCOg (test(U)) | mIoU | 81.32 | 46 |
| Referring Image Segmentation | RefCOCOg (val (U)) | mIoU | 80.01 | 46 |
| Referring Image Segmentation | RefCOCOg (val) | oIoU | 75.3 | 37 |
| Referring Expression Grounding | RefCOCOg (test) | Accuracy | 90 | 37 |
| Referring Expression Generation | RefCOCOg (val) | METEOR | 21.3 | 31 |
| Referring Image Segmentation | RefCOCOg (test) | cIoU | 77 | 29 |
| Localization | RefCOCOg (val) | Accuracy | 87.03 | 26 |
| Referring Segmentation | RefCOCOg (test) | cIoU | 80.1 | 23 |
| Referring Segmentation | refCOCOg U (test) | cIoU | 71.4 | 22 |
| Referring Segmentation | refCOCOg U (val) | cIoU | 70.1 | 20 |
| Referring Expression Object Segmentation | RefCOCOg UMD (val) | cIoU | 75.7 | 20 |
| Region-Level Captioning | refCOCOg (test) | CIDEr | 168.2 | 18 |
| Referring Image Segmentation | RefCOCOg UMD (val) | mIoU | 79.7 | 17 |
| Visual Grounding | RefCOCOg | Accuracy | 88.44 | 17 |
| Referring Image Segmentation | RefCOCOg UMD (test) | mIoU | 74.3 | 16 |
| REC | RefCOCOg | ASR | 95.61 | 16 |
| Referring Image Segmentation | RefCOCOg Google (val) | mIoU | 69 | 15 |
| Referring Expression Grounding | RefCOCOg (val) | Accuracy (IoU=0.5) | 86.7 | 14 |
| Referring Expression Comprehension | RefCOCOg UMD (test) | Precision@0.5 IoU | 86.18 | 14 |
| Referring Segmentation | RefCOCOg | cIoU | 76.2 | 14 |