| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Referring Expression Segmentation | RefCOCOg (test) | cIoU 82.8 | 118 |
| Visual Grounding | RefCOCOg (val) | Accuracy 93.2 | 114 |
| Referring Image Segmentation | RefCOCOg (val) | oIoU 75.3 | 100 |
| Referring Expression Segmentation | RefCOCOg (val-u) | cIoU 79.4 | 89 |
| Referring Expression Comprehension | RefCOCOg (test-u) | Precision 89.37 | 71 |
| Referring Image Segmentation | RefCOCOg (test) | oIoU 74.9 | 61 |
| Referring Expression Comprehension | RefCOCOg (val-u) | Accuracy 90.58 | 57 |
| Referring Image Segmentation | RefCOCOg (test(U)) | mIoU 81.32 | 54 |
| Referring Image Segmentation | RefCOCOg (val (U)) | mIoU 80.01 | 54 |
| Visual Grounding | RefCOCOg | Accuracy 88.44 | 37 |
| Referring Expression Grounding | RefCOCOg (test) | Accuracy 90 | 37 |
| Referring Expression Generation | RefCOCOg (val) | METEOR 21.3 | 31 |
| Localization | RefCOCOg (val) | Accuracy 87.03 | 26 |
| Referring Segmentation | RefCOCOg (test) | cIoU 80.1 | 23 |
| Referring Segmentation | refCOCOg U (test) | cIoU 71.4 | 22 |
| Referring Segmentation | refCOCOg U (val) | cIoU 70.1 | 20 |
| Referring Expression Object Segmentation | RefCOCOg UMD (val) | cIoU 75.7 | 20 |
| Region-Level Captioning | refCOCOg (test) | CIDEr 168.2 | 18 |
| Referring Image Segmentation | RefCOCOg UMD (val) | mIoU 79.7 | 17 |
| Referring Image Segmentation | RefCOCOg UMD (test) | mIoU 74.3 | 16 |
| REC | RefCOCOg | ASR 95.61 | 16 |
| Referring Image Segmentation | RefCOCOg Google (val) | mIoU 69 | 15 |
| Referring Expression Segmentation | RefCOCOg Google (val) | gIoU 67.2 | 15 |
| Grounding | RefCOCOg | Score 86.77 | 14 |
| Referring Expression Grounding | RefCOCOg (val) | Accuracy (IoU=0.5) 86.7 | 14 |