Referring Transformer: A One-step Approach to Multi-task Visual Grounding
About
As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance due to a two-stage setup, or require the design of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture in which the two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 77.55 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 85.65 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 88.73 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 80.01 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 79.25 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 68.99 | 235 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | -- | 217 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 82.26 | 207 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | -- | 201 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 59.4 | 200 |
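
The page does not define the accuracy figures in the table. By the usual REC convention, a prediction counts as correct when its box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The sketch below assumes that convention. The `(x1, y1, x2, y2)` box format, the function names `box_iou` and `rec_accuracy`, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    iou_threshold: float = 0.5,  # assumed conventional threshold
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= iou_threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # Hypothetical boxes, for illustration only.
    predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
    ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(rec_accuracy(predictions, ground_truth))
```

The function returns a percentage rather than a fraction, matching the scale of most entries in the table.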