
Referring Transformer: A One-step Approach to Multi-task Visual Grounding

About

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the design of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.

Muchen Li, Leonid Sigal • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 77.55 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 85.65 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.8873 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 80.01 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 79.25 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 68.99 | 235 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | 217 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy 82.26 | 207 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 201 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 59.4 | 200 |
Showing 10 of 65 rows
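
This page does not say how the Accuracy figures in the benchmark table are computed. A common convention for referring expression comprehension is to count a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5, and to report the percentage of correct predictions. The sketch below illustrates that convention only; the 0.5 threshold, the `(x1, y1, x2, y2)` box format, and the function names `box_iou` and `rec_accuracy` are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not stated on this page
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if box_iou(p, t) >= threshold)
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(rec_accuracy(predictions, ground_truth))
```

In the usage example, the first prediction matches its target exactly and the second overlaps its target with an IoU of 1/7, so only one of the two counts as correct under the assumed threshold.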
