
Referring Transformer: A One-step Approach to Multi-task Visual Grounding

About

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the design of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.

Muchen Li, Leonid Sigal • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 77.55 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 85.65 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 88.73 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 80.01 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 79.25 | 300 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 74.34 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | -- | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 59.4 | 252 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 68.99 | 244 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 76.77 | 230 |
Showing 10 of 81 rows
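
The table reports Accuracy for REC and mIoU for segmentation without stating how either is computed. The sketch below shows one common convention and is not taken from the paper or this page: it assumes a REC prediction counts as correct when its box IoU with the ground truth exceeds 0.5, and that mIoU is the mean of per-sample mask IoU (some works report a cumulative "overall IoU" instead). The function names, the corner box format, and the nested-list mask format are all illustrative choices.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]
# Assumed mask format: equally sized nested sequences of 0/1 values.
Mask = Sequence[Sequence[int]]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Percentage of predictions with IoU above the (assumed) 0.5 threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


def mask_iou(pred: Mask, gt: Mask) -> float:
    """IoU of two binary masks, counted pixel by pixel."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def mean_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean per-sample mask IoU, as a percentage."""
    return 100.0 * sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Under these assumptions, a score such as 85.65 on RefCOCO (val) would mean that about 86 of every 100 referring expressions were localized with a box overlapping the ground truth by more than half.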