TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

About

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers while achieving higher performance. However, the core fusion Transformer in TransVG is separate from the uni-modal encoders and thus must be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two improvements. First, we upgrade our framework to a purely Transformer-based one by leveraging a Vision Transformer (ViT) for visual feature encoding. Second, we devise the Language Conditioned Vision Transformer, which removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.

Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang • 2022
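
The abstract says the model localizes the referred region by directly regressing box coordinates, rather than scoring candidate proposals. A minimal sketch of what decoding such an output could look like follows. The abstract does not give the box parametrization, so the sigmoid-normalized (cx, cy, w, h) format and the name `decode_box` are assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, img_w, img_h):
    """Map four raw regression outputs to a pixel box (x1, y1, x2, y2).

    ASSUMPTION: the four outputs are (cx, cy, w, h), normalized to the
    image size after a sigmoid. The source abstract does not specify this.
    """
    cx, cy, w, h = (sigmoid(v) for v in raw)
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    # Clamp to the image bounds so the box is always valid.
    return (
        max(0.0, x1),
        max(0.0, y1),
        min(float(img_w), x2),
        min(float(img_h), y2),
    )


if __name__ == "__main__":
    # All-zero raw outputs decode to a centered box covering half of each side.
    print(decode_box((0.0, 0.0, 0.0, 0.0), 640, 480))
```

Under this assumed format a single forward pass yields one box per expression, so no proposal ranking or non-maximum suppression step is needed.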

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 75.4 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 86.3 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.884 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 76.3 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 76.2 | 300 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 66.3 | 244 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy 80.5 | 216 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy 81 | 205 |
| Referring Expression Comprehension | RefCOCO+ (test-B) | Accuracy 66.28 | 167 |
| Referring Expression Comprehension | RefCOCOg (test(U)) | Precision 76.3 | 71 |
Showing 10 of 12 rows
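
The page does not say how the accuracy figures are computed. Referring expression comprehension is commonly scored by counting a prediction as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5. The sketch below assumes that convention; the 0.5 threshold, the strict inequality, and the function names are assumptions, not details taken from this listing.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy(preds, gts, thr=0.5):
    """Percentage of predictions whose IoU with the ground truth exceeds thr.

    ASSUMPTION: thr=0.5 with a strict '>' follows common practice for this
    task; the listing itself does not state the scoring rule.
    """
    hits = sum(iou(p, g) > thr for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(accuracy(preds, gts))
```

Because the function returns a percentage, a score on a 0 to 1 scale would correspond to the same quantity divided by 100.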
