TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

About

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers while achieving higher performance. However, the core fusion Transformer in TransVG is separate from the uni-modal encoders and therefore must be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two improvements. First, we upgrade our framework to a purely Transformer-based one by using a Vision Transformer (ViT) for vision feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.

Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang • 2022

Related benchmarks

Task	Dataset	Result
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy 75.4	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy 86.3	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy 0.884	346
Referring Expression Comprehension	RefCOCOg (test)	Accuracy 76.3	300
Referring Expression Comprehension	RefCOCOg (val)	Accuracy 76.2	300
Referring Expression Comprehension	RefCOCO+ (testB)	Accuracy 66.3	244
Referring Expression Comprehension	RefCOCO+ (testA)	Accuracy 80.5	216
Referring Expression Comprehension	RefCOCO (testB)	Accuracy 81	213
Referring Expression Comprehension	RefCOCO+ (test-B)	Accuracy 66.28	167
Referring Expression Comprehension	RefCOCOg (test(U))	Precision 76.3	71

Showing 10 of 12 rows
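
For readers unfamiliar with how accuracy is usually scored in referring expression comprehension, the following is a minimal sketch, not the authors' evaluation code. It assumes the common convention that a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5; this page does not state the threshold, so treat it as an assumption. The function names `iou` and `grounding_accuracy` are hypothetical.

```python
from typing import Sequence, Tuple

# A box is (x1, y1, x2, y2) in pixels, with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    gts: Sequence[Box],
    threshold: float = 0.5,  # assumed convention, not stated on this page
) -> float:
    """Percentage of predictions whose IoU with the ground truth exceeds threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return 100.0 * hits / len(preds)
```

Here each expression contributes exactly one predicted box, which matches the direct box regression described in the abstract: there is no ranking over candidate regions, so accuracy reduces to a per-sample hit rate.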
