Context Disentangling and Prototype Inheriting for Robust Visual Grounding

About

Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for targets that share a category with other objects. However, most previous methods underestimate such information. Moreover, they are usually designed for the standard scene (without any novel object), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding to handle both scenes. Specifically, context disentangling separates the referent and context features, which achieves better discrimination between them. Prototype inheriting uses a prototype bank to inherit the prototypes discovered from the disentangled visual features, so that the seen data is fully utilized, especially for the open-vocabulary scene. The fused features are obtained by applying the Hadamard product to the disentangled linguistic features and the visual features of prototypes, which avoids sharply adjusting the relative importance of the two types of features. They are then combined with a special token and fed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms the state-of-the-art methods in both scenarios. The code is available at https://github.com/WayneTomas/TransCP.
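
The fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes that the visual and linguistic features have already been brought to the same shape (one vector per token, equal dimension), and that the special token is prepended to the sequence. The names `hadamard` and `fuse_and_prepend_token` are hypothetical.

```python
from typing import List

Vector = List[float]


def hadamard(a: Vector, b: Vector) -> Vector:
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [x * y for x, y in zip(a, b)]


def fuse_and_prepend_token(
    visual: List[Vector],
    linguistic: List[Vector],
    reg_token: Vector,
) -> List[Vector]:
    """Fuse per-token features and put a special token in front.

    Assumption: `visual` and `linguistic` are already aligned, i.e. they
    hold the same number of tokens with the same feature dimension.
    Assumption: the special token goes at position 0 of the sequence.
    """
    if len(visual) != len(linguistic):
        raise ValueError("feature sequences must have the same length")
    fused = [hadamard(v, l) for v, l in zip(visual, linguistic)]
    # The resulting sequence would be the input to a Transformer encoder;
    # the encoder output at the special token would feed box regression.
    return [list(reg_token)] + fused


if __name__ == "__main__":
    visual = [[1.0, 2.0], [3.0, 4.0]]
    linguistic = [[0.5, 0.5], [2.0, 0.0]]
    print(fuse_and_prepend_token(visual, linguistic, [0.0, 0.0]))
```

Because the product is element-wise, neither modality is rescaled by a separate learned weight in this sketch; each feature dimension of one modality simply gates the matching dimension of the other.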

Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, Zechao Li • 2023

Related benchmarks

Task	Dataset	Result
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy 73.07	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy 84.25	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy 87.38	346
Referring Expression Comprehension	RefCOCO+ (testB)	Accuracy 63.35	244
Referring Expression Comprehension	RefCOCO+ (testA)	Accuracy 78.05	216
Referring Expression Comprehension	RefCOCO (testB)	Accuracy 79.78	213
Referring Expression Comprehension	ReferItGame (test)	Top-1 Acc 72.05	47
Phrase Localization	Flickr30K Entities (test)	Accuracy 80.04	35
Referring Expression Comprehension (RSREC)	RRSIS-D (test)	Pr@0.5 30.56	29
Remote Sensing Referring Expression Comprehension (RSREC)	RISBench (test)	Precision @ 0.5 29.87	16

Showing 10 of 14 rows
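
The Pr@0.5 entries above refer to precision at an IoU threshold of 0.5, and accuracy in referring expression comprehension is commonly computed the same way. A minimal sketch of that metric follows. It assumes boxes in `(x1, y1, x2, y2)` corner format and counts a prediction as correct when its IoU with the ground truth is strictly greater than the threshold; individual benchmarks may differ on both points. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    if len(preds) != len(gts):
        raise ValueError("need exactly one prediction per ground-truth box")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) > threshold)
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    gts = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0)]
    # Multiply by 100 to compare with the percentages in the table.
    print(100 * precision_at_iou(preds, gts))
```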
