
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

About

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
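The two components described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual implementation: `isolate_proposal`, `score_proposals`, and `left_of` are hypothetical names, and `clip_score` is a caller-supplied stand-in for CLIP's real image-text similarity (the paper uses the actual CLIP model). Only Pillow is assumed.

```python
# Sketch of ReCLIP-style region scoring: isolate each proposal by
# blurring everything outside its box, then let a CLIP-like scorer
# pick the proposal that best matches the referring expression.
from PIL import Image, ImageFilter

def isolate_proposal(image, box, blur_radius=10):
    """Blur everything outside `box` so the scorer focuses on one proposal.

    `box` is (left, upper, right, lower) in pixel coordinates.
    """
    blurred = image.filter(ImageFilter.GaussianBlur(blur_radius))
    blurred.paste(image.crop(box), (box[0], box[1]))  # keep the region sharp
    return blurred

def score_proposals(image, boxes, expression, clip_score):
    """Return the index of the proposal the scorer rates highest.

    `clip_score(img, text)` is a stand-in for CLIP image-text similarity.
    """
    scores = [clip_score(isolate_proposal(image, b), expression) for b in boxes]
    return max(range(len(boxes)), key=scores.__getitem__)

def left_of(box_a, box_b):
    """Toy spatial predicate of the kind a relation resolver combines with
    per-region scores (e.g. for "the dog to the left of the chair")."""
    return (box_a[0] + box_a[2]) / 2 < (box_b[0] + box_b[2]) / 2
```

The key design point, per the abstract, is that isolating proposals by cropping and blurring matches CLIP's contrastive pre-training distribution better than feeding raw crops, while explicit predicates like `left_of` compensate for CLIP's weak off-the-shelf spatial reasoning.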

Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach · 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 55.07 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 54.04 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 58.6 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 61.05 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 56.96 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 47.41 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 60.47 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 49.54 | 196 |
| Visual Grounding | Who's Waldo (test) | Accuracy | 29.4 | 31 |
| Spatial-Conditioned Reasoning | SCaR | RefCOCO+ Score | 17.6 | 27 |
(Showing 10 of 16 benchmark results.)
