ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
About
Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
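The two components described above can be illustrated with a toy sketch. This is a minimal pure-Python stand-in, not the authors' implementation: images are plain lists of grayscale rows, the per-proposal CLIP scores are assumed to be given, and `left_of` stands in for just one of the several relation types the resolver handles. A real pipeline would operate on PIL images/torch tensors and obtain scores from CLIP image-text similarity.

```python
# Toy sketch of ReCLIP's two components (not the paper's implementation):
# 1) region isolation: crop a proposal, or blur everything outside it,
#    so the resulting view can be scored by CLIP against the expression;
# 2) spatial relation resolution: combine per-box scores for the subject
#    and object of a relation such as "left of" with a geometric predicate.

def crop(image, box):
    """Return the sub-image inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def blur_outside(image, box, k=1):
    """Box-blur pixels outside `box`; the proposal region stays sharp."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if x0 <= x < x1 and y0 <= y < y1:
                continue  # keep the proposal region unblurred
            neigh = [image[j][i]
                     for j in range(max(0, y - k), min(h, y + k + 1))
                     for i in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = sum(neigh) / len(neigh)
    return out

def center_x(box):
    x0, _, x1, _ = box
    return (x0 + x1) / 2

def resolve_left_of(subj_scores, obj_scores, boxes):
    """Pick the subject box i maximizing subj_scores[i] times the best
    object score among boxes lying to the right of box i."""
    best_i, best_s = None, float("-inf")
    for i, bi in enumerate(boxes):
        right = [obj_scores[j] for j, bj in enumerate(boxes)
                 if j != i and center_x(bi) < center_x(bj)]
        if right:
            s = subj_scores[i] * max(right)
            if s > best_s:
                best_i, best_s = i, s
    return best_i
```

For an expression like "the cat to the left of the dog", `subj_scores` would hold CLIP's scores for "cat" on each isolated proposal and `obj_scores` the scores for "dog"; the resolver then keeps only subject-object pairs satisfying the geometric predicate.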
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 55.07 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 54.04 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 58.60 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 61.05 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 56.96 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 47.41 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 60.47 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 49.54 | 196 |
| Visual Grounding | Who's Waldo (test) | Accuracy | 29.4 | 31 |
| Spatial-Conditioned Reasoning | SCaR | RefCOCO+ Score | 17.6 | 27 |