
LocCa: Visual Pretraining with Location-aware Captioners

About

Image captioning has been shown to be an effective pretraining method, similar to contrastive pretraining. However, incorporating location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image-captioner task interface to teach a model to read out rich information, i.e., bounding box coordinates and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa significantly outperforms standard captioners on localization downstream tasks while maintaining comparable performance on holistic tasks.
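The abstract describes reading out bounding box coordinates and captions as a single decoder target. A common way to realize this (used in Pix2Seq-style interfaces; the exact tokenization LocCa uses is not specified here, so the bin count and token format below are assumptions) is to quantize box coordinates into discrete location tokens and append them to the caption. A minimal sketch:

```python
def quantize_coord(v, size, num_bins=1000):
    # Map a pixel coordinate in [0, size] to a discrete bin index
    # in [0, num_bins - 1].
    return min(int(v / size * num_bins), num_bins - 1)

def box_to_tokens(box, width, height, num_bins=1000):
    # box = (ymin, xmin, ymax, xmax) in pixels; emit coordinate
    # tokens such as "<loc_123>" (token format is hypothetical).
    ymin, xmin, ymax, xmax = box
    coords = [
        quantize_coord(ymin, height, num_bins),
        quantize_coord(xmin, width, num_bins),
        quantize_coord(ymax, height, num_bins),
        quantize_coord(xmax, width, num_bins),
    ]
    return [f"<loc_{c}>" for c in coords]

def build_target(caption, box, width, height):
    # Decoder target for one example: caption tokens followed by
    # the box's coordinate tokens, all in one output sequence.
    return caption.split() + box_to_tokens(box, width, height)

print(build_target("a red car", (10, 20, 200, 300), width=640, height=480))
```

Because the localization task shares the captioner's plain sequence interface, the same decoder can be trained on captioning and grounding examples without task-specific heads.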

Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai • 2024

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Segmentation | RefCOCO (testA) | -- | 257
Referring Expression Segmentation | RefCOCO+ (testA) | -- | 230
Referring Expression Segmentation | RefCOCO+ (val) | -- | 223
Referring Expression Segmentation | RefCOCO (testB) | -- | 213
Referring Expression Segmentation | RefCOCO (val) | -- | 212
Referring Expression Segmentation | RefCOCO+ (testB) | -- | 210
Referring Expression Segmentation | RefCOCOg (val (U)) | -- | 89
Referring Expression Segmentation | RefCOCOg (test (U)) | -- | 78
Grounded captioning | Visual Genome | METEOR 20.7 | 4
