
Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

About

Object detection plays an important role in current solutions to vision and language tasks such as image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground truths for both the bounding boxes and their corresponding semantic labels, which makes object detection less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for downstream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.
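
The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes two independent stages, a class-agnostic box proposer and a separately built featurizer, that share no labels or weights. All names (`propose_boxes`, `featurize_crop`, `decoupled_features`) and the toy logic inside them are hypothetical.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[float]]        # toy grayscale image as rows of pixel values


def propose_boxes(image: Image) -> List[Box]:
    """Hypothetical class-agnostic proposer: returns the four image quadrants."""
    h, w = len(image), len(image[0])
    my, mx = h // 2, w // 2
    return [(0, 0, mx, my), (mx, 0, w, my), (0, my, mx, h), (mx, my, w, h)]


def featurize_crop(crop: Image) -> List[float]:
    """Hypothetical featurizer: [mean, max] of the crop's pixel values."""
    pixels = [p for row in crop for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def decoupled_features(
    image: Image,
    proposer: Callable[[Image], List[Box]],
    featurizer: Callable[[Image], List[float]],
) -> List[List[float]]:
    """Run the proposer, then featurize each proposed region independently."""
    feats = []
    for x0, y0, x1, y1 in proposer(image):
        crop = [row[x0:x1] for row in image[y0:y1]]
        feats.append(featurizer(crop))
    return feats


image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [2.0, 2.0, 3.0, 5.0],
    [2.0, 2.0, 3.0, 3.0],
]
features = decoupled_features(image, propose_boxes, featurize_crop)
```

Because `decoupled_features` takes both stages as plain callables, either one could be replaced without touching the other, which is the kind of flexibility decoupling is meant to provide.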

Soravit Changpinyo, Bo Pang, Piyush Sharma, Radu Soricut • 2019

Related benchmarks

Task | Dataset | Result | Rank
Image Captioning | MS-COCO (test) | CIDEr 106 | 117
Visual Question Answering | VizWiz (test) | -- | 66
Image Captioning | Conceptual Captions (test) | CIDEr 0.984 | 34
Image Captioning | Conceptual Captions (dev) | CIDEr 93.7 | 9
Sketch Captioning | COCO FS (test) | BLEU-1 53.9 | 7
Image Captioning | Conceptual Captions Google-CC 3M (dev) | CIDEr 93.7 | 7
Subjective Captioning | FS-COCO (test) | BLEU-1 78.7 | 4