
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

About

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and of keywords extracted by a retrieval component. The proposed model avoids the need for object detectors, is trained with a single prompt language-modeling objective, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability for recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.
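To make the described mechanism concrete, here is a minimal sketch of how a style token and retrieved keywords can be prepended to a caption so that one language-modeling objective covers both clean and noisy sources. All names below (STYLE_TOKENS, build_prompt, the <sep> marker) are illustrative assumptions, not the authors' code:

```python
# Hypothetical sketch of the prompt construction described in the abstract.
# A style token marks the descriptive style of the source (human-annotated
# vs. web-collected), and keywords from a retrieval component carry the
# semantics; the decoder is then trained with plain language modeling.

STYLE_TOKENS = {"human": "<style:human>", "web": "<style:web>"}

def build_prompt(style: str, keywords: list[str], caption: str) -> str:
    """Concatenate style token, retrieved keywords, and target caption."""
    kw = " ".join(keywords)
    return f"{STYLE_TOKENS[style]} {kw} <sep> {caption}"

# Training example from a web-collected source:
print(build_prompt("web", ["dog", "frisbee", "park"],
                   "a dog catches a frisbee in the park"))

# At inference time, conditioning on the human-style token (with an empty
# continuation to be generated) steers decoding toward cleaner captions:
print(build_prompt("human", ["dog", "frisbee", "park"], ""))
```

Under this reading, style and semantics are disentangled at the input level: the same retrieved keywords can be decoded in either style simply by swapping the style token.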

Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara • 2021

Related benchmarks

Task              Dataset                   Metric            Result   Rank
Image Captioning  MS COCO Karpathy (test)   CIDEr             1.434    682
Image Captioning  nocaps (val)              CIDEr (overall)   122.1    93
Image Captioning  COCO (Karpathy split)     CIDEr             150.2    74
Image Captioning  nocaps (test)             CIDEr (overall)   119.3    61
Image Captioning  nocaps standard (test)    CIDEr             119.3    26
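All results above are CIDEr scores. For reference, the sketch below shows how CIDEr is typically computed with the standard pycocoevalcap package; the image ids and captions are placeholders, and note that some leaderboards report raw CIDEr (e.g. 1.434) while others scale it by 100 (e.g. 143.4):

```python
# Hedged example of CIDEr evaluation with pycocoevalcap
# (pip install pycocoevalcap). Captions are assumed to be
# pre-tokenized, lowercased strings, keyed by image id.
from pycocoevalcap.cider.cider import Cider

gts = {"img1": ["a dog catches a frisbee in the park",
                "a dog is playing with a frisbee"]}   # reference captions
res = {"img1": ["a dog catching a frisbee at the park"]}  # generated caption

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(gts, res)
print(f"CIDEr: {corpus_score:.3f}")  # multiply by 100 for the scaled form
```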
