Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

About

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara • 2021
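
The abstract's idea of conditioning on a style token plus retrieved keywords can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy embedding vectors, the `<style:...>` and `<bos>` token strings, and the helper names `retrieve_keywords` and `build_prompt` are hypothetical. It only shows how keywords might be ranked by cosine similarity to an image embedding and then placed, with a style token, in front of the caption to be generated.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_keywords(image_vec, keyword_vecs, k=2):
    """Return the k keywords whose (toy) embeddings are closest to the image."""
    ranked = sorted(
        keyword_vecs,
        key=lambda word: cosine(image_vec, keyword_vecs[word]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(style, keywords):
    """Prefix the caption with a style token and the retrieved keywords.

    The token format is hypothetical; a real model would use its own vocabulary.
    """
    return [f"<style:{style}>"] + list(keywords) + ["<bos>"]


# Toy data standing in for a shared image-text embedding space (assumed).
keyword_vecs = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_vec = [0.9, 0.1, 0.0]

keywords = retrieve_keywords(image_vec, keyword_vecs, k=2)
prompt = build_prompt("human", keywords)
print(prompt)
```

Under this reading, switching the style argument (for example from a web-collected style to a human-annotated one) changes only the first token, while the keywords carry the semantics; this is one way to picture how style and content could be kept separate.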

Related benchmarks

Task	Dataset	Result
Image Captioning	MS COCO Karpathy (test)	CIDEr: 1.434	706
Image Captioning	nocaps (val)	CIDEr (Overall): 122.1	115
Image Captioning	COCO (Karpathy split)	CIDEr: 150.2	74
Image Captioning	NoCaps (test)	CIDEr (overall): 119.3	61
Image Captioning	nocaps standard (test)	CIDEr: 119.3	26

