Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
About
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets of noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.
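The conditioning idea in the abstract, a style token plus retrieved keywords placed ahead of the caption, can be illustrated with a minimal sketch. This is not the paper's implementation: the token strings (`<style:human>`, `<style:web>`, `<sep>`), the function names, and the use of cosine similarity over toy embeddings are all assumptions made for illustration.

```python
import math

# Hypothetical style tokens; the actual vocabulary used by the paper is not given here.
STYLE_TOKENS = {"human": "<style:human>", "web": "<style:web>"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_keywords(image_emb, keyword_embs, k=3):
    """Return the k keywords whose embeddings are most similar to the image embedding.

    Assumption: retrieval is a nearest-neighbour search in a shared embedding space.
    """
    ranked = sorted(
        keyword_embs.items(),
        key=lambda item: cosine(image_emb, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


def build_prompt(style, keywords):
    """Assemble the prompt tokens that precede the caption: style token, keywords, separator."""
    return [STYLE_TOKENS[style], *keywords, "<sep>"]


# Toy 2-D embeddings, for illustration only.
image_emb = [1.0, 0.0]
keyword_embs = {"dog": [0.9, 0.1], "car": [0.0, 1.0], "grass": [0.7, 0.7]}

keywords = retrieve_keywords(image_emb, keyword_embs, k=2)
prompt = build_prompt("human", keywords)
print(prompt)
```

Under these assumptions, switching the first argument of `build_prompt` between `"web"` and `"human"` is what would let a single model train on noisy web captions yet be asked for human-style captions at inference time.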
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.434 | 682 |
| Image Captioning | nocaps (val) | CIDEr (overall) | 122.1 | 93 |
| Image Captioning | COCO (Karpathy split) | CIDEr | 150.2 | 74 |
| Image Captioning | NoCaps (test) | CIDEr (overall) | 119.3 | 61 |
| Image Captioning | nocaps standard (test) | CIDEr | 119.3 | 26 |