
What Makes for Good Image Captions?

About

This paper establishes a formal information-theoretic framework for image captioning, conceptualizing captions as compressed linguistic representations that selectively encode semantic units in images. Our framework posits that good image captions should balance three key properties: they should be informationally sufficient, minimally redundant, and readily comprehensible by humans. By formulating these properties as quantitative measures with adjustable weights, our framework provides a flexible foundation for analyzing and optimizing image captioning systems across diverse task requirements. To demonstrate its applicability, we introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information. We present both a theoretical proof that PoCa improves caption quality under certain assumptions and empirical validation of its effectiveness across various image captioning models and datasets.
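
As a rough illustration of what "quantitative measures with adjustable weights" could look like in practice, the sketch below combines three per-caption scores into a single value. It is not the paper's formulation: the linear form, the name `caption_score`, the `AspectWeights` class, and all numeric values are assumptions made for this example, and the three input scores are treated as already computed by some external procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AspectWeights:
    """Hypothetical task-specific weights for the three aspects."""
    sufficiency: float = 1.0
    redundancy: float = 1.0
    comprehensibility: float = 1.0


def caption_score(sufficiency: float,
                  redundancy: float,
                  comprehensibility: float,
                  weights: AspectWeights = AspectWeights()) -> float:
    """Assumed linear trade-off: reward sufficiency and comprehensibility,
    penalize redundancy. The paper's actual measures may differ."""
    return (weights.sufficiency * sufficiency
            - weights.redundancy * redundancy
            + weights.comprehensibility * comprehensibility)


# Made-up scores for two candidate captions of the same image.
terse = dict(sufficiency=0.4, redundancy=0.1, comprehensibility=0.9)
detailed = dict(sufficiency=0.9, redundancy=0.5, comprehensibility=0.6)

# A hypothetical task that values coverage over brevity.
coverage_weights = AspectWeights(sufficiency=2.0,
                                 redundancy=0.5,
                                 comprehensibility=0.5)

for name, scores in (("terse", terse), ("detailed", detailed)):
    print(name,
          round(caption_score(**scores), 3),
          round(caption_score(**scores, weights=coverage_weights), 3))
```

With equal weights the terse caption scores higher in this toy setup, while the coverage-oriented weights favor the detailed caption, which is the kind of task-dependent trade-off the adjustable weights are meant to express.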

Delong Chen, Samuel Cahyawijaya, Etsuko Ishii, Ho Shu Chan, Yejin Bang, Pascale Fung • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Fine-grained Image Captioning | DetailCaps (test) | CAPTURE 55.38 | 29 |
| Image Captioning | DID-Bench GT-{LLaVA} | BLEU-1 2.26 | 19 |
| Image Captioning | DID-Bench GT-{GPT4-V} | BLEU-1 3.09 | 19 |
| Image Reconstruction Similarity | D2I-Bench | CLIP Score 75.2 | 15 |
| Image Captioning | DID-Bench GT-GPT4-V 1.0 (test) | BLEU-1 4.43 | 15 |
| Image Captioning | DID-Bench GT-LLaVA (test) | BLEU-1 4.29 | 15 |
| Multimodal Evaluation | DID-Bench | CLIP-S Score 38.36 | 12 |