Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

About

Image description datasets play a crucial role in advancing applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One is the scraping of image-text pairs from the web; despite their abundance, these descriptions are often noisy and of low quality. The other is human labeling. Descriptions in datasets such as COCO are generally very short and lack detail. Although detailed image descriptions can be annotated by humans, the high annotation cost limits feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, converting as much of the visual information as possible into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verify the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output with less hallucination.

Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang • 2024

Related benchmarks

Task                              Dataset                         Result            Rank
Fine-grained Image Captioning     DetailCaps (test)               CAPTURE 62.53     29
Image Captioning                  DID-Bench GT-{GPT4-V}           BLEU-1 33.53      19
Image Captioning                  DID-Bench GT-{LLaVA}            BLEU-1 32.45      19
Image Captioning                  DID-Bench GT-GPT4-V 1.0 (test)  BLEU-1 33.73      15
Image Captioning                  DID-Bench GT-LLaVA (test)       BLEU-1 37.04      15
Image Reconstruction Similarity   D2I-Bench                       CLIP Score 76.21  15
Image Captioning                  COMPOSITIONCAP (test)           ROUGE-L 32.9      14
Multimodal Evaluation             DID-Bench                       CLIP-S Score 41   12
Linguistic Complexity Evaluation  LIN-Bench                       ARI 10.05         12
