
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

About

High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters
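The divide-then-aggregate idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only a spatial split into four quadrants (the semantic patches are omitted), it replaces the MLLM with a caller-supplied `describe` function, and it uses exact string agreement between candidate captions as a crude stand-in for semantic-level filtering. The names `spatial_patches`, `filter_consistent`, `divide_then_aggregate`, and `fake_describe` are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def spatial_patches(width: int, height: int) -> List[Box]:
    """Split an image of the given size into four quadrant boxes.

    Assumption: a 2x2 grid; the actual patch layout may differ.
    """
    mx, my = width // 2, height // 2
    return [
        (0, 0, mx, my),
        (mx, 0, width, my),
        (0, my, mx, height),
        (mx, my, width, height),
    ]


def filter_consistent(candidates: List[List[str]], min_votes: int = 2) -> List[str]:
    """Keep statements that occur in at least `min_votes` candidate captions.

    Assumption: exact string equality stands in for semantic comparison.
    """
    votes = Counter(s for cand in candidates for s in set(cand))
    seen, kept = set(), []
    for cand in candidates:
        for s in cand:
            if votes[s] >= min_votes and s not in seen:
                seen.add(s)
                kept.append(s)
    return kept


def divide_then_aggregate(
    width: int,
    height: int,
    describe: Callable[[Box], List[List[str]]],
) -> str:
    """Describe each patch, filter each patch's statements, then merge them."""
    details: List[str] = []
    for box in spatial_patches(width, height):
        for statement in filter_consistent(describe(box)):
            if statement not in details:
                details.append(statement)
    return " ".join(details)


def fake_describe(box: Box) -> List[List[str]]:
    """Hypothetical captioner: two sampled captions per patch.

    Only the first statement is shared between the two samples, so the
    second statement in each sample is dropped by the filter.
    """
    shared = f"Object at {box[:2]}."
    return [[shared, "A unicorn."], [shared, "A dragon."]]


caption = divide_then_aggregate(100, 80, fake_describe)
```

In a real pipeline, `describe` would crop the image to each box and query an MLLM several times, and the agreement check would compare statements by meaning instead of by exact text.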

Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Fine-grained Image Captioning | DetailCaps (test) | CAPTURE | 64.49 | 29 |
| Image Captioning | DID-Bench GT-{LLaVA} | BLEU-1 | 42.84 | 19 |
| Image Captioning | DID-Bench GT-{GPT4-V} | BLEU-1 | 40.3 | 19 |
| Image Captioning | DID-Bench GT-GPT4-V 1.0 (test) | BLEU-1 | 36.83 | 15 |
| Image Captioning | DID-Bench GT-LLaVA (test) | BLEU-1 | 39.93 | 15 |
| Image Reconstruction Similarity | D2I-Bench | CLIP Score | 76.48 | 15 |
| Image Captioning | COMPOSITIONCAP (test) | ROUGE-L | 34.6 | 14 |
| Linguistic Complexity Evaluation | LIN-Bench | ARI | 12.49 | 12 |
| Multimodal Evaluation | DID-Bench | CLIP-S Score | 41.19 | 12 |
| Image Captioning | DID-Bench | CIDEr | 3.31 | 4 |
