
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

About

Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function, given that what constitutes a "good" caption is inherently subjective. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering multiple-choice questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL yields significant improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
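The reward idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `MCQ` structure, the `caption_reward` function, and the `keyword_answerer` stub are hypothetical names introduced here, and the stub is a simple keyword matcher standing in for the vision-free LLM. The sketch only shows the general shape of the signal, namely that the reward is the fraction of multiple-choice questions answered correctly using the caption text alone.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question about an image (hypothetical structure)."""
    question: str
    options: Sequence[str]
    answer_index: int  # index of the correct option


# An answerer sees only the caption text and the question, never the image.
Answerer = Callable[[str, MCQ], int]


def caption_reward(caption: str, questions: Sequence[MCQ], answer: Answerer) -> float:
    """Return the fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if answer(caption, q) == q.answer_index)
    return correct / len(questions)


def keyword_answerer(caption: str, q: MCQ) -> int:
    """Toy stand-in for the vision-free LLM (an assumption for illustration):
    pick the first option whose text appears in the caption, else option 0."""
    text = caption.lower()
    for i, option in enumerate(q.options):
        if option.lower() in text:
            return i
    return 0


questions = [
    MCQ("What animal is shown?", ["cat", "dog", "horse"], 1),
    MCQ("What color is the ball?", ["blue", "red", "green"], 1),
]

dense = "A brown dog chases a red ball across a lawn."
sparse = "An animal outdoors."

dense_reward = caption_reward(dense, questions, keyword_answerer)
sparse_reward = caption_reward(sparse, questions, keyword_answerer)
print(dense_reward, sparse_reward)
```

In this toy setup the detailed caption lets the answerer resolve both questions while the vague caption resolves neither, which is the property that makes answer accuracy usable as a verifiable reward. In a real pipeline the stub would be replaced by a call to a text-only language model.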

Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin • 2025

Related benchmarks

Task                           Dataset     Metric    Result  Rank
Visual Question Answering      TextVQA     Accuracy  73.26   1453
Multimodal Understanding       MMBench     Accuracy  57.84   847
Science Question Answering     ScienceQA   Accuracy  83.4    791
Visual Question Answering      ChartQA     Accuracy  83.4    519
Multimodal Understanding       SEED-Bench  Accuracy  75.67   516
Optical Character Recognition  OCRBench    Score     81      433
Multimodal Understanding       MMStar      Accuracy  56.94   407
Visual Perception              BLINK       Accuracy  47.88   241
Visual Question Answering      DocVQA      Accuracy  90.18   205
Visual Question Answering      InfoVQA     Accuracy  73.17   195
Showing 10 of 27 rows
