BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

About

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. To optimize the resulting continuous multi-objective reward effectively, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across the LLaVA-1.5-7B, Qwen2.5-VL-3B, and Qwen2.5-VL-7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
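The contrast between vanilla GRPO and reward-decoupled normalization can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the epsilon constant, the unweighted sum of components, and the toy reward values are all assumptions made for illustration, and any further steps the actual method applies (component weights, batch-level normalization, length masking) are omitted.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def group_normalize(values):
    """Standardize scalars within one group: (v - mean) / (std + EPS)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards):
    """GRPO-style: sum the reward components of each caption first,
    then normalize the sums across the group of sampled captions."""
    totals = [sum(components) for components in rewards]
    return group_normalize(totals)


def decoupled_advantages(rewards):
    """Decoupled-style: normalize each reward component across the group
    on its own, then sum the normalized components per caption."""
    columns = list(zip(*rewards))  # one tuple per reward component
    normalized = [group_normalize(list(col)) for col in columns]
    return [sum(per_caption) for per_caption in zip(*normalized)]


# Hypothetical group of two captions with two reward components on very
# different scales; the components rank the captions in opposite orders.
toy_rewards = [
    [10.0, 0.0],  # caption 0: high on the large-scale component
    [0.0, 0.1],   # caption 1: high on the small-scale component
]

grpo = grpo_advantages(toy_rewards)
decoupled = decoupled_advantages(toy_rewards)
```

In this toy group, summing before normalizing lets the large-scale component decide the sign of the advantage, while normalizing each component first gives both components equal influence, so the two captions end up with near-zero advantages.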

Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch • 2026

Related benchmarks

Task                           Dataset     Metric    Result  Rank
Visual Question Answering      TextVQA     Accuracy  78.4    1453
Multimodal Understanding       MMBench     Accuracy  56.86   847
Science Question Answering     ScienceQA   Accuracy  83.31   791
Visual Question Answering      ChartQA     Accuracy  83.64   519
Multimodal Understanding       SEED-Bench  Accuracy  75.55   516
Optical Character Recognition  OCRBench    Score     78.7    433
Multimodal Understanding       MMStar      Accuracy  55.93   407
Visual Perception              BLINK       Accuracy  47.92   241
Visual Question Answering      DocVQA      Accuracy  92.78   205
Visual Question Answering      InfoVQA     Accuracy  74.17   195
Showing 10 of 11 rows
