BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

About

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. To optimize the resulting continuous multi-objective reward effectively, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B, Qwen2.5-VL-3B, and Qwen2.5-VL-7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
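The contrast between vanilla GRPO and reward-decoupled normalization can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the epsilon constant, and the toy reward values are assumptions made for illustration, and any further batch-level rescaling that a full GDPO-style pipeline might apply is left out. The sketch assumes each sampled caption in a group receives several continuous reward components (for instance correctness, coverage, and linguistic quality) and compares normalizing the summed reward against normalizing each component before summing.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def zscore(values):
    """Standardize a list of rewards within one group of sampled captions."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards):
    """Vanilla GRPO-style: sum the components per caption, then normalize the sums."""
    totals = [sum(components) for components in rewards]
    return zscore(totals)


def decoupled_advantages(rewards):
    """Reward-decoupled: normalize each component across the group, then sum."""
    n_components = len(rewards[0])
    per_component = [
        zscore([caption[k] for caption in rewards]) for k in range(n_components)
    ]
    return [sum(col[i] for col in per_component) for i in range(len(rewards))]


if __name__ == "__main__":
    # Hypothetical group of 4 captions, each with 3 reward components on
    # different scales: (correctness, coverage, linguistic quality).
    group = [
        [8.0, 0.2, 0.9],
        [6.0, 0.9, 0.4],
        [2.0, 0.5, 0.7],
        [4.0, 0.7, 0.1],
    ]
    print("summed-then-normalized:", grpo_advantages(group))
    print("normalized-then-summed:", decoupled_advantages(group))
```

In the summed variant, a component with a large numeric range dominates the advantage, whereas the decoupled variant puts every component on a comparable scale before they are combined.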

Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch • 2026

Related benchmarks

Task	Dataset	Metric	Value	
Visual Question Answering	TextVQA	Accuracy	78.4	1455
Science Question Answering	ScienceQA	Accuracy	83.31	916
Multimodal Understanding	MMBench	Accuracy	56.86	887
Visual Question Answering	ChartQA	Accuracy	83.64	620
Multimodal Understanding	SEED-Bench	Accuracy	75.55	571
Multimodal Understanding	MMStar	Accuracy	55.93	511
Optical Character Recognition	OCRBench	Score	78.7	486
Visual Question Answering	InfoVQA	Accuracy	74.17	264
Visual Perception	BLINK	Accuracy	47.92	255
Visual Question Answering	DocVQA	Accuracy	92.78	205

