
RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

About

Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
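To make the "RL-based post-training with a verifiable reward" idea concrete, here is a minimal, self-contained sketch of a REINFORCE-style update on a toy caption policy. Everything in it is hypothetical (the candidate captions, the `<bo>`/`<eo>` concept markers, the 0/1 reward, and the bandit-style policy); it only illustrates how a reward signal that checks for the personalized concept can shift probability mass toward faithful personalized captions, not the paper's actual training recipe.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "policy": a categorical distribution over candidate captions for one image.
# The captions and concept markers below are illustrative placeholders.
captions = [
    "a dog in a park",                       # generic caption, no personalization
    "<bo> the user's corgi <eo> in a park",  # mentions the personalized concept
]
concept_tag = "<bo>"

def reward(caption):
    # Verifiable 0/1 reward: does the caption reference the personalized concept?
    return 1.0 if concept_tag in caption else 0.0

logits = [0.0, 0.0]
lr = 0.5
random.seed(0)

for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(captions)), weights=probs)[0]
    r = reward(captions[i])
    # Expected reward under the current policy, used as a variance-reducing baseline.
    baseline = sum(p * reward(c) for p, c in zip(probs, captions))
    adv = r - baseline
    # REINFORCE: grad of log pi(i) w.r.t. logits is one_hot(i) - probs.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * adv * grad

probs = softmax(logits)
print(f"P(personalized caption) after training: {probs[1]:.2f}")
```

After a few hundred updates the policy concentrates on the caption that names the personalized concept, which is the core mechanism: the reward needs only a check of the output (here, substring matching), not large-scale gold captions, which is what makes RL post-training attractive when high-quality multi-concept caption data is scarce.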

Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Diagnostic Personalization | LSD | Recall | 52.1 | 18 |
| Diagnostic Personalization | ITR | Recall | 27.8 | 18 |
| Diagnostic Personalization | LAR | Recall | 0.178 | 18 |
| Personalized Grounding | Yo'LLaVA | Precision | 100 | 14 |
| Personalized Grounding | DreamBooth | Precision | 100 | 14 |
| Personalized Grounding | MyVLM | Precision | 100 | 14 |
| Personalized Image Captioning | CapEval-QAs (test) | 1-Concept Acc | +44 | 9 |
| Multi-concept Personalized Grounding | Skip-Retrieval | Precision | 1 | 7 |
| Multi-concept Personalized Grounding | Retrieval | Precision | 99.3 | 7 |
