RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
About
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-trained with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both the visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially on the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
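To give a flavor of how an RL-based post-training signal can avoid the need for large-scale caption supervision, the sketch below shows a rule-based, verifiable reward that scores a generated caption by whether it mentions each personalized concept. This toy function and its scoring rule are illustrative assumptions, not the actual reward used in RePIC:

```python
def concept_reward(caption: str, concepts: list[str]) -> float:
    """Toy verifiable reward for personalized captioning.

    Returns the fraction of personalized concept identifiers that
    appear in the generated caption (case-insensitive substring match).
    Illustrative sketch only -- RePIC's real reward design may differ.
    """
    if not concepts:
        return 0.0
    text = caption.lower()
    hits = sum(1 for c in concepts if c.lower() in text)
    return hits / len(concepts)


# Example: a multi-concept caption that names both concepts gets full reward
print(concept_reward("A photo of <bo> playing with <nara> in the park.",
                     ["<bo>", "<nara>"]))  # 1.0
```

A reward like this requires only concept identifiers, not gold captions, which is why RL-style post-training sidesteps the data-collection bottleneck that SFT faces in multi-concept settings.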
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Diagnostic Personalization | LSD | Recall: 52.1 | 18 |
| Diagnostic Personalization | ITR | Recall: 27.8 | 18 |
| Diagnostic Personalization | LAR | Recall: 0.178 | 18 |
| Personalized Grounding | Yo'LLaVA | Precision: 100 | 14 |
| Personalized Grounding | DreamBooth | Precision: 100 | 14 |
| Personalized Grounding | MyVLM | Precision: 100 | 14 |
| Personalized Image Captioning | CapEval-QAs (test) | 1-Concept Acc: +44 | 9 |
| Multi-Concept Personalized Grounding | Skip-Retrieval | Precision: 1 | 7 |
| Multi-Concept Personalized Grounding | Retrieval | Precision: 99.3 | 7 |