MC-LLaVA: Multi-Concept Personalized Vision-Language Model

About

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy that integrates multiple concepts in a single training step. To reduce training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, which aggregates location maps to improve recognition and grounding. To further raise the performance upper bound, we incorporate an optional auxiliary loss that strengthens the proposed personalized prompts. To advance VLM personalization research, we contribute a high-quality dataset: we carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.
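The abstract does not say how visual token information is turned into initial concept tokens. As a rough illustration only, the sketch below takes one plausible reading: cluster the visual token features of a concept's images with a few k-means iterations and use the centroids as starting embeddings for the learnable concept tokens. The function name `init_concept_tokens`, its parameters, and the choice of k-means are assumptions made for this example, not the authors' released code.

```python
import numpy as np


def init_concept_tokens(visual_tokens, num_tokens, num_iters=10, seed=0):
    """Hypothetical helper: return `num_tokens` initial concept-token
    embeddings as k-means centroids of the given visual token features.

    visual_tokens: array-like of shape (n_tokens, dim), e.g. features from
    a vision encoder for the images of one concept (assumed input).
    """
    x = np.asarray(visual_tokens, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] < num_tokens:
        raise ValueError("need a 2-D array with at least `num_tokens` rows")

    # Start from `num_tokens` distinct rows chosen at random.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(x.shape[0], size=num_tokens, replace=False)].copy()

    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every centroid: (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(num_tokens):
            members = x[assign == j]
            # Leave a centroid unchanged if no token was assigned to it.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


if __name__ == "__main__":
    # Toy features standing in for encoder outputs: 64 tokens of dimension 8.
    feats = np.random.default_rng(1).normal(size=(64, 8))
    print(init_concept_tokens(feats, num_tokens=4).shape)
```

In a real training setup the returned array would be copied into the embedding rows reserved for the new concept tokens before fine-tuning; how many tokens to use per concept is left open here.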

Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang • 2024

Related benchmarks

Task	Dataset	Metric	Result
Captioning	Personalization Benchmark	Single Score	68.6	23
MVQA	Personalization Benchmark	MVQA Score (Single)	81.5	23
Personalized Understanding	OmniPBench	Rec Weight	0.924	14
Concept Recognition	Unifybench++	Weight	92.4	13
Question Answering	Unifybench++	BLEU	60.6	13
Visual Question Answering	Unifybench++	BLEU	62.3	13
Dense Multimodal Reasoning	Unifybench++	GPT Score	51.1	13
Multimodal Reasoning	Unifybench++	BLEU	29.7	13
Image Captioning	MyVLM	Caption Recall (Single)	0.975	11
Multiple-choice Question Answering	Yo'LLaVA	Choice-V & T Accuracy (Single)	94.2	11

Showing 10 of 27 rows
