MC-LLaVA: Multi-Concept Personalized Vision-Language Model

About

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further raise the performance upper bound, we incorporate an optional auxiliary loss that strengthens the proposed personalized prompts. To advance VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.
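The abstract does not say how visual token information is turned into initial concept tokens, so the following is only a minimal sketch under a stated assumption: the visual token embeddings of a concept are split into contiguous groups and each group is mean-pooled into one concept token. The function names (`mean_vector`, `init_concept_tokens`) and the pooling scheme are hypothetical illustrations, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Return the element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def init_concept_tokens(visual_tokens: List[Vector], num_tokens: int) -> List[Vector]:
    """Hypothetical initializer (an assumption, not the paper's method).

    Splits the visual tokens into `num_tokens` contiguous groups and
    mean-pools each group to obtain one initial concept token embedding.
    """
    n = len(visual_tokens)
    if num_tokens < 1 or num_tokens > n:
        raise ValueError("num_tokens must be between 1 and len(visual_tokens)")
    # Integer boundaries guarantee that every group holds at least one token.
    bounds = [i * n // num_tokens for i in range(num_tokens + 1)]
    return [
        mean_vector(visual_tokens[bounds[i]:bounds[i + 1]])
        for i in range(num_tokens)
    ]


# Toy example: four 2-dimensional visual tokens pooled into two concept tokens.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
concept_tokens = init_concept_tokens(tokens, 2)
print(concept_tokens)
```

Starting concept tokens from pooled visual features rather than random values is one plausible reading of how such an initialization could reduce training cost; the actual procedure is described in the paper and the linked repository.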

Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Personalized Understanding | OmniPBench | Rec Weight 0.924 | 14
Image Captioning | MyVLM | Caption Recall (Single) 0.975 | 11
Multiple-choice Question Answering | Yo'LLaVA | Choice-V & T Accuracy (Single) 94.2 | 11
Recognition | MC-LLaVA | Single Score 93.2 | 11
Recognition | Yo'LLaVA | Rec. Single 96.2 | 11
Recognition | MyVLM | Single Recall 98.4 | 11
Visual Grounding | MC-LLaVA | VG Score 0.748 | 11
Visual Multiple Choice Question Answering | MC-LLaVA | Choice-V Accuracy (Single) 90.9 | 11
Image Captioning | MC-LLaVA | Caption Recall (Single) 77.2 | 11
Visual Question Answering | MC-LLaVA | VQA BLEU (Single) 70.1 | 11
Showing 10 of 18 rows
