RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

About

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in daily life. In this paper, we introduce the Retrieval-Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://hoar012.github.io/RAP-Project/.
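The Remember / Retrieve / Generate steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (`Concept`, `ConceptDatabase`, `build_prompt`), the prompt layout, and the hand-written three-dimensional vectors are all hypothetical, and plain cosine similarity over those vectors stands in for the multimodal retriever mentioned in the abstract. The final generation call to an MLLM is omitted; the sketch stops at the augmented prompt.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """One user-specific record (hypothetical schema)."""
    name: str
    info: str
    embedding: List[float]  # toy stand-in for an image/text embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ConceptDatabase:
    """Key-value store: concept name -> record."""
    records: Dict[str, Concept] = field(default_factory=dict)

    def remember(self, concept: Concept) -> None:
        # Inserting or overwriting a record edits the concept
        # without touching any model weights.
        self.records[concept.name] = concept

    def retrieve(self, query_embedding: List[float], top_k: int = 1) -> List[Concept]:
        # Rank stored concepts by similarity to the query embedding.
        ranked = sorted(
            self.records.values(),
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def build_prompt(query: str, concepts: List[Concept]) -> str:
    """Prepend retrieved concept information to the user query (assumed layout)."""
    context = "\n".join(f"<{c.name}>: {c.info}" for c in concepts)
    return f"{context}\n{query}"


# Remember: store two made-up concepts.
db = ConceptDatabase()
db.remember(Concept("bo", "A white Shiba Inu owned by the user.", [0.9, 0.1, 0.0]))
db.remember(Concept("mug", "The user's blue coffee mug.", [0.0, 0.2, 0.9]))

# Retrieve: the query embedding is closest to "bo".
top = db.retrieve([0.8, 0.2, 0.1], top_k=1)

# Generate: this prompt would be passed to the MLLM together with the image.
prompt = build_prompt("What is in this photo?", top)
print(prompt)
```

Because the concept information lives in the database rather than in the model, calling `remember` again with the same name and new `info` changes what later prompts contain, which is the kind of real-time editing the abstract refers to.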

Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Diagnostic Personalization | LSD | Recall | 28.8 | 18
Diagnostic Personalization | LAR | Recall | 0.014 | 18
Diagnostic Personalization | ITR | Recall | 0.2 | 18
Personalized Understanding | OmniPBench | Rec Weight | 0.94 | 14
MLLM Personalization | LCMP-H | ACC-C | 45.81 | 12
MLLM Personalization | LCMP-E | ACC-C | 52.08 | 12
Multiple-choice Question Answering | Yo'LLaVA | Choice-V & T Accuracy (Single) | 91.7 | 11
Visual Grounding | MC-LLaVA | VG Score | 0.719 | 11
Image Captioning | MyVLM | Caption Recall (Single) | 0.937 | 11
Image Captioning | MC-LLaVA | Caption Recall (Single) | 71.1 | 11
Showing 10 of 18 rows
