Personalization Toolkit: Training Free Personalization of Large Vision Language Models

About

Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming training for each item, which makes them impractical for real-world deployment; this limitation is reflected in current personalization benchmarks, which are restricted to object-centric, single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization called Personalization Toolkit. We also introduce a comprehensive, real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Personalization Toolkit leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.
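
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine_similarity`, `retrieve_concepts`, `build_prompt`), the toy three-dimensional feature vectors, the 0.8 similarity threshold, and the prompt wording are all assumptions made for illustration. In practice the features would come from a pre-trained vision foundation model, and the matched regions would be marked in the image through visual prompting rather than described only in text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_concepts(region_features, memory, threshold=0.8):
    """Match each region feature to the closest stored concept.

    `memory` maps a concept name to a list of reference feature vectors.
    A region is labelled None when no concept reaches `threshold`
    (an assumed value, not one taken from the paper).
    """
    matches = []
    for feat in region_features:
        best_name, best_score = None, threshold
        for name, refs in memory.items():
            score = max((cosine_similarity(feat, r) for r in refs), default=0.0)
            if score >= best_score:
                best_name, best_score = name, score
        matches.append(best_name)
    return matches


def build_prompt(matches, question):
    """Compose a text prompt naming the recognized regions (hypothetical format)."""
    named = [
        f"Region {i} shows <{name}>."
        for i, name in enumerate(matches)
        if name is not None
    ]
    return " ".join(named + [question])


# Toy reference memory: one reference vector per personalized concept.
memory = {
    "my_dog": [[0.9, 0.1, 0.0]],
    "my_mug": [[0.0, 0.2, 0.95]],
}

# Toy features for two regions detected in a query image.
regions = [[0.88, 0.15, 0.02], [0.5, 0.5, 0.5]]

matches = retrieve_concepts(regions, memory)
prompt = build_prompt(matches, "What is happening in the image?")
```

Because the memory is only a dictionary of reference features, adding a new concept in this sketch means adding an entry rather than retraining anything, which is the property the training-free framing relies on.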

Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Captioning | Personalization Benchmark | Single Score | 65.8 | 23 |
| MVQA | Personalization Benchmark | MVQA Score (Single) | 70 | 23 |
| MLLM Personalization | LCMP-E | ACC-C | 56.25 | 12 |
| MLLM Personalization | LCMP-H | ACC-C | 46.11 | 12 |
| Visual Grounding | MC-LLaVA | VG Score | 0.723 | 11 |
| Image Captioning | MyVLM | Caption Recall (Single) | 0.97 | 11 |
| Multiple-choice Question Answering | Yo'LLaVA | Choice-V & T Accuracy (Single) | 92.2 | 11 |
| Recognition | Yo'LLaVA | Rec. Single | 94.6 | 11 |
| Recognition | MyVLM | Single Recall | 97.2 | 11 |
| Image Captioning | MC-LLaVA | Caption Recall (Single) | 72.9 | 11 |
Showing 10 of 35 rows
