Personalization Toolkit: Training Free Personalization of Large Vision Language Models

About

Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming training for each item, making them impractical for real-world deployment, as reflected in current personalization benchmarks limited to object-centric single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization called the Personalization Toolkit. We introduce a comprehensive, real-world benchmark designed to rigorously evaluate various aspects of the personalization task. The Personalization Toolkit leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.
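
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `retrieve`, `build_prompt`), the 0.8 threshold, and the toy two-dimensional features are all assumptions made for illustration. In practice the features would come from a pre-trained vision model, and the paper guides the LVLM with visual prompts on the image; the text hint below is only a simple stand-in for that step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(region_features, memory, threshold=0.8):
    """Match each detected region to its most similar stored concept.

    region_features: list of feature vectors, one per detected region.
    memory: dict mapping a concept name to a list of reference vectors.
    Returns (region_index, concept_name, score) for matches at or above
    the (hypothetical) similarity threshold.
    """
    matches = []
    for i, feat in enumerate(region_features):
        best_name, best_score = None, -1.0
        for name, refs in memory.items():
            # Score a concept by its closest reference vector.
            score = max((cosine(feat, r) for r in refs), default=-1.0)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            matches.append((i, best_name, best_score))
    return matches

def build_prompt(matches, question):
    """Prepend retrieved concept names to the question as a text hint."""
    if not matches:
        return question
    hints = "; ".join(f"region {i} shows <{name}>" for i, name, _ in matches)
    return f"Context: {hints}. {question}"

# Toy example: two personalized concepts and three detected regions.
memory = {"my_dog": [[1.0, 0.0]], "my_mug": [[0.0, 1.0]]}
regions = [[0.9, 0.1], [0.5, 0.5], [0.0, 2.0]]
matches = retrieve(regions, memory)
prompt = build_prompt(matches, "What is happening?")
```

Because the memory is only a dictionary of reference vectors, adding a new concept in this sketch means storing its features rather than training anything, which is the property the abstract emphasizes.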

Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Captioning	Personalization Benchmark	Single Score	65.8	23
MVQA	Personalization Benchmark	MVQA Score (Single)	70	23
MLLM Personalization	LCMP-E	ACC-C	56.25	12
MLLM Personalization	LCMP-H	ACC-C	46.11	12
Visual Grounding	MC-LLaVA	VG Score	0.723	11
Image Captioning	MyVLM	Caption Recall (Single)	0.97	11
Multiple-choice Question Answering	Yo'LLaVA	Choice-V & T Accuracy (Single)	92.2	11
Recognition	Yo'LLaVA	Rec. Single	94.6	11
Recognition	MyVLM	Single Recall	97.2	11
Image Captioning	MC-LLaVA	Caption Recall (Single)	72.9	11

Showing 10 of 35 rows
