Personalization Toolkit: Training Free Personalization of Large Vision Language Models
About
Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users and object instances, and to generate contextually tailored responses. Existing approaches typically rely on time-consuming test-time training for each user or object, which makes them impractical for real-world deployment. This limitation is reflected in current personalization benchmarks, which focus on object-centric, single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization and introduce a comprehensive real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Our method leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.
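The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors are hand-written toy values standing in for embeddings from a pre-trained vision model, and the names `identify`, `build_prompt`, `REFERENCES`, the threshold of 0.8, and the prompt wording are all hypothetical choices made for illustration.

```python
import math

# Toy reference database: concept name -> list of feature vectors.
# In practice these would come from a pre-trained vision encoder (assumption).
REFERENCES = {
    "my_dog": [[1.0, 0.0, 0.0]],
    "my_mug": [[0.0, 1.0, 0.0]],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query, references, threshold=0.8):
    """Return (name, score) for the best-matching concept.

    The name is None when no reference reaches the threshold
    (the threshold value is an illustrative assumption).
    """
    best_name, best_score = None, -1.0
    for name, vectors in references.items():
        for vector in vectors:
            score = cosine(query, vector)
            if score > best_score:
                best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


def build_prompt(name, question):
    """Compose a text prompt that names the retrieved concept (hypothetical wording)."""
    if name is None:
        return question
    return f"The highlighted object is <{name}>. {question}"


name, score = identify([0.9, 0.1, 0.0], REFERENCES)
prompt = build_prompt(name, "What is it doing?")
```

Because personalization here reduces to a lookup over stored features, adding a new concept only requires adding its reference vectors to the database; no model weights change.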
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| MLLM Personalization | LCMP-E | ACC-C | 56.25 | 12 |
| MLLM Personalization | LCMP-H | ACC-C | 46.11 | 12 |
| Visual Grounding | MC-LLaVA | VG Score | 0.723 | 11 |
| Image Captioning | MyVLM | Caption Recall (Single) | 0.97 | 11 |
| Multiple-choice Question Answering | Yo'LLaVA | Choice-V & T Accuracy (Single) | 92.2 | 11 |
| Recognition | Yo'LLaVA | Rec. Single | 94.6 | 11 |
| Recognition | MyVLM | Single Recall | 97.2 | 11 |
| Image Captioning | MC-LLaVA | Caption Recall (Single) | 72.9 | 11 |
| Visual Multiple Choice Question Answering | MC-LLaVA | Choice-V Accuracy (Single) | 87.4 | 11 |
| Recognition | MC-LLaVA | Single Score | 79.1 | 11 |