Personalization Toolkit: Training Free Personalization of Large Vision Language Models

About

Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users and object instances, and to generate contextually tailored responses. Existing approaches typically rely on time-consuming test-time training for each user or object, which makes them impractical for real-world deployment. Current personalization benchmarks reflect this limitation: they focus on object-centric, single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization and introduce a comprehensive real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Our method leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.
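
The retrieval step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `retrieve_concepts`, `build_prompt`), the 0.8 similarity threshold, and the toy 3-dimensional feature vectors are all assumptions made for illustration. In practice the features would come from a pre-trained vision model, and the paper's visual prompting acts on the image itself; the text hint below is only a stand-in for that step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_concepts(region_features, memory, threshold=0.8):
    """Match each region feature against a memory of reference features.

    memory maps a concept name to a list of reference feature vectors.
    Returns (region_index, concept_name, score) for every region whose best
    match reaches the threshold (hypothetical value, not from the paper).
    """
    matches = []
    for idx, feat in enumerate(region_features):
        best_name, best_score = None, -1.0
        for name, refs in memory.items():
            score = max((cosine(feat, ref) for ref in refs), default=-1.0)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            matches.append((idx, best_name, best_score))
    return matches

def build_prompt(matches, question):
    """Prepend retrieved concept names to the question (text stand-in for visual prompting)."""
    if not matches:
        return question
    hints = "; ".join(f"region {i} shows <{name}>" for i, name, _ in matches)
    return f"Context: {hints}. {question}"

if __name__ == "__main__":
    # Toy memory: one reference vector per personalized concept.
    memory = {"my_dog": [[1.0, 0.0, 0.0]], "my_mug": [[0.0, 1.0, 0.0]]}
    # Two candidate regions from a query image; only the first resembles a stored concept.
    regions = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
    found = retrieve_concepts(regions, memory)
    print(build_prompt(found, "What is in the image?"))
```

Because matching is a nearest-neighbor lookup over stored features, adding a new concept only means adding its reference vectors to the memory. No model weights change, which is the sense in which such a pipeline is training-free.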

Soroush Seifi, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| MLLM Personalization | LCMP-E | ACC-C: 56.25 | 12 |
| MLLM Personalization | LCMP-H | ACC-C: 46.11 | 12 |
| Visual Grounding | MC-LLaVA | VG Score: 0.723 | 11 |
| Image Captioning | MyVLM | Caption Recall (Single): 0.97 | 11 |
| Multiple-choice Question Answering | Yo'LLaVA | Choice-V & T Accuracy (Single): 92.2 | 11 |
| Recognition | Yo'LLaVA | Rec. Single: 94.6 | 11 |
| Recognition | MyVLM | Single Recall: 97.2 | 11 |
| Image Captioning | MC-LLaVA | Caption Recall (Single): 72.9 | 11 |
| Visual Multiple Choice Question Answering | MC-LLaVA | Choice-V Accuracy (Single): 87.4 | 11 |
| Recognition | MC-LLaVA | Single Score: 79.1 | 11 |
Showing 10 of 12 rows
