Language Models as Black-Box Optimizers for Vision-Language Models

About

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize the state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.
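
The hill-climbing loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `hill_climb_prompts`, `score_fn`, and `propose_fn` are hypothetical names. In a real setup, `score_fn` would presumably evaluate a prompt with the black-box VLM on the few-shot training images, and `propose_fn` would send the best and worst prompts so far to a chat-based LLM and parse its suggestions. Here both are replaced by deterministic toy stand-ins so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Tuple

ScoredPrompt = Tuple[str, float]


def hill_climb_prompts(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[ScoredPrompt], List[ScoredPrompt]], List[str]],
    n_iters: int = 5,
    k: int = 3,
) -> ScoredPrompt:
    """Return the best (prompt, score) pair found by a simple hill climb."""
    # Score every starting prompt once; the dict doubles as a cache.
    scored: Dict[str, float] = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        positives = ranked[:k]   # highest-scoring prompts so far
        negatives = ranked[-k:]  # lowest-scoring prompts so far
        # Ask the proposer (an LLM in the real setting) for refinements
        # given both positive and negative feedback.
        for candidate in propose_fn(positives, negatives):
            if candidate not in scored:
                scored[candidate] = score_fn(candidate)
    return max(scored.items(), key=lambda kv: kv[1])


# Toy stand-ins (assumptions for illustration only, no VLM or LLM involved).
TARGET_WORDS = {"a", "photo", "of", "{}"}
VOCAB = ["a", "photo", "of", "{}", "blurry"]


def toy_score(prompt: str) -> float:
    # Fraction of target words present in the prompt.
    return len(set(prompt.split()) & TARGET_WORDS) / len(TARGET_WORDS)


def toy_propose(positives: List[ScoredPrompt], negatives: List[ScoredPrompt]) -> List[str]:
    # Extend the current best prompt by one unused vocabulary word.
    # (A real proposer would also use `negatives`; the toy one ignores them.)
    best_prompt = positives[0][0]
    used = set(best_prompt.split())
    return [f"{best_prompt} {word}" for word in VOCAB if word not in used]


best_prompt, best_score = hill_climb_prompts(["{}", "blurry {}"], toy_score, toy_propose)
print(best_prompt, best_score)
```

Caching scores in a dictionary keeps the number of scoring calls low, which matters when each call is a query to a black-box model. Passing both the top and bottom prompts to the proposer mirrors the positive and negative feedback the abstract highlights.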

Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan • 2023

Related benchmarks

Task	Dataset	Result
Image Classification	11 Downstream Classification Datasets (ImageNet, Flowers102, DTD, OxfordPets, StanfordCars, UCF101, Caltech101, Food101, SUN397, FGVC-Aircraft, EuroSAT) standard (test)	DTD Accuracy 44.8	115
Image Classification	Average across 10 datasets	Average Accuracy 54.5	21
Image Classification	EuroSAT 16-shot	Accuracy 48	19
Image Classification	13-Dataset Image Classification Suite (IN-1K, Caltech, Cars, CUB, DTD, ESAT, FGVC, FLO, Food, Pets, Places, SUN, UCF) (test)	Accuracy (IN-1K) 59.6	17
Image Classification	DTD 16-shot	--	15
Prompt Optimization	P2-hard	DSGScore 85.9	7
Image Classification	Food101 16-shot	Accuracy 78.5	7
Image Classification	GTSRB 16-shot	Accuracy 21.2	7
Image Classification	StanfordCars 16-shot	Accuracy 56.2	7
Image Classification	SVHN 16-shot	Accuracy 20.2	7

Showing 10 of 17 rows
