GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

About

The synergy between generative and discriminative models is receiving growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels at high-level semantics, it struggles to perceive fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically find that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore the critical factors, we delve into three aspects: (1) Conditioning mechanisms: We find that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that using only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observe that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy that prioritizes learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through these in-depth explorations, we arrive at an effective method, namely GenHancer, which consistently outperforms prior art on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAI's CLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All models and code are made publicly available.
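The conditioning finding above can be illustrated with a toy sketch: the denoiser sees noisy local (patch) features but is conditioned only on a single global token, so reconstruction stays non-trivial. The `lightweight_denoiser` function, the two-layer MLP, and all dimensions below are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def lightweight_denoiser(noisy_local, global_token, w1, w2):
    """Predict the noise added to local tokens, conditioned only on
    the global token (broadcast to every local position)."""
    cond = np.broadcast_to(global_token, noisy_local.shape)
    h = np.tanh(np.concatenate([noisy_local, cond], axis=-1) @ w1)
    return h @ w2  # predicted noise, same shape as noisy_local

# Toy setup: 4 local tokens with feature dim 8.
d = 8
local = rng.normal(size=(4, d))      # stand-in for CLIP patch features
global_tok = local.mean(axis=0)      # stand-in for the global [CLS] token
noisy = local + rng.normal(size=local.shape)

# Random weights for the illustrative two-layer denoiser.
w1 = rng.normal(scale=0.1, size=(2 * d, 16))
w2 = rng.normal(scale=0.1, size=(16, d))

pred_noise = lightweight_denoiser(noisy, global_tok, w1, w2)
print(pred_noise.shape)  # (4, 8)
```

Because only the pooled global token carries conditioning signal, the model cannot trivially copy local tokens through the condition path, which mirrors the paper's argument for why local-token conditioning collapses training.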

Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan • 2025

Related benchmarks

Task                 | Dataset     | Result        | Rank
---------------------|-------------|---------------|-----
Image Classification | CIFAR-100   | Accuracy 71.5 | 691
Image Classification | EuroSAT     | Accuracy 58.5 | 569
Image Classification | DTD         | Accuracy 48.4 | 485
Image Clustering     | CIFAR-10    | NMI 0.72      | 318
Image Classification | CIFAR-10    | Accuracy 73.7 | 246
Image Classification | ImageNet-1K | Accuracy 73.8 | 193
Text Retrieval       | Flickr30K   | --            | 100
Image Retrieval      | MS-COCO     | R@5 51.1      | 69
Image Classification | MNIST       | Accuracy 69.7 | 51
Image Retrieval      | Flickr30K   | Recall@5 81.6 | 49

(Showing 10 of 17 rows.)
