Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

About

Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: \emph{View Refinement} and \emph{Description refinement}, termed as \textit{\textbf{Bi}-refinement for \textbf{F}ine-grained \textbf{T}ext-visual \textbf{A}lignment} (BiFTA). \emph{View refinement} removes redundant image patches with high \emph{Intersection over Union} (IoU) ratios, resulting in more distinctive visual samples. \emph{Description refinement} removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.

Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan, Feng Liu• 2026

Related benchmarks

Task	Dataset	Result
Image Classification	DTD	Accuracy50.87	599
Image Classification	ImageNet 1k (test)	Top-1 Accuracy77.74	456
Image Classification	ImageNet	--	431
Image Classification	CUB-200 2011	Accuracy45.86	374
Image Classification	ImageNet	Top-1 Accuracy77.82	343
Image Classification	CUB	Accuracy65.67	331
Image Classification	ImageNet (test)	Top-1 Accuracy77.82	299
Image Classification	OxfordPets	Accuracy94.96	298
Image Classification	Food101	Accuracy93.97	177
Image Classification	Oxford Pets (test)	Accuracy94.96	125

Showing 10 of 27 rows

Other info

Follow for update

@wizwand_team Discord