Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

About

Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives, view refinement and description refinement, which together we term Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, supporting the need to remove redundant information in text-visual alignment.
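
The two refinement steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy keep-or-drop order, the threshold values, the box format `(x1, y1, x2, y2)`, and the names `greedy_filter`, `views`, and `desc_embeddings` are all assumptions made for illustration. The abstract states only that views with high IoU and descriptions with high pairwise cosine similarity are removed.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def greedy_filter(items, similarity, threshold):
    """Return indices of items kept by a greedy pass (an assumed strategy).

    An item is kept only if its similarity to every previously kept
    item is at or below the threshold.
    """
    kept = []
    for i, item in enumerate(items):
        if all(similarity(item, items[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy inputs: three image crops and three description embeddings.
views = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
desc_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]

# View refinement: drop crops that overlap heavily with a kept crop.
kept_views = greedy_filter(views, iou, threshold=0.5)
# Description refinement: drop descriptions too similar to a kept one.
kept_descs = greedy_filter(desc_embeddings, cosine, threshold=0.9)

print(kept_views, kept_descs)  # prints [0, 2] [0, 2]
```

In this toy input the second crop overlaps the first with an IoU of 81/119 (about 0.68) and the second embedding has a cosine similarity of about 0.995 with the first, so both are dropped. In practice the description embeddings would come from a text encoder such as CLIP's, and the thresholds would need tuning per dataset.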

Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan, Feng Liu • 2026

Related benchmarks

Task                  Dataset                      Result                  Rank
Image Classification  DTD                          Accuracy 50.87          487
Image Classification  ImageNet                     --                      429
Image Classification  ImageNet 1k (test)           Top-1 Accuracy 77.74    359
Image Classification  ImageNet (test)              Top-1 Accuracy 77.82    291
Image Classification  CUB-200 2011                 Accuracy 45.86          257
Image Classification  Food101 (test)               Accuracy 93.97          87
Classification        CUB                          Accuracy 58.24          85
Classification        CUB (test)                   Accuracy 65.67          79
Image Classification  Downstream Datasets Average  Average Accuracy 72.98  57
Classification        Food101                      --                      51
Showing 10 of 22 rows
