
Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering

About

The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, on the image clustering task without re-training or fine-tuning. As model size increases, a high-norm artifact anomaly appears among the patch tokens of multi-head attention. We observe that this anomaly reduces accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address them, we propose Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by inspecting one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained model. ITAE improves clustering accuracy on multiple datasets by producing more expressive features in the latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving clustering performance without the need for re-training or fine-tuning.
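To make the idea concrete, here is a minimal single-head sketch of artifact attenuation during inference. It flags artifact tokens by an anomalously large key norm and scales down the attention paid to them before the softmax. The threshold rule, the attenuation factor, and the choice of keys (rather than queries or values) as the inspected QKV facet are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def itae_attention(q, k, v, norm_thresh=2.0, atten_factor=0.1):
    """Single-head attention with inference-time artifact attenuation.

    Tokens whose key norm lies more than `norm_thresh` standard
    deviations above the mean are treated as artifacts, and attention
    *to* them is scaled by `atten_factor` before normalization.
    Both hyperparameters are illustrative, not from the paper.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # (n, n) attention logits
    norms = np.linalg.norm(k, axis=-1)     # per-token key norms
    artifacts = norms > norms.mean() + norm_thresh * norms.std()
    # Attenuating a column scales the attention weight to that token.
    logits[:, artifacts] += np.log(atten_factor)
    return softmax(logits) @ v
```

In a real pretrained ViT this manipulation would be applied inside each attention block at inference time (e.g. via a forward hook), leaving the model weights untouched.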

Kazumoto Nakamura, Yuji Nozawa, Yu-Chieh Lin, Kengo Nakata, Youyang Ng • 2024

Related benchmarks

Task              Dataset        Metric  Result  Rank
Image Clustering  CIFAR-10       NMI     0.8682  243
Image Clustering  STL-10         ACC     82.76   229
Image Clustering  CIFAR-100      ACC     65.02   101
Image Clustering  Tiny-ImageNet  ACC     0.6823  37

Other info

Code
