
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles

About

Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming that shallow layers capture general features while deep layers handle task-specific knowledge. This assumption leads to uncontrolled interactions between learnable tokens and original tokens: task-specific knowledge can degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that functional specialization within VLMs occurs not between layers but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring that generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
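The core ideas in the abstract can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes Concept Entropy is an entropy computed over each head's attention distribution (the paper's exact definition is not given here), and the thresholds, function names, and mask construction are all hypothetical placeholders.

```python
import torch

def concept_entropy(attn):
    """Mean attention entropy per head.

    attn: tensor of shape (heads, queries, keys), rows summing to 1.
    Returns a (heads,) tensor. This is an assumed stand-in for the
    paper's Concept Entropy metric.
    """
    p = attn.clamp_min(1e-12)
    ent = -(p * p.log()).sum(-1)   # entropy of each (head, query) row
    return ent.mean(-1)            # average over queries -> per-head score

def classify_heads(entropy, lo=0.5, hi=1.5):
    """Assign each head a role from its entropy (illustrative thresholds)."""
    roles = []
    for h in entropy.tolist():
        if h < lo:
            roles.append("attribute")        # sharply focused heads
        elif h > hi:
            roles.append("generalization")   # diffuse, general heads
        else:
            roles.append("mixed")
    return roles

def role_based_mask(roles, n_tokens, learnable_idx):
    """Additive attention mask that hides learnable (attribute) tokens
    from the keys of generalization heads, so task-specific information
    cannot flow into them. Shape: (heads, n_tokens, n_tokens)."""
    mask = torch.zeros(len(roles), n_tokens, n_tokens)
    for h, role in enumerate(roles):
        if role == "generalization":
            mask[h, :, learnable_idx] = float("-inf")
    return mask
```

In this sketch the mask would be added to the pre-softmax attention logits of the relevant layer, which is one common way to realize per-head information-flow control; whether DeAR applies it this way is an assumption.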

Yiming Ma, Hongkun Yang, Lionel Z. Wang, Bin Chen, Weizhi Xian, Jianzhi Teng • 2026

Related benchmarks

Task                        | Dataset              | Result                 | Rank
Base-to-New Generalization  | Avg over 11 datasets | Base Score 85.94       | 90
Base-to-New Generalization  | DTD                  | Base Accuracy 83.9     | 82
Base-to-New Generalization  | ImageNet             | Base Accuracy 78.12    | 81
Base-to-New Generalization  | FGVCAircraft         | Base Performance 47.1  | 78
Base-to-New Generalization  | UCF101               | Base Accuracy 87.9     | 71
Base-to-New Generalization  | OxfordPets           | Base Score 97.34       | 64
Base-to-New Generalization  | Caltech101           | Base Score 99          | 58
Base-to-New Generalization  | StanfordCars         | Base Score 82.01       | 57
Image Classification        | ImageNet V2 (Target) | Accuracy 64.87         | 48
Base-to-New Generalization  | Flowers102           | Base Accuracy 99       | 43

(Showing 10 of 19 rows.)
