DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
About
Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption leads to uncontrolled interactions between learnable tokens and original tokens: task-specific knowledge can degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose **DeAR**, a framework that achieves fine-grained VLM adaptation by **De**composing **A**ttention head **R**oles. We posit that functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: *Attribute*, *Generalization*, and *Mixed*. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
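The role-decomposition idea above can be sketched in code. The snippet below is a minimal, hypothetical illustration only: it assumes Concept Entropy is the Shannon entropy of a head's attention mass over concept tokens, and the classification thresholds (`low`, `high`) are illustrative placeholders, not values from the paper.

```python
import numpy as np

def concept_entropy(attn_weights):
    """Shannon entropy of one head's attention distribution over concept
    tokens (hypothetical interface; the paper's exact definition may differ)."""
    p = np.asarray(attn_weights, dtype=float)
    p = p / p.sum()          # normalize to a probability distribution
    p = p[p > 0]             # drop zero entries so log() is defined
    return float(-(p * np.log(p)).sum())

def classify_head(entropy, low=0.5, high=1.2):
    """Map an entropy score to a role; thresholds are illustrative only."""
    if entropy < low:
        return "Attribute"       # sharply focused on a few concepts
    if entropy > high:
        return "Generalization"  # attention spread broadly across concepts
    return "Mixed"

# A head concentrated on one concept vs. one spread uniformly.
focused = concept_entropy([0.97, 0.01, 0.01, 0.01])   # low entropy
uniform = concept_entropy([0.25, 0.25, 0.25, 0.25])   # maximal entropy, ln(4)
```

Under this sketch, the focused head would be classified as *Attribute* and the uniform one as *Generalization*; a Role-Based Attention Mask would then block learnable task tokens from attending into heads tagged *Generalization*.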
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Base-to-New Generalization | Avg over 11 datasets | Base Score | 85.94 | 90 |
| Base-to-New Generalization | DTD | Base Accuracy | 83.9 | 82 |
| Base-to-New Generalization | ImageNet | Base Accuracy | 78.12 | 81 |
| Base-to-New Generalization | FGVCAircraft | Base Performance | 47.1 | 78 |
| Base-to-New Generalization | UCF101 | Base Accuracy | 87.9 | 71 |
| Base-to-New Generalization | OxfordPets | Base Score | 97.34 | 64 |
| Base-to-New Generalization | Caltech101 | Base Score | 99 | 58 |
| Base-to-New Generalization | StanfordCars | Base Score | 82.01 | 57 |
| Image Classification | ImageNet V2 (Target) | Accuracy | 64.87 | 48 |
| Base-to-New Generalization | Flowers102 | Base Accuracy | 99 | 43 |