GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models

About

Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
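The textual coupling step described above (hard prompt tokens concatenated with soft prompt tokens, then passed through a learnable mapping layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token counts, the embedding size, and the choice of a single per-token linear map initialised to the identity are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import random

def make_tokens(n_tokens, dim, rng):
    """Create n_tokens random embedding vectors of size dim (illustrative only)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_tokens)]

def linear(tokens, weight, bias):
    """Apply y = W x + b to every token; weight has shape (d_out, d_in)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]

def build_coupled_prompt(soft_tokens, hard_tokens, weight, bias):
    """Concatenate soft and hard prompt tokens, then couple them via a mapping layer.

    Assumption: the mapping layer is one linear map shared across tokens.
    The source only states that the mapping layer is learnable.
    """
    concatenated = soft_tokens + hard_tokens  # sequence-wise concatenation
    return linear(concatenated, weight, bias)

rng = random.Random(0)
dim = 8                           # hypothetical embedding size
soft = make_tokens(4, dim, rng)   # learnable soft prompt tokens
hard = make_tokens(4, dim, rng)   # stand-in for prompts pre-trained on another dataset

# Identity initialisation (an assumption): the mapping starts as a no-op.
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
bias = [0.0] * dim

prompt = build_coupled_prompt(soft, hard, weight, bias)
print(len(prompt), len(prompt[0]))
```

In a real training setup, the soft tokens and the mapping weights would be updated by gradient descent while the hard tokens stay fixed; the sketch only shows the shape of the forward computation.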

Bin Wang, Ruotong Hu, Wentong Li, Wenqian Wang, Mingliang Gao, Runmin Cong, Wei Zhang, Xudong Jiang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Action Recognition	Kinetics 400 (test)	Top-1 Accuracy	77.8	245
Action Recognition	SSV2	Top-1 Acc	14.7	142
Action Recognition	HMDB51	Mean Accuracy	70.8	61
Action Recognition	UCF-101	Accuracy	95.5	60
Action Recognition	HMDB-51	Base Accuracy	78.3	51
Action Recognition	UCF-101	Base Accuracy	96.8	44
Action Recognition	Kinetics-400	Base Accuracy	77	42
Action Recognition	HMDB51 (val)	--	--	17
Video Action Recognition	SS v2	Base Score	18.7	15
