Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

About

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, because mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying two probabilistic normalizations in sequence, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while preserving informative updates. We provide both theoretical analysis and empirical evidence of how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
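
The saturation effect described in the abstract can be illustrated with a small numerical sketch. The listing does not give the exact DSPT formulation, so the code below assumes one plausible reading: the loss is the negative log of softmax(softmax(z)) at the given label, where z are the class logits. The function names (`ce_grad`, `double_softmax_loss`, `double_softmax_grad`) and the toy logits are hypothetical and chosen only for illustration; temperature scaling and the prompt parameters themselves are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(logits, label):
    """Gradient of standard cross-entropy w.r.t. the logits: p - onehot."""
    p = softmax(logits)
    return [p_i - (1.0 if i == label else 0.0) for i, p_i in enumerate(p)]


def double_softmax_loss(logits, label):
    """Assumed loss: -log softmax(softmax(logits))[label]."""
    q = softmax(softmax(logits))
    return -math.log(q[label])


def double_softmax_grad(logits, label):
    """Analytic gradient of the assumed double-softmax loss w.r.t. the logits."""
    p = softmax(logits)   # inner normalization
    q = softmax(p)        # outer normalization
    # Gradient w.r.t. the inner probabilities p (same form as cross-entropy).
    g = [q_i - (1.0 if i == label else 0.0) for i, q_i in enumerate(q)]
    # Backpropagate through the inner softmax, whose Jacobian is
    # diag(p) - p p^T; this factor shrinks toward zero when p saturates.
    dot = sum(p_i * g_i for p_i, g_i in zip(p, g))
    return [p_i * (g_i - dot) for p_i, g_i in zip(p, g)]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


# Toy case 1: the model is confidently "wrong" about the given label,
# as it would be for a mislabeled sample.
noisy_logits, noisy_label = [10.0, 0.0, 0.0], 1
# Toy case 2: the model is uncertain, so the sample is still informative.
soft_logits, soft_label = [1.0, 0.5, 0.0], 1

for name, z, y in [("confidently wrong", noisy_logits, noisy_label),
                   ("uncertain", soft_logits, soft_label)]:
    print(name,
          "| CE grad norm:", l2_norm(ce_grad(z, y)),
          "| double-softmax grad norm:", l2_norm(double_softmax_grad(z, y)))
```

Under this assumed formulation, the confidently wrong sample produces a near-maximal cross-entropy gradient but an almost vanishing double-softmax gradient, while the uncertain sample still yields a sizable update. This is only meant to show why saturation of the inner softmax can act as a filter; the paper's actual loss and analysis may differ.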

Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua • 2026

Related benchmarks

Task	Dataset	Metric	Value	
Image Classification	FGVC-Aircraft (test)	Accuracy	38.46	322
Image Classification	Caltech101 (test)	Accuracy	96.06	204
Image Classification	DTD (test)	Accuracy (DTD Test)	40.02	65
Image Classification	Average 9 datasets (test)	Accuracy	81.97	19
