Vision-Language Models are Strong Noisy Label Detectors
About
Recent research on fine-tuning vision-language models has demonstrated impressive performance in various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during the fine-tuning process. To address this challenge, this paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models. DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by utilizing carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
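The detection rule described in the abstract can be illustrated with a minimal sketch. It assumes that a sample is kept as clean when its image feature is more similar to the positive prompt of its given label than to that class's negative prompt, which acts as the threshold. The function names (`cosine`, `select_clean`) and the toy two-dimensional vectors are hypothetical; in practice the features would come from the pre-trained image and text encoders and the prompts would be learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_clean(image_feats, labels, pos_prompts, neg_prompts):
    """Return indices of samples judged clean.

    Assumed rule: sample i with given label y is clean when
    sim(image_i, pos_prompts[y]) > sim(image_i, neg_prompts[y]),
    i.e. the negative prompt serves as a per-class threshold.
    """
    clean = []
    for i, (feat, y) in enumerate(zip(image_feats, labels)):
        if cosine(feat, pos_prompts[y]) > cosine(feat, neg_prompts[y]):
            clean.append(i)
    return clean


# Toy stand-ins for learned prompt embeddings (hypothetical values).
pos_prompts = [[1.0, 0.0], [0.0, 1.0]]   # one positive prompt per class
neg_prompts = [[1.0, 1.0], [1.0, 1.0]]   # one negative prompt per class

# Three image features; the third looks like class 0 but is labeled 1.
image_feats = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.1]]
labels = [0, 1, 1]

clean_idx = select_clean(image_feats, labels, pos_prompts, neg_prompts)
print(clean_idx)
```

In this toy setting the third sample is flagged as noisy because its feature is closer to the negative prompt of class 1 than to the positive one. Only the selected indices would then be passed on to the fine-tuning stage.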
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Clothing1M (test) | Accuracy | 72.44 | 546 |
| Image Classification | Webvision (test) | Accuracy | 85.12 | 57 |
| Image Classification | CIFAR-100 40% symmetric noise (test) | -- | -- | 19 |
| Image Classification | CIFAR-100 60% symmetric noise (test) | -- | -- | 19 |
| Image Classification | CIFAR-100 N (test) | -- | -- | 19 |
| Image Classification | CIFAR-100-N | -- | -- | 11 |
| Noisy label detection | CIFAR-100N natural label noise (train) | Precision | 88.43 | 8 |
| Image Classification | Tiny-ImageNet Symmetric Noise 0.2 (test) | Accuracy (Best) | 0.8291 | 5 |
| Image Classification | CIFAR-100 Symmetric Noise 0.2 (test) | Accuracy (Best) | 89.38 | 5 |
| Image Classification | CIFAR-100 Instance-dependent Noise 0.2 (test) | Accuracy (Best) | 89.38 | 5 |