Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
About
Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Current methods, however, still struggle to keep tail classes discriminable when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This design computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: a textual Equiangular Tight Frame (ETF) separation loss, a class-wise convergence loss, and a rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms state-of-the-art (SOTA) methods, with stronger performance on long-tail classes and good generalization to unseen classes.
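To make the ETF target concrete, the following is a minimal NumPy sketch of the standard simplex ETF construction from the neural collapse literature, together with a generic penalty on how far a set of class text features is from that geometry. It is not the paper's loss, whose exact form is not given in this abstract. The function names `simplex_etf` and `etf_separation_loss`, the squared-error form of the penalty, and the requirement `dim >= num_classes` are all assumptions made for illustration.

```python
import numpy as np


def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (num_classes, dim) matrix whose rows form a simplex ETF.

    Rows have unit norm and every pair of distinct rows has cosine
    similarity -1 / (num_classes - 1). Hypothetical helper for illustration.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if dim < num_classes:
        raise ValueError("this construction assumes dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Orthonormal columns U (dim x K) from the reduced QR of a Gaussian matrix.
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    # M = sqrt(K / (K - 1)) * U (I - 11^T / K); columns of M are the ETF vectors.
    etf = np.sqrt(k / (k - 1)) * (u @ centering)
    return etf.T


def etf_separation_loss(text_feats: np.ndarray) -> float:
    """Mean squared gap between pairwise cosines and the ETF value -1/(K-1).

    Assumed form of an ETF-style separation penalty, not the paper's definition.
    """
    k = text_feats.shape[0]
    normed = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    cos = normed @ normed.T
    off_diag = cos[~np.eye(k, dtype=bool)]  # drop self-similarities
    return float(np.mean((off_diag + 1.0 / (k - 1)) ** 2))
```

In this sketch, features that already form a simplex ETF give a loss of approximately zero, while features that crowd together (as tail-class text embeddings may) give a positive loss, which is the kind of signal an ETF separation term would push against.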
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet Domain Generalization (Source: ImageNet, Targets: ImageNetV2, ImageNet-Sketch, ImageNet-A, ImageNet-R) (test) | Accuracy (ImageNetV2) | 64.23 | 105 |
| Base-to-New Classification | 11 downstream datasets Balanced, τ=1 | IN Accuracy | 73.92 | 6 |
| Base-to-New Classification | 11 downstream datasets Imbalanced, τ=0.25 | Accuracy (IN.) | 72.62 | 6 |
| Base-to-New Classification | 11 downstream datasets Highly Imbalanced, τ=0.06 | IN. Score | 71.58 | 6 |
| Image Classification | ImageNet-to-Target Generalization Suite τ=0.25 (test) | IN Accuracy | 69.89 | 6 |
| Image Classification | ImageNet-to-Target Generalization Suite (τ=0.06) (test) | Accuracy (IN) | 69.58 | 6 |
| Image Classification | ImageNet-to-Target Generalization Suite Balance, τ=1 (test) | IN Accuracy | 71.4 | 6 |