CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation
About
Instance segmentation demands costly per-pixel annotations and computationally expensive models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to CAST is an *instance-aware pixel-wise contrastive loss* that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero-shot VFM teacher by +8.5 and +7.1 AP, surpasses the adapted teacher by +3.4 and +1.5 AP, and further outperforms state-of-the-art SSKD methods on both benchmarks.
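The exact formulation of the loss is not reproduced on this page. Purely as an illustration, the minimal PyTorch sketch below shows one plausible form of an instance-aware pixel-wise contrastive loss that fuses mask and class confidences to up-weight informative negatives. The function name, tensor layout, and the fusion rule (a simple product) are assumptions for the sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F


def instance_contrastive_loss(
    embeddings: torch.Tensor,     # (N, D) pixel embeddings sampled from the decoder
    instance_ids: torch.Tensor,   # (N,) instance (or pseudo-label) id per pixel
    mask_scores: torch.Tensor,    # (N,) predicted mask confidence per pixel, in [0, 1]
    class_scores: torch.Tensor,   # (N,) predicted class confidence per pixel, in [0, 1]
    temperature: float = 0.1,
) -> torch.Tensor:
    """Supervised-contrastive-style loss over pixels, with negatives
    re-weighted by fused mask/class confidence (an assumed fusion rule)."""
    # Fuse mask and class scores into one reliability weight per pixel;
    # a simple product is assumed here, not taken from the paper.
    fused = mask_scores * class_scores                                  # (N,)

    z = F.normalize(embeddings, dim=1)                                  # unit-norm embeddings
    logits = z @ z.t() / temperature                                    # (N, N) scaled cosine sims
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()   # numerical stability

    same = instance_ids[:, None] == instance_ids[None, :]               # same-instance mask
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = (same & ~eye).float()                                         # positives: same instance, not self
    neg = (~same).float()                                               # negatives: other instances

    # Negatives are weighted by the fused confidence of the *other* pixel,
    # so confident pixels from other instances dominate the denominator
    # (a soft form of hard-negative mining).
    exp_logits = logits.exp()
    denom = (exp_logits * pos).sum(1) + (exp_logits * fused[None, :] * neg).sum(1)

    log_prob = logits - denom.clamp_min(1e-8).log()[:, None]
    n_pos = pos.sum(1)
    valid = n_pos > 0                                                   # anchors with at least one positive
    loss = -(log_prob * pos).sum(1)[valid] / n_pos[valid]
    return loss.mean()


# Toy usage with random tensors (shapes only; not the actual training pipeline).
emb = torch.randn(512, 128)
ids = torch.randint(0, 10, (512,))
loss = instance_contrastive_loss(emb, ids, torch.rand(512), torch.rand(512))
print(loss.item())
```

Soft-weighting negatives by confidence, rather than hard-selecting a top-k subset, keeps the loss differentiable and insensitive to batch composition; applying the same signal during both adaptation and distillation is what, per the abstract, aligns the teacher and student embedding spaces.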
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instance Segmentation | Cityscapes (val) | AP | 40.4 | 239 |
| Instance Segmentation | Cityscapes 10% labeled data (val) | mask AP | 33.9 | 11 |
| Instance Segmentation | ADE20K 10% labeled data (val) | mask AP | 16.7 | 11 |