
CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

About

Instance segmentation demands costly per-pixel annotations and computationally expensive models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero-shot VFM teacher(s) by +8.5 and +7.1 AP, surpasses the adapted teacher(s) by +3.4 and +1.5 AP, and further outperforms state-of-the-art SSKD methods on both benchmarks.
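For illustration, the instance-aware pixel-wise contrastive idea described above can be sketched as an InfoNCE-style loss over pixel embeddings, where negatives are weighted by a fused mask-and-class confidence score. All function and variable names, shapes, and the exact weighting scheme below are illustrative assumptions for a minimal sketch, not CAST's actual implementation.

```python
import numpy as np

def pixelwise_contrastive_loss(embed, inst_ids, neg_scores, tau=0.1):
    """Minimal sketch of an instance-aware pixel-wise contrastive loss.

    embed:      (N, D) L2-normalized pixel embeddings
    inst_ids:   (N,)   instance id per pixel
    neg_scores: (N,)   fused mask*class confidence used to weight negatives
    tau:        temperature
    (Interface and weighting are assumptions for illustration only.)
    """
    sim = embed @ embed.T / tau                       # pairwise similarities
    same = inst_ids[:, None] == inst_ids[None, :]     # same-instance (positive) pairs
    np.fill_diagonal(same, False)                     # a pixel is not its own pair

    # Weight each negative column by its confidence so that high-scoring
    # (informative) negatives dominate the denominator.
    w = np.where(same, 1.0, neg_scores[None, :])
    exp_sim = np.exp(sim) * w
    np.fill_diagonal(exp_sim, 0.0)

    pos = (exp_sim * same).sum(axis=1)                # positive-pair mass
    denom = exp_sim.sum(axis=1)                       # positives + weighted negatives
    valid = same.any(axis=1)                          # pixels with at least one positive
    loss = -np.log((pos[valid] + 1e-12) / (denom[valid] + 1e-12))
    return loss.mean()
```

In this sketch, pulling same-instance embeddings together while pushing confident negatives apart is what enforces the inter-instance margins the abstract refers to.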

Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu • 2025

Related benchmarks

Task                  | Dataset                            | Result       | Rank
Instance Segmentation | Cityscapes (val)                   | AP 40.4      | 239
Instance Segmentation | Cityscapes, 10% labeled data (val) | mask AP 33.9 | 11
Instance Segmentation | ADE20K, 10% labeled data (val)     | mask AP 16.7 | 11
