
CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

About

CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture, however, requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. Existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, which limits the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
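The paper's exact loss formulations are not reproduced on this page. As a rough illustration of the idea behind cross-modal relational distillation (XRD), the sketch below builds teacher and student image-text similarity distributions over a batch and matches them with a KL term in both directions; the function name, temperature value, and tensor shapes are assumptions for illustration only, not the authors' implementation.

import torch
import torch.nn.functional as F


def cross_relational_kd_loss(
    s_img: torch.Tensor,  # student image embeddings, shape (B, Ds)
    s_txt: torch.Tensor,  # student text embeddings,  shape (B, Ds)
    t_img: torch.Tensor,  # teacher image embeddings, shape (B, Dt)
    t_txt: torch.Tensor,  # teacher text embeddings,  shape (B, Dt)
    tau: float = 0.07,    # softmax temperature (assumed value)
) -> torch.Tensor:
    # L2-normalize so dot products become cosine similarities.
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)

    # Batch-wise similarity matrices (B x B) for both cross-modal directions.
    s_i2t = s_img @ s_txt.t() / tau
    s_t2i = s_txt @ s_img.t() / tau
    t_i2t = t_img @ t_txt.t() / tau
    t_t2i = t_txt @ t_img.t() / tau

    # KL(teacher || student) over each row's similarity distribution,
    # averaged over both directions to impose bidirectional symmetry.
    kl_i2t = F.kl_div(F.log_softmax(s_i2t, dim=-1),
                      F.softmax(t_i2t, dim=-1), reduction="batchmean")
    kl_t2i = F.kl_div(F.log_softmax(s_t2i, dim=-1),
                      F.softmax(t_t2i, dim=-1), reduction="batchmean")
    return 0.5 * (kl_i2t + kl_t2i)


if __name__ == "__main__":
    B, Ds, Dt = 8, 256, 512
    loss = cross_relational_kd_loss(
        torch.randn(B, Ds), torch.randn(B, Ds),
        torch.randn(B, Dt), torch.randn(B, Dt),
    )
    print(f"cross-relational KD loss: {loss.item():.4f}")

In this toy version the student simply mimics the teacher's batch-level similarity structure; the actual CLIP-RD objective additionally includes the VRD term and its own weighting, which the listing does not detail.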

Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyeong Sim • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | ImageNet-1K | Top-1 Acc | 42.1 | 1236
Image Classification | ImageNet V2 | -- | -- | 611
Image Classification | EuroSAT | Accuracy | 25.5 | 569
Image Classification | Food101 | Accuracy | 43.2 | 457
Image Classification | SUN397 | Accuracy | 52 | 431
Image Classification | CIFAR-10 | Accuracy | 75.5 | 421
Image Classification | RESISC45 | Accuracy | 32.6 | 349
Image Classification | Caltech101 | Accuracy | 78 | 228
Image-to-Text Retrieval | MSCOCO | R@1 | 27.8 | 129
Text-to-Image Retrieval | MSCOCO | R@1 | 25.1 | 123
Showing 10 of 17 rows
