CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation
About
CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. However, its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. Existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, which limits the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework, CLIP-RD, that introduces two novel components: Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
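The repository does not include the loss code in this summary, so the following is a minimal, framework-free sketch of the general idea behind distilling cross-modal similarity distributions: the student's image-to-text and text-to-image similarity rows are matched to the teacher's via KL divergence, symmetrically in both directions. All function names (`relational_kd_loss`, `softmax`, `cosine`, `kl`) and the temperature value are hypothetical illustrations, not the paper's actual implementation.

```python
import math

def softmax(xs, tau=1.0):
    # Temperature-scaled softmax over a list of similarity logits.
    m = max(x / tau for x in xs)
    exps = [math.exp(x / tau - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def kl(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def relational_kd_loss(teacher_img, teacher_txt, student_img, student_txt, tau=0.07):
    """Hypothetical sketch: align the student's cross-modal similarity
    distributions with the teacher's, in both the image-to-text and
    text-to-image directions (the bidirectional symmetry XRD describes)."""
    loss = 0.0
    n = len(teacher_img)
    for i in range(n):
        # Image -> text similarity rows.
        t_row = softmax([cosine(teacher_img[i], t) for t in teacher_txt], tau)
        s_row = softmax([cosine(student_img[i], t) for t in student_txt], tau)
        loss += kl(t_row, s_row)
        # Text -> image similarity rows.
        t_col = softmax([cosine(teacher_txt[i], v) for v in teacher_img], tau)
        s_col = softmax([cosine(student_txt[i], v) for v in student_img], tau)
        loss += kl(t_col, s_col)
    return loss / (2 * n)
```

As a sanity check, a student whose embeddings exactly reproduce the teacher's similarity structure incurs zero loss, and any deviation yields a strictly positive penalty, since KL divergence is non-negative.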
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Accuracy | 42.1 | 1236 |
| Image Classification | ImageNet V2 | -- | -- | 611 |
| Image Classification | EuroSAT | Accuracy | 25.5 | 569 |
| Image Classification | Food101 | Accuracy | 43.2 | 457 |
| Image Classification | SUN397 | Accuracy | 52.0 | 431 |
| Image Classification | CIFAR-10 | Accuracy | 75.5 | 421 |
| Image Classification | RESISC45 | Accuracy | 32.6 | 349 |
| Image Classification | Caltech101 | Accuracy | 78.0 | 228 |
| Image-to-Text Retrieval | MSCOCO | R@1 | 27.8 | 129 |
| Text-to-Image Retrieval | MSCOCO | R@1 | 25.1 | 123 |