CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation
About
CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. However, its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. Existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, which limits the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework, CLIP-RD, that introduces two novel components: Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
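The repository does not include the loss code in this summary, so the following is a minimal, framework-free sketch of the general idea behind distilling cross-modal similarity distributions: the student's image-to-text and text-to-image similarity rows are matched to the teacher's via KL divergence, symmetrically in both directions. All function names (`relational_kd_loss`, `softmax`, `cosine`, `kl`) and the temperature value are hypothetical illustrations, not the paper's actual implementation.

```python
import math

def softmax(xs, tau=1.0):
    # Temperature-scaled softmax over a list of similarity logits.
    m = max(x / tau for x in xs)
    exps = [math.exp(x / tau - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def kl(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def relational_kd_loss(teacher_img, teacher_txt, student_img, student_txt, tau=0.07):
    """Hypothetical sketch: align the student's cross-modal similarity
    distributions with the teacher's, in both the image-to-text and
    text-to-image directions (the bidirectional symmetry XRD describes)."""
    loss = 0.0
    n = len(teacher_img)
    for i in range(n):
        # Image -> text similarity rows.
        t_row = softmax([cosine(teacher_img[i], t) for t in teacher_txt], tau)
        s_row = softmax([cosine(student_img[i], t) for t in student_txt], tau)
        loss += kl(t_row, s_row)
        # Text -> image similarity rows.
        t_col = softmax([cosine(teacher_txt[i], v) for v in teacher_img], tau)
        s_col = softmax([cosine(student_txt[i], v) for v in student_img], tau)
        loss += kl(t_col, s_col)
    return loss / (2 * n)
```

As a sanity check, a student whose embeddings exactly reproduce the teacher's similarity structure incurs zero loss, and any deviation yields a strictly positive penalty, since KL divergence is non-negative.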
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Accuracy | 42.1 | 1236 |
| Image Classification | ImageNet V2 | -- | -- | 611 |
| Image Classification | EuroSAT | Accuracy | 25.5 | 569 |
| Image Classification | Food101 | Accuracy | 43.2 | 457 |
| Image Classification | SUN397 | Accuracy | 52.0 | 431 |
| Image Classification | CIFAR-10 | Accuracy | 75.5 | 421 |
| Image Classification | RESISC45 | Accuracy | 32.6 | 349 |
| Image Classification | Caltech101 | Accuracy | 78.0 | 228 |
| Image-to-Text Retrieval | MSCOCO | R@1 | 27.8 | 129 |
| Text-to-Image Retrieval | MSCOCO | R@1 | 25.1 | 123 |