DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

About

Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is only weakly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and it leaves the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is $R^*=3$. The full empirical study (order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation) fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
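
For readers unfamiliar with the two losses the abstract contrasts, the following is a minimal NumPy sketch of a symmetric InfoNCE loss and of the textbook first-order Plackett-Luce negative log-likelihood. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the toy inputs are hypothetical, and the high-order transition terms, DINOv3 distillation, and fusion module described above are not modelled here.

```python
import numpy as np

def logsumexp(x, axis=-1):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def info_nce(sim, temperature=0.07):
    # Symmetric InfoNCE over a square image-text similarity matrix.
    # Matched pairs are assumed to lie on the diagonal; temperature is illustrative.
    logits = np.asarray(sim, dtype=float) / temperature
    diag = np.diag(logits)
    loss_i2t = np.mean(logsumexp(logits, axis=1) - diag)  # image -> text
    loss_t2i = np.mean(logsumexp(logits, axis=0) - diag)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def plackett_luce_nll(scores, ranking):
    # First-order Plackett-Luce: items are drawn one at a time, each with
    # probability softmax(utility) over the items not yet drawn.
    # `ranking` lists item indices from best to worst.
    s = np.asarray(scores, dtype=float)[np.asarray(ranking)]
    nll = 0.0
    for k in range(len(s)):
        nll += logsumexp(s[k:]) - s[k]
    return float(nll)

# Toy check: a ranking that agrees with the utilities is more likely
# (lower NLL) than its reverse.
scores = [3.0, 1.0, 0.0]
nll_sorted = plackett_luce_nll(scores, [0, 1, 2])
nll_reversed = plackett_luce_nll(scores, [2, 1, 0])
```

In this sketch the Plackett-Luce term scores an entire ordering of the candidates, whereas the InfoNCE term only contrasts each matched pair against the pooled remainder of its row and column.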

Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10	--	973
Image Classification	DTD	--	610
Image Classification	FGVC-Aircraft (test)	--	322
Image Classification	Stanford Cars (test)	--	320
Image Classification	CUB-200-2011 (test)	Top-1 Acc 40.8	316
Image Classification	Oxford Flowers-102 (test)	Top-1 Accuracy 76.2	221
Image Classification	SVHN	Top-1 Accuracy 50	209
Image Classification	Food	--	152
Image Classification	GTSRB	Top-1 Accuracy 62.1	115
Image Classification	STL10	Accuracy 82.4	108

Showing 10 of 21 rows
