UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels
About
Contrastive objectives power state-of-the-art multimodal models, but their training remains slow, relying on long stochastic optimization. We propose a Unified Framework for Efficient Contrastive Alignment via Kernels (UniCon), which spans linear and nonlinear encoders as well as one-to-one and many-to-many alignments. At its core, UniCon introduces the contrastive similarity weight matrix $S(\gamma)$, which enables closed-form global solutions that provably replace minibatch back-propagation with exact updates. Through the lens of reproducing kernel Hilbert spaces (RKHS), UniCon provides a kernelized perspective that unifies contrastive alignment and reveals its connection to spectral methods. To validate the theory, we conduct experiments on synthetic, unimodal, multimodal, and zero-shot tasks, demonstrating that UniCon achieves substantial efficiency gains while preserving generality and strong empirical performance.
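To make the link between a closed-form solution and spectral methods concrete, here is a minimal sketch for the linear, one-to-one case only. It is not the paper's algorithm. The abstract does not define $S(\gamma)$, so `placeholder_weights` is a hypothetical stand-in, and the function names, the trace objective, and the orthonormality constraint are all assumptions made for illustration. The sketch shows only the generic fact that, once a weight matrix $S$ is fixed, the linear maps maximizing $\mathrm{tr}(W_x^\top X^\top S\, Y W_y)$ under orthonormal columns come from a single SVD, with no gradient steps.

```python
import numpy as np

def placeholder_weights(n, gamma):
    # Hypothetical stand-in, NOT the paper's S(gamma): weight 1 on matched pairs
    # (the diagonal) minus a uniform gamma/n penalty over all pairs.
    return np.eye(n) - (gamma / n) * np.ones((n, n))

def spectral_linear_alignment(X, Y, S, rank):
    """Return orthonormal maps (Wx, Wy) maximizing tr(Wx^T X^T S Y Wy).

    The maximizers are the top-`rank` left and right singular vectors of the
    weighted cross-covariance C = X^T S Y.
    """
    C = X.T @ S @ Y                                   # shape (dx, dy)
    U, sing, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :rank], Vt[:rank].T, sing[:rank]

# Synthetic paired data: both views share a latent factor Z.
rng = np.random.default_rng(0)
n, dx, dy, r = 200, 8, 6, 3
Z = rng.normal(size=(n, r))
X = Z @ rng.normal(size=(r, dx)) + 0.01 * rng.normal(size=(n, dx))
Y = Z @ rng.normal(size=(r, dy)) + 0.01 * rng.normal(size=(n, dy))

S = placeholder_weights(n, gamma=1.0)
Wx, Wy, sing = spectral_linear_alignment(X, Y, S, rank=r)

# The achieved objective equals the sum of the retained singular values.
aligned_score = np.trace(Wx.T @ X.T @ S @ Y @ Wy)
```

With this placeholder and `gamma=1.0`, $S$ is the centering matrix, so `C` reduces to the (unnormalized) cross-covariance of the centered views. The kernelized and many-to-many settings named in the abstract are outside the scope of this sketch.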
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | R@1 42.1 | 559 |
| Text-to-Image Retrieval | Flickr30k (test) | -- | 525 |
| Image-to-Text Retrieval | Flickr30k (test) | -- | 472 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 29.2 | 312 |
| Image-to-Text Retrieval | MSCOCO 5K (test) | R@1 32.9 | 68 |
| Text-to-Image Retrieval | MSCOCO (5K) | R@1 29.2 | 51 |
| Audio-to-Text Retrieval | Clotho | R@1 3.35 | 49 |
| Image-Text Retrieval | Flickr30k (test) | -- | 45 |
| Image-to-Text Retrieval | MSCOCO (5K) | R@1 32.9 | 42 |
| Text-to-Audio Retrieval | Clotho | R@1 0.0249 | 31 |
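
The R@1 entries in the table are Recall@1 scores for cross-modal retrieval. The sketch below shows how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes a one-to-one ground truth (query `i` matches candidate `i`) and reports a percentage. The function name `recall_at_k` and the toy matrix are illustrative. The evaluation protocol behind the table is not stated here and may differ, for example when a dataset pairs several captions with each image.

```python
import numpy as np

def recall_at_k(similarity, k=1):
    """Percentage of queries whose true match ranks in the top k.

    similarity[i, j] is the score of query i against candidate j;
    the true match of query i is assumed to be candidate i.
    """
    sim = np.asarray(similarity, dtype=float)
    # Indices of the k highest-scoring candidates for every query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    hits = (topk == targets).any(axis=1)
    return 100.0 * hits.mean()

# Toy example: 3 queries, 3 candidates.
sim = np.array([
    [0.9, 0.1, 0.0],   # query 0 ranks its match first
    [0.2, 0.3, 0.8],   # query 1 ranks its match second
    [0.1, 0.2, 0.7],   # query 2 ranks its match first
])
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

In this toy example two of three queries retrieve their match at rank 1, and all three do within the top 2.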