Low-Rank Similarity Mining for Multimodal Dataset Distillation
About
Although dataset distillation has developed rapidly in recent years, distilling multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization and instead call for greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, which concurrently distills a ground-truth similarity matrix alongside the image-text pairs and leverages low-rank factorization for efficiency and scalability. The proposed approach brings significant improvement to existing algorithms, marking an important contribution to the field of vision-language dataset distillation. We advocate adopting LoRS as a foundational synthetic-data setup for image-text dataset distillation. Our code is available at https://github.com/silicx/LoRS_Distill.
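The core efficiency idea can be sketched with a few lines of NumPy. This is a hedged illustration, not the authors' implementation: the names (`U`, `V`, `diag_weight`) and the exact parameterization `S = I + U Vᵀ` are assumptions; the point is that a learnable low-rank factorization stores O(nr) parameters instead of a dense O(n²) similarity matrix over the n synthetic pairs.

```python
import numpy as np

def low_rank_similarity(U, V, diag_weight=1.0):
    """Build a soft ground-truth similarity target from low-rank factors.

    Illustrative sketch (names and parameterization are assumptions):
    the scaled identity keeps each pair's own image-text match dominant,
    while the learnable term U @ V.T adds cross-pair similarity, at a
    storage cost of 2*n*r values instead of n*n.
    """
    n, r = U.shape
    assert V.shape == (n, r)
    return diag_weight * np.eye(n) + U @ V.T

# Toy usage: 4 synthetic image-text pairs, rank-2 factors.
rng = np.random.default_rng(0)
n, r = 4, 2
U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((n, r))
S = low_rank_similarity(U, V)
print(S.shape)  # (4, 4)
```

In training, `U` and `V` would be optimized jointly with the synthetic images and captions, and `S` would replace the hard identity target in the contrastive loss.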
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1: 10.3 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1: 14.9 | 370 |
| Image Retrieval | Flickr30k (test) | R@1: 5.3 | 195 |
| Image-to-Text Retrieval | MS-COCO (test) | R@1: 5.7 | 99 |
| Text Retrieval | Flickr30k (test) | R@1: 5.7 | 89 |
| Text-to-Image Retrieval | MS-COCO (test) | R@1: 3 | 66 |