
Vision-Language Dataset Distillation

About

Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.
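The contrastive formulation mentioned above can be illustrated with a minimal sketch: a symmetric InfoNCE-style loss over paired image and text embeddings, where each image's matching caption is the positive and all other captions in the batch are negatives. This is an illustrative sketch under assumed shapes and a hypothetical `temperature` value, not the authors' implementation:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; pair i is (img_emb[i], txt_emb[i]).
    Returns the mean of image-to-text and text-to-image cross-entropy.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def ce(mat):
        # Cross-entropy where the correct class for row i is column i.
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (ce(logits) + ce(logits.T))
```

During distillation, a loss of this form is backpropagated into the synthetic image-text pairs themselves, so that training on them reproduces the parameter trajectory of training on the full dataset.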

Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky · 2023

Related benchmarks

Task                     | Dataset       | Metric | Result | Rank
Text-to-Video Retrieval  | DiDeMo (test) | R@1    | 29.4   | 399
Video-to-Text Retrieval  | DiDeMo (test) | R@1    | 29.5   | 111
Text Retrieval           | Flickr30K     | R@1    | 13.3   | 100
Text Retrieval           | COCO          | R@1    | 5.0    | 53
Image Retrieval          | Flickr30K     | R@5    | 20.2   | 49
Image Retrieval          | COCO          | R@1    | 2.5    | 47
Text Retrieval           | COCO (test)   | R@1    | 5.0    | 22
Image Retrieval          | COCO (test)   | R@1    | 2.5    | 22
Audio-to-Video Retrieval | DiDeMo (test) | R@1    | 18.8   | 19
Text-to-Audio Retrieval  | DiDeMo (test) | R@1    | 4.8    | 19

(Showing 10 of 15 rows)
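The R@K (Recall@K) numbers above are the percentage of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch of how this is computed from a query-gallery similarity matrix (illustrative only, not the benchmarks' official evaluation code; it assumes query i matches gallery item i):

```python
import numpy as np

def recall_at_k(similarity, k=1):
    """similarity: (Q, G) score matrix; ground truth for query i is
    gallery item i. Returns Recall@K as a percentage."""
    # Top-k gallery indices per query, highest score first.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = np.any(topk == np.arange(similarity.shape[0])[:, None], axis=1)
    return 100.0 * hits.mean()
```

For image-to-text retrieval the rows are image queries and the columns caption candidates; for text-to-image retrieval the matrix is transposed.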
