Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

About

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, that is, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, long-tail concepts remain underrepresented in the training data and are not effectively captured during training. In this work, we introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. We first demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Then, we show the importance of dynamic sampling, which applies sampling at each epoch to improve cross-epoch data diversity and make upsampling practical. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts. Code is available at https://github.com/MingliangLiang3/DynamiCS.
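The idea of scaling cluster sizes and resampling at every epoch can be illustrated with a minimal sketch. This is not the authors' implementation: the power-law rule (target size proportional to cluster size raised to `alpha`), the function names `cluster_targets` and `sample_epoch`, and the parameters `budget` and `alpha` are all assumptions chosen for illustration. With 0 < `alpha` < 1, larger clusters still receive more samples than smaller ones, so the relative order is kept, while large clusters shrink and small clusters can grow.

```python
import numpy as np


def cluster_targets(sizes, budget, alpha=0.5):
    """Hypothetical scaling rule: target_i is proportional to sizes_i ** alpha.

    The exponent alpha in (0, 1) compresses the size distribution
    while keeping the relative order of clusters.
    """
    sizes = np.asarray(sizes, dtype=float)
    weights = sizes ** alpha
    targets = budget * weights / weights.sum()
    # Every cluster keeps at least one sample.
    return np.maximum(1, np.round(targets)).astype(int)


def sample_epoch(cluster_ids, budget, alpha, rng):
    """Draw one epoch's worth of sample indices (called anew each epoch)."""
    cluster_ids = np.asarray(cluster_ids)
    labels, sizes = np.unique(cluster_ids, return_counts=True)
    targets = cluster_targets(sizes, budget, alpha)
    chosen = []
    for label, size, target in zip(labels, sizes, targets):
        members = np.flatnonzero(cluster_ids == label)
        # Downsample without replacement; upsample with replacement.
        replace = bool(target > size)
        chosen.append(rng.choice(members, size=int(target), replace=replace))
    return np.concatenate(chosen)


# Toy data: a head cluster, a mid-sized cluster, and a long-tail cluster.
cluster_ids = np.array([0] * 1000 + [1] * 100 + [2] * 10)
rng = np.random.default_rng(0)
epoch_1 = sample_epoch(cluster_ids, budget=300, alpha=0.5, rng=rng)
epoch_2 = sample_epoch(cluster_ids, budget=300, alpha=0.5, rng=rng)
print(np.bincount(cluster_ids[epoch_1]))
```

Because `sample_epoch` is called once per epoch with a shared random generator, each epoch sees a different subset of the large clusters, and the repeated items in the upsampled small clusters vary from epoch to epoch.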

Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Image Classification	ImageNet-1k (val)	Top-1 Accuracy	72.6	871
Zero-shot Image Classification	ImageNet-1K	Top-1 Accuracy	71.3	125
Image-Text Retrieval	Flickr30K	R@1	73.9	40
Image-Text Retrieval	COCO	Recall@1	46.5	27
Image Classification	ImageNet and ObjectNet Robustness Suite	Average Accuracy	59.1	18
Zero-shot Classification	Let It Wag!	Top-1 Accuracy	50.2	9
Image Classification	Let It Wag!	Top-1 Accuracy	52	8
