
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

About

In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called the model chain. In this chain, only the smallest model undergoes standard individual pre-training; the other models are trained efficiently through sequential inverse knowledge transfer from their smaller predecessors, jointly reusing knowledge in both the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, which we extensively validate across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models actually results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X to 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.
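The model-chain procedure described above can be sketched in a few lines. This is a hypothetical toy sketch, not the authors' released code: the function names (`pretrain_from_scratch`, `expand_params`, `refine_with_feature_transfer`) and the use of plain weight lists as stand-in "models" are illustrative assumptions, standing in for the parameter-space and feature-space knowledge reuse the abstract names.

```python
# Toy sketch of the CoM-PT "model chain": only the smallest model is
# pre-trained from scratch; each larger model is initialized from its
# smaller predecessor (parameter-space reuse) and refined against the
# predecessor (feature-space reuse). Names and toy models are hypothetical.

import random

def pretrain_from_scratch(width):
    """Standard individual pre-training (stand-in: random weights)."""
    rng = random.Random(0)
    return [rng.uniform(-1, 1) for _ in range(width)]

def expand_params(small, width):
    """Parameter-space reuse: grow predecessor weights to the new width."""
    return list(small) + [0.0] * (width - len(small))

def refine_with_feature_transfer(weights, teacher):
    """Feature-space reuse (toy stand-in): blend the overlapping entries
    toward the predecessor instead of matching real feature maps."""
    for i in range(len(teacher)):
        weights[i] = 0.5 * weights[i] + 0.5 * teacher[i]
    return weights

def train_model_chain(widths):
    """Train a model family in ascending size order (the model chain)."""
    assert widths == sorted(widths), "chain must run small -> large"
    chain = [pretrain_from_scratch(widths[0])]  # only this one from scratch
    for w in widths[1:]:
        init = expand_params(chain[-1], w)
        chain.append(refine_with_feature_transfer(init, chain[-1]))
    return chain

chain = train_model_chain([4, 8, 16])
print([len(m) for m in chain])  # → [4, 8, 16]
```

The point of the sketch is the asymmetry: full pre-training cost is paid once, for the smallest model, while every successor starts from transferred knowledge, which is where the reported family-level acceleration would come from.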

Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao• 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Open Vocabulary Semantic Segmentation | ADE-847 | mIoU | 4.73 | 63 |
| Open Vocabulary Semantic Segmentation | PC-459 | mIoU | 7 | 47 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy | 49.03 | 24 |
| Image-Text Retrieval | COCO 1.0 (test) | R@1 | 48.79 | 24 |
| Image Classification | VTAB+ 1.0 (test) | Top-1 Accuracy | 38.89 | 24 |
| Open-Vocabulary Segmentation | Pascal VOC | mIoU | 79.52 | 16 |
| Open Vocabulary Semantic Segmentation | ADE-150 | mIoU | 20.14 | 15 |
| General VQA | POPE | Accuracy | 75.15 | 14 |
| General Vision-Language | VQA v2 | VQA v2 Accuracy | 56.07 | 9 |
| Open Vocabulary Semantic Segmentation | PC-59 | mIoU | 42.17 | 4 |

Showing 10 of 12 rows.
