
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

About

Transformers are remarkably versatile, suggesting the existence of generic inductive biases beneficial across modalities. In this work, we explore a new way to instil such biases in vision transformers (ViTs) through pretraining on procedurally generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1K, for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1K data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
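The abstract mentions that the warm-up data is produced by simple algorithms such as formal grammars. As a hedged illustration only (the paper's actual generators are not specified here), the sketch below expands a toy context-free grammar into token sequences with no visual or semantic content, the kind of abstract sequence data a warm-up phase could consume directly, bypassing the patch embedding:

```python
import random

# A toy context-free grammar: each non-terminal maps to a list of
# possible expansions. Purely illustrative; not the authors' grammar.
GRAMMAR = {
    "S": [["A", "B"], ["B", "A", "S"]],
    "A": [["a"], ["a", "A"]],
    "B": [["b"], ["b", "B"]],
}

def expand(symbol, rng, max_depth=8):
    """Recursively expand a grammar symbol into a list of terminal tokens."""
    if symbol not in GRAMMAR:   # terminal symbol: emit as-is
        return [symbol]
    if max_depth == 0:          # cut off runaway recursion
        return []
    rule = rng.choice(GRAMMAR[symbol])
    tokens = []
    for sym in rule:
        tokens.extend(expand(sym, rng, max_depth - 1))
    return tokens

def sample_sequences(n, seed=0):
    """Generate n procedurally generated token sequences from the grammar."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n)]

seqs = sample_sequences(4)
# Every sequence contains only the terminals "a" and "b":
# abstract structure, no relationship to natural or synthetic images.
```

Sequences like these would be tokenised and fed to the transformer body during the warm-up phase; standard image-based training with the usual patch embedding would then follow.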

Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Damien Teney, Anton van den Hengel • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1k (val) | - | 543 |
| Image Classification | Food-101 | Accuracy: 90.79 | 542 |
| Image Classification | Tiny-ImageNet | Accuracy: 87.93 | 266 |
| Image Classification | STL-10 | Top-1 Accuracy: 98.66 | 146 |
| Image Classification | CIFAR-100 | Accuracy: 89.2 | 117 |
