
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

About

Transformers are remarkably versatile, suggesting the existence of generic inductive biases beneficial across modalities. In this work, we explore a new way to instil such biases in vision transformers (ViTs) through pretraining on procedurally generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1K, for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1K data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
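The abstract mentions that the warm-up data is produced by simple algorithms such as formal grammars. As a hedged illustration only (the paper's actual generators are not specified here), the sketch below expands a toy context-free grammar into token sequences with no visual or semantic content, the kind of abstract sequence data a warm-up phase could consume directly, bypassing the patch embedding:

```python
import random

# A toy context-free grammar: each non-terminal maps to a list of
# possible expansions. Purely illustrative; not the authors' grammar.
GRAMMAR = {
    "S": [["A", "B"], ["B", "A", "S"]],
    "A": [["a"], ["a", "A"]],
    "B": [["b"], ["b", "B"]],
}

def expand(symbol, rng, max_depth=8):
    """Recursively expand a grammar symbol into a list of terminal tokens."""
    if symbol not in GRAMMAR:   # terminal symbol: emit as-is
        return [symbol]
    if max_depth == 0:          # cut off runaway recursion
        return []
    rule = rng.choice(GRAMMAR[symbol])
    tokens = []
    for sym in rule:
        tokens.extend(expand(sym, rng, max_depth - 1))
    return tokens

def sample_sequences(n, seed=0):
    """Generate n procedurally generated token sequences from the grammar."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n)]

seqs = sample_sequences(4)
# Every sequence contains only the terminals "a" and "b":
# abstract structure, no relationship to natural or synthetic images.
```

Sequences like these would be tokenised and fed to the transformer body during the warm-up phase; standard image-based training with the usual patch embedding would then follow.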

Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Damien Teney, Anton van den Hengel • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1k (val) | - | 543 |
| Image Classification | Food-101 | Accuracy: 90.79 | 542 |
| Image Classification | Tiny-ImageNet | Accuracy: 87.93 | 266 |
| Image Classification | STL-10 | Top-1 Accuracy: 98.66 | 146 |
| Image Classification | CIFAR-100 | Accuracy: 89.2 | 117 |
