Structure over Pixels: Learning Variable-Length Visual Programs

About

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.
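
As a rough illustration of what a rate-distortion probe over prefix lengths could look like, the sketch below picks the prefix length that minimizes distortion plus a per-token rate penalty, then builds the mask that keeps only that active prefix. This is not the authors' procedure: the function names, the linear rate penalty, and the distortion values are all hypothetical, chosen only to show how a more complex scene can end up with a longer program.

```python
def prefix_mask(max_len, active_len):
    """Return a 0/1 mask keeping the first `active_len` of `max_len` token slots."""
    if not 0 <= active_len <= max_len:
        raise ValueError("active_len must lie in [0, max_len]")
    return [1] * active_len + [0] * (max_len - active_len)


def probe_target_length(distortions, rate_weight):
    """Pick the 1-based prefix length k minimizing distortions[k-1] + rate_weight * k.

    `distortions[k-1]` is assumed to be the feature-space error measured when
    only the first k tokens are kept (hypothetical probe output).
    """
    if not distortions:
        raise ValueError("distortions must be non-empty")
    costs = [d + rate_weight * k for k, d in enumerate(distortions, start=1)]
    return min(range(1, len(costs) + 1), key=lambda k: costs[k - 1])


# Made-up distortion curves, for illustration only (not results from the paper).
simple_scene = [0.50, 0.20, 0.18, 0.17, 0.17, 0.17]   # saturates after 2 tokens
complex_scene = [0.90, 0.70, 0.50, 0.35, 0.22, 0.12]  # keeps improving

simple_len = probe_target_length(simple_scene, rate_weight=0.05)
complex_len = probe_target_length(complex_scene, rate_weight=0.05)
simple_mask = prefix_mask(len(simple_scene), simple_len)
```

Under these assumed curves, the simple scene settles on a short prefix while the complex scene uses every available slot. A length head in the spirit of the abstract would be trained to predict such a target directly, so that no search over prefix lengths is needed at inference time.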

Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 15.4	3089
Semantic segmentation	ADE20K	mIoU 15.4	699
Semantic segmentation	Cityscapes (val)	mIoU 30.4	552
Semantic segmentation	Cityscapes	mIoU 30.4	526
Depth Estimation	NYU v2 (test)	--	438
Semantic segmentation	Pascal VOC	mIoU 0.675	295
Depth Estimation	NYU V2	RMSE 0.669	207
Semantic segmentation	PASCAL VOC 2012 (val)	mIoU 67.5	166
Semantic segmentation	COCO Stuff-27 (val)	mIoU 39.2	92
Semantic segmentation	COCO-Stuff 27	mIoU 39.2	67

Showing 10 of 15 rows
