Structure over Pixels: Learning Variable-Length Visual Programs

About

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec • 2026

Related benchmarks

Task                   Dataset                  Result       Rank
Semantic segmentation  ADE20K (val)             mIoU 15.4    3069
Semantic segmentation  ADE20K                   mIoU 15.4    559
Semantic segmentation  Cityscapes (val)         mIoU 30.4    527
Semantic segmentation  Cityscapes               mIoU 30.4    494
Depth Estimation       NYU v2 (test)            --           435
Semantic segmentation  Pascal VOC               mIoU 0.675   280
Depth Estimation       NYU V2                   RMSE 0.669   167
Semantic segmentation  PASCAL VOC 2012 (val)    mIoU 67.5    166
Semantic segmentation  COCO Stuff-27 (val)      mIoU 39.2    92
Semantic segmentation  COCO-Stuff 27            mIoU 39.2    67
Showing 10 of 15 rows
