Structure over Pixels: Learning Variable-Length Visual Programs
About
Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.
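To make the idea of an "active prefix" concrete, the sketch below shows one plausible way a predicted length could be applied to a fixed-size code sequence: keep the first k codes and pad the rest. This is an illustration only, not STROP's implementation. The abstract does not say how the length head's output is parameterized, so the function `active_prefix`, the `PAD` id, the `min_len` floor, and the assumption that the head emits a fraction in [0, 1] of the maximum length are all hypothetical.

```python
from typing import List, Sequence

PAD = -1  # hypothetical id marking inactive (masked-out) positions


def active_prefix(codes: Sequence[int], length_fraction: float,
                  min_len: int = 1) -> List[int]:
    """Keep the first k codes and pad the remainder.

    `length_fraction` stands in for a length head's output, assumed here
    to be a fraction of the maximum sequence length. It is clamped to
    [0, 1] and converted to an integer prefix length k.
    """
    max_len = len(codes)
    frac = min(max(length_fraction, 0.0), 1.0)
    # Round to the nearest integer, then clamp to [min_len, max_len].
    k = max(min_len, min(max_len, round(frac * max_len)))
    return list(codes[:k]) + [PAD] * (max_len - k)


if __name__ == "__main__":
    codes = [5, 3, 9, 1, 7, 2, 8, 4]
    # A "simple" image might get a short program, a "complex" one a long program.
    print(active_prefix(codes, 0.25))
    print(active_prefix(codes, 1.0))
```

Because the prefix is always taken from the start of the ordered sequence, a shorter program is a truncation of a longer one, which is what lets a single scalar prediction select the program length in one forward pass.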
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 15.4 | 3069 |
| Semantic segmentation | ADE20K | mIoU 15.4 | 559 |
| Semantic segmentation | Cityscapes (val) | mIoU 30.4 | 527 |
| Semantic segmentation | Cityscapes | mIoU 30.4 | 494 |
| Depth estimation | NYU v2 (test) | -- | 435 |
| Semantic segmentation | Pascal VOC | mIoU 0.675 | 280 |
| Depth estimation | NYU V2 | RMSE 0.669 | 167 |
| Semantic segmentation | PASCAL VOC 2012 (val) | mIoU 67.5 | 166 |
| Semantic segmentation | COCO Stuff-27 (val) | mIoU 39.2 | 92 |
| Semantic segmentation | COCO-Stuff 27 | mIoU 39.2 | 67 |