
Sequential Modeling Enables Scalable Learning for Large Vision Models

About

We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without using any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
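The training objective described above can be illustrated with a minimal sketch: a "visual sentence" is just a flat sequence of discrete token ids (the paper tokenizes pixels with a VQGAN-style tokenizer; the tiny vocabulary and the uniform toy "model" below are purely illustrative, not the authors' code), and the loss is standard next-token cross-entropy, exactly as in language modeling.

```python
import math

VOCAB_SIZE = 8  # illustrative; a real visual tokenizer has a much larger codebook

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_loss(tokens, logits_fn):
    """Average cross-entropy of predicting tokens[t+1] from the prefix tokens[:t+1]."""
    total = 0.0
    for t in range(len(tokens) - 1):
        probs = softmax(logits_fn(tokens[: t + 1]))
        total += -math.log(probs[tokens[t + 1]])
    return total / (len(tokens) - 1)

# A uniform "model" assigns every token probability 1/VOCAB_SIZE,
# so its next-token loss is exactly log(VOCAB_SIZE).
uniform = lambda prefix: [0.0] * VOCAB_SIZE

visual_sentence = [3, 1, 4, 1, 5, 2, 6]  # toy token ids for one tokenized image
loss = next_token_loss(visual_sentence, uniform)
print(round(loss, 4))  # log(8) ≈ 2.0794
```

Any model that beats the uniform baseline (loss below log of the vocabulary size) has learned some structure of the visual token stream; in the paper this same objective is optimized at scale with a transformer over the full mixture of visual data.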

Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A. Efros • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | NYU Depth V2 | – | – | 177 |
| Surface Normal Prediction | NYU V2 | Mean Error | 23.433 | 100 |
| Video Generation | Physics-IQ | Phys. IQ Score | 18.02 | 45 |
| Foreground Segmentation | Pascal-5i (Split 1) | mIoU | 48.94 | 16 |
| Foreground Segmentation | Pascal-5i (Split 2) | mIoU | 51.29 | 13 |
| Foreground Segmentation | Pascal-5i (Split 3) | mIoU | 47.66 | 13 |
| Inpainting | ImageNet | FID | 4.05 | 8 |
| Colorization | ImageNet | MSE | 0.51 | 7 |
| Foreground Segmentation | Pascal-5i (Split 4) | mIoU | 50.82 | 4 |
| Single Object Detection | Pascal-5i (Split 1) | mIoU | 48.25 | 4 |

Showing 10 of 14 rows.
