Sequential Modeling Enables Scalable Learning for Large Vision Models

About

We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without using any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions, without requiring any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.

Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A. Efros • 2023
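The training objective described in the abstract can be sketched in code. This is a minimal illustration, not the paper's actual tokenizer or architecture: a "visual sentence" is assumed to already be a 1-D sequence of discrete visual token ids (e.g. from a VQ-style tokenizer), and the loss is the same next-token cross-entropy used in language modeling. The toy vocabulary size and the `next_token_xent` helper are hypothetical.

```python
import numpy as np

VOCAB_SIZE = 8  # toy codebook size; the real visual tokenizer is far larger

def next_token_xent(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Cross-entropy of predicting tokens[1:] from the logits at positions [:-1].

    logits: (T, VOCAB_SIZE) model outputs, one row per input position
    tokens: (T,) a "visual sentence" as integer token ids
    """
    # Shift by one: the model at position t predicts the token at position t+1.
    logits, targets = logits[:-1], tokens[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the target tokens.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Toy example: a visual sentence of 5 tokens and random "model" logits.
rng = np.random.default_rng(0)
tokens = np.array([3, 1, 4, 1, 5])
logits = rng.normal(size=(5, VOCAB_SIZE))
loss = next_token_xent(logits, tokens)
```

A model that has learned nothing (uniform logits) incurs a loss of log(VOCAB_SIZE) per token; training drives the loss below that baseline by exploiting the pixel-level structure of the sequences.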

Related benchmarks

Task                               Dataset                           Metric            Result   Rank
Depth Estimation                   NYU Depth V2                      -                 -        209
Surface Normal Prediction          NYU V2                            Mean Error        23.433   118
Interactive Object Removal         RORD                              LPIPS             25.26    45
Video Generation                   Physics-IQ                        Phys. IQ Score    18.02    45
Foreground Segmentation            Pascal-5i (3)                     mIoU              47.66    25
Video Generation                   Kinetics-600                      FVD               356.5    22
Foreground Segmentation            Pascal-5i (1)                     mIoU              48.94    16
Foreground Segmentation            Pascal-5i (2)                     mIoU              51.29    13
Interactive Semantic Segmentation  PASCAL VOC 2012                   Accuracy (Bbox)   8.73     10
Interactive Super-resolution       ADE20K Bounding Box Interaction   LPIPS             62.32    9

(Showing 10 of 18 rows)
