Sequential Modeling Enables Scalable Learning for Large Vision Models

About

We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.

Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A. Efros • 2023
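
The training objective described in the abstract (next-token prediction with a cross-entropy loss over "visual sentences") can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names, the toy vocabulary size, and the assumption that each image has already been converted to a list of integer token ids by some tokenizer are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy vocabulary size; the source does not state the real one.
VOCAB_SIZE = 16


def build_visual_sentence(image_token_lists):
    """Concatenate per-image token ids into one flat sequence.

    Assumes each image (or annotation map) was already tokenized into
    integer ids; the tokenizer itself is outside this sketch.
    """
    sentence = []
    for tokens in image_token_lists:
        sentence.extend(int(t) for t in tokens)
    return sentence


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def next_token_cross_entropy(logits, tokens):
    """Mean cross-entropy for predicting tokens[t + 1] from logits[t].

    logits: one list of VOCAB_SIZE scores per sequence position.
    tokens: the visual sentence as a list of integer ids.
    """
    if len(logits) != len(tokens) or len(tokens) < 2:
        raise ValueError("need one logit row per token and at least 2 tokens")
    total = 0.0
    for t in range(len(tokens) - 1):
        # The last position has no next token, so it is not scored.
        total -= log_softmax(logits[t])[tokens[t + 1]]
    return total / (len(tokens) - 1)


# Toy usage: two "images" of four tokens each form one visual sentence.
sentence = build_visual_sentence([[3, 7, 7, 1], [2, 2, 9, 15]])
uniform_logits = [[0.0] * VOCAB_SIZE for _ in sentence]
loss = next_token_cross_entropy(uniform_logits, sentence)
```

With uniform logits, the model assigns probability 1/VOCAB_SIZE to every token, so `loss` equals `math.log(VOCAB_SIZE)`; a trained model would drive this value lower by placing more mass on the token that actually comes next.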

Related benchmarks

Task	Dataset	Result	
Depth Estimation	NYU Depth V2	--	209
Surface Normal Prediction	NYU V2	Mean Error: 23.433	123
Video Generation	Physics-IQ	Phys. IQ Score: 18.02	63
Interactive Object Removal	RORD	LPIPS: 25.26	45
Foreground segmentation	Pascal-5i (3)	mIoU: 47.66	25
Video Generation	Kinetics-600	FVD: 356.5	22
Foreground segmentation	Pascal-5i (1)	mIoU: 48.94	16
Foreground segmentation	Pascal-5i (2)	mIoU: 51.29	13
Interactive Semantic Segmentation	PASCAL VOC 2012	Accuracy (Bbox): 8.73	10
Interactive Super-resolution	ADE20K Bounding Box Interaction	LPIPS: 62.32	9

Showing 10 of 18 rows
