VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

About

VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components such as diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: a unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; and the finding that autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
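
To make the idea of a fully token-based framework concrete, the following is a minimal sketch of how text tokens and discrete visual tokens could share one vocabulary and one next-token prediction objective. It is not VILA-U's implementation: the vocabulary sizes, the `IMG_START`/`IMG_END` marker ids, and the helper names `to_unified`, `build_sequence`, and `next_token_pairs` are all assumptions made for illustration.

```python
from typing import List, Tuple

# All sizes and ids below are hypothetical, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000                        # assumed text vocabulary size
CODEBOOK_SIZE = 256                           # assumed visual codebook size
IMG_START = TEXT_VOCAB_SIZE + CODEBOOK_SIZE   # hypothetical image-start marker
IMG_END = IMG_START + 1                       # hypothetical image-end marker
UNIFIED_VOCAB_SIZE = IMG_END + 1


def to_unified(visual_codes: List[int]) -> List[int]:
    """Shift visual codebook indices past the text ids so both share one vocabulary."""
    for code in visual_codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"visual code {code} is outside the codebook")
    return [TEXT_VOCAB_SIZE + code for code in visual_codes]


def build_sequence(text_ids: List[int], visual_codes: List[int]) -> List[int]:
    """Concatenate text tokens and delimited visual tokens into one sequence."""
    return list(text_ids) + [IMG_START] + to_unified(visual_codes) + [IMG_END]


def next_token_pairs(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where each target is the token that follows its input."""
    return sequence[:-1], sequence[1:]


if __name__ == "__main__":
    seq = build_sequence(text_ids=[5, 7], visual_codes=[0, 255])
    inputs, targets = next_token_pairs(seq)
    print(seq)
    print(inputs)
    print(targets)
```

Under these assumptions, a single autoregressive model trained on the `(inputs, targets)` pairs predicts text and visual tokens with the same objective, which is why no separate generation module such as a diffusion model appears in the sketch.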

Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 85.8	2019
Visual Question Answering	TextVQA	Accuracy 60.8	1453
Visual Question Answering	GQA	Accuracy 60.8	1425
Text-based Visual Question Answering	TextVQA	Accuracy 60.8	962
Multimodal Understanding	MMBench	--	847
Multimodal Evaluation	MME	--	727
Multimodal Understanding	MM-Vet	MM-Vet Score 33.5	631
Multimodal Understanding	SEED-Bench	Accuracy 59	516
Text-to-Image Generation	GenEval	GenEval Score 42	442
Multimodal Understanding	MMMU	Accuracy 33.5	437
