Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

About

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS): 319.4 | 967 |
| Image Generation | ImageNet 256x256 | IS: 319.4 | 517 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | Inception Score (IS): 319.4 | 493 |
| Image Generation | ImageNet 256x256 (val) | FID: 1.78 | 399 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS: 319.4 | 367 |
| Image Generation | ImageNet 512x512 (val) | FID-50K: 1.91 | 219 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID: 1.78 | 203 |
| Image Reconstruction | ImageNet 256x256 | rFID: 0.9 | 202 |
| Class-conditional Image Generation | ImageNet | -- | 174 |
| Image Generation | ImageNet-1K 256x256 (val) | Inception Score: 319.4 | 144 |
Showing 10 of 49 rows