LaVin-DiT: Large Vision Diffusion Transformer

About

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models are available.
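The in-context mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: `build_condition`, `toy_denoiser`, and `sample` are hypothetical names, the latents are random NumPy arrays rather than outputs of the spatial-temporal VAE, and the learned diffusion transformer is replaced by a hand-written stand-in that infers the "task" as the mean offset between context inputs and targets. The update rule is a simple interpolation sampler chosen for clarity, not the schedule used in the paper.

```python
import numpy as np


def build_condition(context_pairs, query):
    """Concatenate latent tokens of each (input, target) pair, then the query.

    Assumption: every latent is an array of shape (num_tokens, dim).
    """
    tokens = []
    for source, target in context_pairs:
        tokens.extend([source, target])
    tokens.append(query)
    return np.concatenate(tokens, axis=0)


def toy_denoiser(x, context_pairs, query, t):
    """Hypothetical stand-in for the learned diffusion transformer.

    It ignores the noisy latent x and the time t, and predicts the clean
    target by applying the mean (target - input) offset of the context
    pairs to the query. A real model would learn this mapping.
    """
    offset = np.mean([tgt - src for src, tgt in context_pairs], axis=0)
    return query + offset


def sample(context_pairs, query, steps=10, seed=0):
    """Start from Gaussian noise and move step by step toward the prediction."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(query.shape)
    for i in range(steps):
        t = 1.0 - i / steps  # goes from 1 (pure noise) toward 0
        pred = toy_denoiser(x, context_pairs, query, t)
        # Close a 1/(remaining steps) fraction of the gap to the prediction.
        x = x + (pred - x) / (steps - i)
    return x


# Toy task: the "target" of every input is the input shifted by 2.
rng = np.random.default_rng(1)
sources = [rng.standard_normal((4, 8)) for _ in range(3)]
context = [(s, s + 2.0) for s in sources]
query = rng.standard_normal((4, 8))

condition = build_condition(context, query)  # 3 pairs * 2 * 4 tokens + 4 query tokens
output = sample(context, query)
```

Swapping the context pairs for a different toy task (for example, a different shift) changes the output without changing any code, which is the property the abstract attributes to in-context learning: the task is specified by the examples rather than by fine-tuning.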

Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu • 2024

Related benchmarks

Task                        Dataset               Result              Rank
Depth Estimation            NYU Depth V2          --                  177
Surface Normal Prediction   NYU V2                Mean Error 15.901   100
Foreground segmentation     Pascal-5i (1)         mIoU 67.87          16
Foreground segmentation     Pascal-5i (2)         mIoU 75.8           13
Foreground segmentation     Pascal-5i (3)         mIoU 66.98          13
Inpainting                  ImageNet              FID 1.65            8
Colorization                ImageNet              MSE 0.24            7
Foreground segmentation     Pascal-5i Split 4     mIoU 66.9           4
Single Object Detection     Pascal-5i (Split 1)   mIoU 67.85          4
Single Object Detection     Pascal-5i (Split 2)   mIoU 69.32          4
Showing 10 of 12 rows
