DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

About

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models. Code and Model: https://github.com/aim-uofa/Diception
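The abstract mentions classifier-free guidance without giving a formula. As a rough illustration only, the sketch below shows the standard classifier-free guidance combination of a conditional and an unconditional model prediction. The function name, the plain-list representation, and the example scale are assumptions made for this sketch; the guidance scale and exact formulation used by DICEPTION are not stated on this page.

```python
from typing import List, Sequence


def classifier_free_guidance(
    cond: Sequence[float], uncond: Sequence[float], scale: float
) -> List[float]:
    """Standard CFG rule: uncond + scale * (cond - uncond), element-wise.

    `cond` and `uncond` stand in for a model's predictions with and without
    the conditioning signal (hypothetical inputs for illustration).
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# scale = 1.0 reproduces the conditional prediction; scale = 0.0 the unconditional one.
# Values above 1.0 push the result further along the (cond - uncond) direction.
guided = classifier_free_guidance([1.0, 2.0], [0.0, 1.0], scale=1.5)
print(guided)
```

A scale only slightly above 1.0 would correspond to the "subtle" use the abstract describes, although the specific value is an assumption here.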

Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen • 2025

Related benchmarks

Task                                  Dataset                       Metric              Result  Rank
Depth Estimation                      ScanNet                       AbsRel              0.075   94
Depth Estimation                      KITTI                         AbsRel              0.075   92
Depth Estimation                      DIODE                         Delta-1 Accuracy    74.1    62
Depth Estimation                      NYU                           AbsRel              0.072   20
Depth Estimation                      ETH3D                         AbsRel              0.053   19
Transparent object normal estimation  ClearPose Real-World (test)   Mean Angular Error  31      13
Transparent object normal estimation  TransNormal Synthetic (test)  Mean Angular Error  7.1     13
Transparent object normal estimation  ClearGrasp Synthetic (test)   Mean Angular Error  29.5    13