DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

About

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and introduce DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, requiring fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models. Code and Model: https://github.com/aim-uofa/Diception

Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen • 2025

Related benchmarks

Task	Dataset	Result
Depth Estimation	KITTI	--	184
Depth Estimation	ScanNet	AbsRel 0.075	133
Surface Normal Estimation	NYU V2	Mean Angular Error 18.3	96
Depth Estimation	DIODE	Delta-1 Accuracy 74.1	92
Affine-invariant depth estimation	ETH3D	AbsRel 5.3	71
Affine-invariant depth estimation	NYU V2	AbsRel 7.2	71
Affine-invariant depth estimation	ScanNet	AbsRel 7.5	69
Depth Estimation	ETH3D	AbsRel 0.053	25
Affine-invariant depth estimation	KITTI	AbsRel 7.5	25
Depth Estimation	ETH3D	Absolute Relative Error (AbsRel) 5	23

Showing 10 of 17 rows
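
For readers unfamiliar with the depth metrics in the table, the following is a minimal sketch of how AbsRel and Delta-1 accuracy are commonly computed, with a least-squares scale-and-shift alignment of the kind typically used for affine-invariant depth evaluation. It is not taken from the DICEPTION codebase. The function names (`abs_rel`, `delta1`, `align_scale_shift`), the flat-list representation of depth maps, and the `gt > 0` validity rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid ground truth (gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # A non-positive prediction cannot satisfy the ratio test, so it counts as a miss.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


def align_scale_shift(pred: Sequence[float], gt: Sequence[float]) -> List[float]:
    """Return s * pred + t, with s and t minimizing sum((s * pred + t - gt) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p          # closed-form least-squares slope
    t = mean_g - s * mean_p  # closed-form least-squares intercept
    return [s * p + t for p in pred]


if __name__ == "__main__":
    gt = [1.0, 2.0, 4.0]
    pred = [1.1, 2.0, 3.0]
    print("AbsRel:", abs_rel(pred, gt))
    print("Delta-1:", delta1(pred, gt))
    # A prediction that is an affine transform of gt aligns back to gt exactly.
    affine_pred = [0.5 * g + 3.0 for g in gt]
    print("AbsRel after alignment:", abs_rel(align_scale_shift(affine_pred, gt), gt))
```

The table appears to report AbsRel in two conventions: as a fraction (for example 0.075 on ScanNet) and as a percentage (7.5 on ScanNet). The sketch returns the fraction; multiplying by 100 gives the percentage form.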
