Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation

About

Variational autoencoders (VAEs) compress high-resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and the resulting models often degrade under heterogeneous scanners, protocols, and diseases. This paper takes a step toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic and lung tumors. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with a 36.2% higher CT-CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate that Foundation VAEs are a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.

Qi Chen, Shuhan Ding, Yu Gu, Nan Liu, Jiang Bian, Alan Yuille, Zongwei Zhou, Jingjing Fu • 2026
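
The abstract describes passing CT volumes through a frozen VAE that was pretrained on natural images and videos. The sketch below illustrates one plausible way to adapt the data for such a model; it is not the authors' pipeline. The intensity window (HU_MIN, HU_MAX), the three-channel replication, and the function names ct_to_vae_input, vae_output_to_ct, and reconstruct are all assumptions made for illustration, and the encoder and decoder are identity stand-ins rather than a real pretrained model.

```python
import numpy as np

# Assumed intensity window in Hounsfield units; not taken from the paper.
HU_MIN, HU_MAX = -1000.0, 1000.0


def ct_to_vae_input(volume_hu):
    """Map a (D, H, W) CT volume in HU to a (3, D, H, W) array in [-1, 1]."""
    clipped = np.clip(volume_hu.astype(np.float32), HU_MIN, HU_MAX)
    scaled = 2.0 * (clipped - HU_MIN) / (HU_MAX - HU_MIN) - 1.0
    # Replicate the single CT channel to three so slices resemble RGB frames.
    return np.repeat(scaled[None, ...], 3, axis=0)


def vae_output_to_ct(recon):
    """Average the three channels and map [-1, 1] back to HU."""
    mean = recon.mean(axis=0)
    return (mean + 1.0) / 2.0 * (HU_MAX - HU_MIN) + HU_MIN


def reconstruct(volume_hu, encode, decode):
    """Round-trip a volume through a frozen encoder/decoder pair (callables)."""
    return vae_output_to_ct(decode(encode(ct_to_vae_input(volume_hu))))


# Identity stand-ins for the frozen VAE, used only to show the data flow.
volume = np.random.default_rng(0).uniform(-1200, 1500, size=(4, 8, 8))
recon = reconstruct(volume, encode=lambda x: x, decode=lambda x: x)
```

In practice the two lambdas would be replaced by the encode and decode calls of whichever pretrained VAE is used, with its weights left frozen; the repository linked in the abstract is the place to check the actual preprocessing.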

Related benchmarks

Task	Dataset	Result
Multi-label Abnormality Analysis	CT-RATE (test)	--	24
3D CT Segmentation	Task06 Lung	--	10
3D CT Segmentation	Task07 Pancreas	--	10
3D CT Segmentation	LiTS	--	10
3D CT Segmentation	KiTS19	--	10
3D CT Reconstruction	Task06 Lung	--	9
3D CT Reconstruction	Task07 Pancreas	--	9
3D CT Reconstruction	LiTS	--	9
3D CT Reconstruction	KiTS19	--	9
3D CT Generation	CT-RATE ReXGroundingCT Normal (val)	FVD (CT-CLIP): 0.3035	4

Showing 10 of 12 rows
