Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

About

Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric CT remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging.

Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages.

Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance.

Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation. Code at https://github.com/cosbidev/Text2CT.
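
To make the CLIP-style contrastive pretraining mentioned in the Methods more concrete, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of paired CT and report embeddings. It is not taken from the authors' repository: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration, and the actual encoders and training setup may differ.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(ct_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    The temperature default is an illustrative assumption, not a value from the paper.
    """
    ct = [l2_normalize(v) for v in ct_embs]
    tx = [l2_normalize(v) for v in text_embs]
    n = len(ct)
    # Cosine similarity between every CT embedding and every text embedding.
    logits = [
        [sum(a * b for a, b in zip(ct[i], tx[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # CT -> text direction: each row should put its mass on the diagonal entry.
    loss_c2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> CT direction: same objective applied to the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2c = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_c2t + loss_t2c)


# Toy check: correctly paired embeddings should give a lower loss than swapped ones.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a setup like the one described, the embeddings learned under such an objective would then serve as the conditioning signal for the latent diffusion model; the sketch covers only the alignment loss, not the VAE or the denoising network.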

Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Alignment | CT-RATE (test) | CLIP-Score | 25.8 | 10 |
| Image-to-Image Alignment | CT-RATE (test) | CLIP-Score | 72.37 | 8 |
| Text-to-CT Generation | CT-RATE (test) | FID 2.5D (Axial) | 0.5 | 8 |
| Clinical Consistency Evaluation | CT-RATE (test) | AUC (Macro) | 74.5 | 7 |
| 3D Image Generation | CT-RATE Vessel window (test) | FID (XY) | 8.91 | 4 |
| 3D Image Generation | CT-RATE Soft Tissue window (test) | FID (XY) | 9.12 | 4 |
| 3D Image Generation | CT-RATE Lung window (test) | FID (XY) | 11.51 | 4 |
| 3D Image Generation | CT-RATE Bone window (test) | FID (XY) | 8.67 | 4 |
