Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

About

In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
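The abstract names GLCM Score and Compression Ratio as fine-detail metrics but does not give their formulas here. The sketch below is only an illustration of the underlying ideas, not the paper's definitions: it builds a standard gray-level co-occurrence matrix (GLCM), derives two common texture statistics from it (contrast and entropy), and estimates a compression ratio. It uses lossless zlib compression as a stand-in codec. All function names are hypothetical, and images are assumed to be small 2D lists of integer gray levels in 0..255.

```python
import random
import zlib
from collections import Counter
from math import log2


def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset (dx, dy)."""
    counts = Counter()
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[(image[y][x], image[ny][nx])] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image too small for the requested offset")
    return [[counts[(i, j)] / total for j in range(levels)] for i in range(levels)]


def glcm_contrast(p):
    """Expected squared gray-level difference between neighboring pixels."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def glcm_entropy(p):
    """Shannon entropy (bits) of the co-occurrence distribution."""
    return -sum(v * log2(v) for row in p for v in row if v > 0)


def compression_ratio(image):
    """Raw size divided by zlib-compressed size (assumed stand-in codec)."""
    raw = bytes(v for row in image for v in row)
    return len(raw) / len(zlib.compress(raw, 9))


# Toy inputs: a flat image, a checkerboard, and seeded random noise.
flat = [[0] * 16 for _ in range(16)]
checker = [[(x + y) % 2 for x in range(16)] for y in range(16)]
rng = random.Random(0)
noise = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]

flat_glcm = glcm(flat, levels=2)
checker_glcm = glcm(checker, levels=2)
```

In this toy setting the flat image has zero GLCM contrast and compresses heavily, while the checkerboard and noise images show higher texture statistics or lower compressibility. That is the intuition behind using such measures to probe fine detail.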

Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang • 2025

Related benchmarks

Task	Dataset	Result
High-Resolution Image Generation	Aesthetic-4K	IR 0.87	64
Text-to-Image Generation	4K Resolution 4K x 4K (test)	CLIP IQA Score 0.3012	16
4K ultra-high-resolution image generation	UltraHR-eval4k	FID 41.69	6
Text-to-Image Generation	Aesthetic-Eval@4096 (test)	FID 152.4	5
Text-to-Image Synthesis	Aesthetic-Eval 2K resolution (test)	gFID 39.49	5
Text-to-Image Synthesis	Aesthetic-Eval 4K resolution (test)	gFID 151.9	5
