Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
About
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build an entire data pipeline from scratch to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the image captions. Finally, Hunyuan-DiT can hold multi-turn multimodal dialogues with users, generating and refining images according to the context. In our holistic human evaluation protocol, which involves more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 63 | 506 |
| Text-to-Image Generation | GenEval | GenEval Score | 63 | 360 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 78.9 | 265 |
| Text-to-Image Generation | DPG | Overall Score | 78.87 | 172 |
| Text-to-Image Generation | DPG-Bench (test) | Global Fidelity | 84.59 | 58 |
| Text-to-Image Generation | DPGBench | Attribute Score | 88.01 | 44 |
| Text-to-Image Generation | GenAI-Bench | Basic Score | 0.818 | 41 |
| Text-to-Image Alignment | DPG | Overall | 78.87 | 39 |
| Text-to-Image Generation | HPS v3 | Overall Score | 8.19 | 24 |
| Dense prompt following | DPG-Bench v1.0 (test) | Entity Score | 80.59 | 20 |