Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

About

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA
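
To make the idea of temperature scaling with a timestep-dependent adjustment concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name, the `gamma` and `t_threshold` parameters and their default values, the piecewise-constant schedule, the convention that `t = 1` is pure noise, and the choice to scale only the image-query to text-key logits are all assumptions made for illustration.

```python
import numpy as np

def temperature_adjusted_attention(q, k, v, num_text, t, gamma=1.5, t_threshold=0.5):
    """Single-head joint attention over [text tokens; image tokens].

    Hypothetical sketch: the image-to-text logits are multiplied by `gamma`
    while the (assumed) normalized timestep `t` is at or above `t_threshold`.
    q, k, v have shape (num_tokens, dim); the first `num_text` rows are text.
    """
    dim = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(dim)

    # Assumed schedule: boost early (noisy) steps only, leave late steps untouched.
    scale = gamma if t >= t_threshold else 1.0

    # Rows = queries, columns = keys. Scale only image queries attending to text keys.
    logits = logits.copy()
    logits[num_text:, :num_text] *= scale

    # Numerically stable softmax over keys.
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup with token imbalance: 2 text tokens versus 6 image tokens.
num_text, num_image, dim = 2, 6, 4
q = np.ones((num_text + num_image, dim))
k = np.ones((num_text + num_image, dim))
v = np.arange((num_text + num_image) * dim, dtype=float).reshape(-1, dim)

_, w_early = temperature_adjusted_attention(q, k, v, num_text, t=0.9)
_, w_late = temperature_adjusted_attention(q, k, v, num_text, t=0.1)

# Share of attention that an image token gives to the text tokens.
text_mass_early = w_early[num_text, :num_text].sum()
text_mass_late = w_late[num_text, :num_text].sum()
```

In this toy setup every logit is equal, so without scaling each image token spreads its attention uniformly and the two text tokens receive only 2/8 of it. With the assumed early-step scaling the text share rises, which is the rebalancing effect the abstract describes. Note that multiplying by `gamma > 1` amplifies positive logits and pushes negative ones further down, so the effect on real attention maps depends on the sign and spread of the logits.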

Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	T2I-CompBench++	Color: 0.7535	99
Text-to-Image Generation	HPS v3	Overall Score: 10.48	48
Detail Binding	T2I-CompBench++	BLIP-VQA Color Score: 80.27	21
Style Composition	Style Composition Benchmark (SCB)	SCB Score: 25.12	20
Attribute Binding	T2I-CompBench attribute binding	Color Binding Score: 81.59	7
Image Quality Assessment	T2I-CompBench	MUSIQ: 73.29	7
Text-to-Image Generation	T2I-CompBench	Non-spatial Fidelity: 0.3164	7
Image Quality Assessment	GenEval	MUSIQ Score: 75.54	7
Text-to-Image Generation	GenEval	Accuracy (2 objects): 89	7
