
TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment

About

Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering; however, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens with text regions in the image. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer (MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
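The abstract does not spell out the two loss functions, so the following is only an illustrative sketch of the general idea of training-free attention-alignment guidance: read off token-to-patch attention, score how well the attention mass of the to-be-rendered content tokens lands inside the intended text region, and nudge the latent down the gradient of that score during early denoising steps. The loss definitions, `attn_fn`, and all names below are hypothetical, not the paper's.

```python
import torch

def attention_alignment_losses(attn, text_token_ids, region_mask):
    """Two illustrative (not the paper's) losses: (1) a presence loss that
    pulls the attention mass of content tokens inside the target text
    region, and (2) a leakage loss that penalizes mass falling outside it.
    attn: (num_prompt_tokens, num_patches) token-to-patch attention.
    region_mask: (num_patches,) binary mask of the intended text region."""
    probs = attn[text_token_ids]
    probs = probs / probs.sum(dim=-1, keepdim=True)   # renormalize per token
    inside = (probs * region_mask).sum(dim=-1)        # mass inside the region
    presence_loss = (1.0 - inside).mean()             # penalize missing text
    leakage_loss = (probs * (1.0 - region_mask)).sum(dim=-1).mean()
    return presence_loss + leakage_loss

def guided_update(latent, attn_fn, text_token_ids, region_mask, step_size=0.1):
    """One training-free guidance step: backpropagate the alignment loss
    to the latent and take a gradient step (applied only at early steps)."""
    latent = latent.detach().requires_grad_(True)
    loss = attention_alignment_losses(attn_fn(latent), text_token_ids, region_mask)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach(), loss.item()

# Toy demo: a stand-in differentiable "attention" that depends on the latent,
# in place of a real MM-DiT attention hook.
torch.manual_seed(0)
latent = torch.randn(8)
W = torch.randn(4, 16, 8)                 # latent -> (4 tokens, 16 patches) logits
attn_fn = lambda z: torch.softmax(W @ z, dim=-1)
region_mask = torch.zeros(16)
region_mask[:4] = 1.0                     # intended text region
content_ids = torch.tensor([1, 2])        # tokens that should be rendered

losses = []
for _ in range(20):
    latent, loss_val = guided_update(latent, attn_fn, content_ids, region_mask)
    losses.append(loss_val)
```

Repeating this update over the early denoising steps steers the latent so that the content tokens attend inside the intended text region, which is the mechanism the abstract credits for reducing text omission.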

Kanghyun Baek, Sangyub Lee, Jin Young Choi, Jaewoo Song, Daemin Park, Jooyoung Choi, Chaehun Shin, Bohyung Han, Sungroh Yoon • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text Rendering | Standard-text datasets (test) | Sentence Accuracy: 42.4 | 6
Text Rendering | DrawTextCreative (test) | Sentence Accuracy: 37.2 | 4
Text Rendering | TMDBEval500 (test) | Sentence Accuracy: 47.9 | 4
Text Rendering | ChineseDrawText (test) | Sentence Accuracy: 33.9 | 4
Text-to-Image Generation | Standard-text | FID: 126.3 | 3
Text Rendering | Long-text datasets | NED: 0.546 | 3
