Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation

About

In this paper, we present TextCrafter, a Complex Visual Text Generation (CVTG) framework inspired by selective visual attention in cognitive science, and introduce the "Text Insulation-and-Attention" mechanisms. To implement the selective-attention principle that selection operates on discrete objects, we propose a novel Bottleneck-aware Constrained Reinforcement Learning for Multi-text Insulation, which substantially improves text-rendering performance on the strong Qwen-Image pretrained model without introducing additional parameters. To align with the selective-concentration principle in human vision, we introduce a text-oriented attention module with a novel Quotation-guided Attention Gate that further improves generation quality for each text instance. Our reinforcement-learning-based text insulation approach attains state-of-the-art results, and incorporating text-oriented attention yields additional gains on top of an already strong baseline. In addition, we introduce CVTG-2K, a benchmark comprising 2,000 complex visual-text prompts. These prompts vary in the positions, quantities, lengths, and attributes of the text they request, and span diverse real-world scenarios. Extensive evaluations on the CVTG-2K, CVTG-Hard, LongText-Bench, and GenEval benchmarks confirm the effectiveness of TextCrafter. Despite using substantially fewer resources (i.e., 4 GPUs) than industrial-scale models (e.g., Qwen-Image, GPT Image, and Seedream), TextCrafter achieves superior performance in mitigating text misgeneration, omissions, and hallucinations.

Ying Tai, Nikai Du, Rui Xie, Zhennan Chen, Qian Wang, Zhengkai Jiang, Kai Zhang, Jian Yang • 2025

Related benchmarks

Task	Dataset	Result
Text Rendering	CVTG-2K	NED 90.38	75
Text Rendering	CVTG-2K (test)	NED 86.79	23
Text-to-Image Generation	GlyphBanana-Bench	OCR Accuracy 34	10
Text-to-Image Generation	CVTG	Accuracy 76	8
Text Rendering	Standard-text datasets (test)	Sentence Accuracy 36.3	6
Text Rendering	ChineseDrawText (test)	Sentence Accuracy 34.1	4
Text Rendering	DrawTextCreative (test)	Sentence Accuracy 31.2	4
Text Rendering	TMDBEval500 (test)	Sentence Accuracy 41	4
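
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how NED and Sentence Accuracy scores of this kind are commonly computed from OCR output. It assumes that NED denotes a normalized edit-distance similarity (1 minus the Levenshtein distance divided by the longer string's length, so higher is better) and that Sentence Accuracy is the fraction of exact matches; the listing itself does not define either metric, and the function names are hypothetical rather than taken from the TextCrafter code.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ned_similarity(target: str, recognized: str) -> float:
    """Assumed NED definition: 1 - distance / max length, in [0, 1]."""
    longest = max(len(target), len(recognized))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(target, recognized) / longest


def sentence_accuracy(targets: Sequence[str], recognized: Sequence[str]) -> float:
    """Assumed definition: share of targets reproduced exactly."""
    if not targets:
        return 0.0
    hits = sum(t == r for t, r in zip(targets, recognized))
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical target text vs. OCR reading of a generated image.
    print(ned_similarity("OPEN", "OPFN"))
    print(sentence_accuracy(["OPEN", "SALE"], ["OPEN", "SALF"]))
```

Under these assumptions, a single wrong character in a four-character word lowers the NED score only partially, while it counts as a complete miss for Sentence Accuracy, which is one reason the two columns are not directly comparable.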
