Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation

About

In this paper, we present TextCrafter, a Complex Visual Text Generation (CVTG) framework inspired by selective visual attention in cognitive science, and introduce the "Text Insulation-and-Attention" mechanisms. To implement the selective-attention principle that selection operates on discrete objects, we propose a novel Bottleneck-aware Constrained Reinforcement Learning for Multi-text Insulation, which substantially improves text-rendering performance on the strong Qwen-Image pretrained model without introducing additional parameters. To align with the selective concentration principle in human vision, we introduce a text-oriented attention module with a novel Quotation-guided Attention Gate that further improves generation quality for each text instance. Our reinforcement-learning-based text insulation approach attains state-of-the-art results, and incorporating text-oriented attention yields additional gains on top of an already strong baseline. More importantly, we introduce CVTG-2K, a benchmark comprising 2,000 complex visual-text prompts. These prompts vary in positions, quantities, lengths, and attributes, and span diverse real-world scenarios. Extensive evaluations on the CVTG-2K, CVTG-Hard, LongText-Bench, and GenEval datasets confirm the effectiveness of TextCrafter. Despite using substantially fewer resources (i.e., 4 GPUs) than industrial-scale models (e.g., Qwen-Image, GPT Image, and Seedream), TextCrafter achieves superior performance in mitigating text misgeneration, omissions, and hallucinations.

Ying Tai, Nikai Du, Rui Xie, Zhennan Chen, Qian Wang, Zhengkai Jiang, Kai Zhang, Jian Yang • 2025

Related benchmarks

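| Task                     | Dataset                       | Metric            | Result | Rank |
|--------------------------|-------------------------------|-------------------|--------|------|
| Text Rendering           | CVTG-2K                       | NED               | 90.38  | 28   |
| Text-to-Image Generation | CVTG                          | Accuracy          | 76     | 8    |
| Text Rendering           | Standard-text datasets (test) | Sentence Accuracy | 36.3   | 6    |
| Text Rendering           | ChineseDrawText (test)        | Sentence Accuracy | 34.1   | 4    |
| Text Rendering           | DrawTextCreative (test)       | Sentence Accuracy | 31.2   | 4    |
| Text Rendering           | TMDBEval500 (test)            | Sentence Accuracy | 41     | 4    |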
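
The results above are reported as NED and sentence accuracy. As a rough illustration of what such metrics measure, the sketch below computes a normalized-edit-distance similarity and an exact-match sentence accuracy in Python. It rests on assumptions this page does not confirm: that NED is taken as 1 minus the Levenshtein distance divided by the longer string's length, and that sentence accuracy is the fraction of exact string matches. The function names are hypothetical, and the benchmarks' actual OCR step, text normalization, and scaling may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned_similarity(pred: str, target: str) -> float:
    """Assumed definition: 1 - distance / max(len); 1.0 means identical strings."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - edit_distance(pred, target) / longest


def sentence_accuracy(preds: list[str], targets: list[str]) -> float:
    """Assumed definition: fraction of predictions that match their target exactly."""
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)
```

Under these assumptions, a rendered string "OPEM" scored against the target "OPEN" gets a similarity of 0.75 but counts as a miss for sentence accuracy, which is one reason the two metrics can sit far apart for the same model.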