Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation
About
In this paper, we present TextCrafter, a Complex Visual Text Generation (CVTG) framework inspired by selective visual attention in cognitive science, and introduce the "Text Insulation-and-Attention" mechanisms. To implement the selective-attention principle that selection operates on discrete objects, we propose a novel Bottleneck-aware Constrained Reinforcement Learning for Multi-text Insulation, which substantially improves text-rendering performance on the strong Qwen-Image pretrained model without introducing additional parameters. To align with the selective concentration principle in human vision, we introduce a text-oriented attention module with a novel Quotation-guided Attention Gate that further improves generation quality for each text instance. Our reinforcement-learning-based text insulation approach attains state-of-the-art results, and incorporating text-oriented attention yields additional gains on top of an already strong baseline. More importantly, we introduce CVTG-2K, a benchmark comprising 2,000 complex visual-text prompts. These prompts vary in the positions, quantities, lengths, and attributes of the text they request, and span diverse real-world scenarios. Extensive evaluations on the CVTG-2K, CVTG-Hard, LongText-Bench, and GenEval datasets confirm the effectiveness of TextCrafter. Despite using substantially fewer resources (i.e., 4 GPUs) than industrial-scale models (e.g., Qwen-Image, GPT Image, and Seedream), TextCrafter achieves superior performance in mitigating text misgeneration, omissions, and hallucinations.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Rendering | CVTG-2K | NED | 90.38 | 28 |
| Text-to-Image Generation | CVTG | Accuracy | 76 | 8 |
| Text Rendering | Standard-text datasets (test) | Sentence Accuracy | 36.3 | 6 |
| Text Rendering | ChineseDrawText (test) | Sentence Accuracy | 34.1 | 4 |
| Text Rendering | DrawTextCreative (test) | Sentence Accuracy | 31.2 | 4 |
| Text Rendering | TMDBEval500 (test) | Sentence Accuracy | 41 | 4 |
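
The NED figure reported for CVTG-2K above is a normalized-edit-distance score. As a rough illustration only, the sketch below computes a character-level similarity of the form 1 − (Levenshtein distance / length of the longer string). It assumes that higher is better and that the reported value is this similarity expressed as a percentage; the benchmark's exact normalization and any OCR preprocessing are not stated here, and the function names `levenshtein` and `ned_similarity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ned_similarity(predicted: str, target: str) -> float:
    """Assumed scoring: 1 - distance / max length, so 1.0 is an exact match."""
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(predicted, target) / longest
```

Under these assumptions, `ned_similarity("OPFN", "OPEN")` gives 0.75: one substituted character out of four.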