TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
About
Centred on content modification and style preservation, Scene Text Editing (STE) remains a challenging task despite considerable recent progress in text-to-image synthesis and text-driven image manipulation. GAN-based STE methods generally generalize poorly, while diffusion-based STE methods suffer from undesired style deviations. To address these problems, we propose TextCtrl, a diffusion-based method that edits text with prior guidance control. Our method consists of two key components: (i) By constructing fine-grained text style disentanglement and a robust text glyph structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy. (ii) To further leverage the style prior, we propose a Glyph-adaptive Mutual Self-attention mechanism, which deconstructs the implicit fine-grained features of the source image to enhance style consistency and visual quality during inference. Furthermore, to address the lack of a real-world STE evaluation benchmark, we create the first real-world image-pair dataset, termed ScenePair, for fair comparisons. Experiments demonstrate the effectiveness of TextCtrl compared with previous methods in terms of both style fidelity and text accuracy.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Text Recognition | 6 common benchmarks (test) | Word Accuracy (IIIT) | 70 | 57 |
| Text Style Fidelity Assessment | ScenePair Full-size Image | SSIM | 99.07 | 9 |
| Scene Text Editing | English Scene Text Editing Dataset (test) | Sen.Acc | 76.54 | 8 |
| Scene Text Editing | English ScenePair (test) | W.Acc | 78.91 | 7 |
| Text Image Generation | ScenePair | ACC | 84.67 | 6 |
| Text rendering accuracy | ScenePair 1.0 (test) | Accuracy (%) | 84.67 | 6 |
| Text rendering accuracy | ScenePair Random 1.0 (test) | Accuracy | 66.95 | 6 |
| Text Style Fidelity Assessment | ScenePair Cropped Text Image | SSIM | 37.56 | 6 |
| Text rendering accuracy | TamperScene 1.0 (test) | Accuracy | 74.17 | 3 |
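The accuracy figures in the table above are word-level recognition scores. As a rough illustration only, the sketch below assumes word accuracy means the percentage of edited images whose recognized text exactly matches the target string; the source does not state the exact protocol (recognizer, case handling, or filtering). The function name `word_accuracy` and the `case_sensitive` flag are hypothetical and are not taken from the TextCtrl code.

```python
from typing import Sequence


def word_accuracy(
    predictions: Sequence[str],
    targets: Sequence[str],
    case_sensitive: bool = False,
) -> float:
    """Return the percentage of predictions that exactly match their target.

    Assumption: one recognized word per edited image, compared after
    stripping surrounding whitespace (and lowercasing unless case_sensitive).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")

    def normalize(text: str) -> str:
        text = text.strip()
        return text if case_sensitive else text.lower()

    # Count exact matches between recognized and target words.
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)


if __name__ == "__main__":
    # Toy strings standing in for recognizer output; not real benchmark data.
    recognized = ["Hello", "world", "cafe"]
    expected = ["hello", "World", "café"]
    print(f"{word_accuracy(recognized, expected):.2f}")
```

In this toy example two of the three words match once case is ignored, so the score is about 66.67. A stricter protocol would pass `case_sensitive=True`.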