
TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

About

Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.

Yu Xie, Jielei Zhang, Pengyu Chen, Weihang Wang, Longwen Gao, Peiyi Li, Qian Qiao, Zhouhui Lian • 2025

Related benchmarks

Task                                   Dataset     Metric                      Result  Rank
Multi-line Text Reconstruction         AnyWord EN  Sequence Accuracy           77.3    10
Multi-line Text Reconstruction         AnyWord CH  Sequence Accuracy           61.4    10
Multi-line Text Reconstruction         TotalText   Sequence Accuracy           62.9    10
Single-line Scene Text Reconstruction  AnyWord EN  SeqAcc                      80.3    10
Multi-line Text Editing                ReCTS       SeqAcc                      37.2    5
Multi-line Text Generation             User Study  US Score                    8       5
Multi-line Text Reconstruction         ReCTS       Sequence Accuracy           64.1    5
Single-line Scene Text Editing         AnyWord CH  Sequence Accuracy (SeqAcc)  48.2    5
Single-line Scene Text Editing         TotalText   SeqAcc                      45      5
Single-line Scene Text Editing         ReCTS       SeqAcc                      40.6    5
Showing 10 of 13 rows
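
Most rows above report sequence accuracy (SeqAcc). This page does not define the metric, so the sketch below assumes the common reading: the percentage of text lines whose recognized string exactly matches the target string. The function name, the whitespace stripping, and the sample strings are illustrative assumptions, not taken from the TextFlux evaluation code.

```python
def sequence_accuracy(predictions, references):
    """Return the fraction of lines whose predicted text exactly matches the reference.

    Assumption: a line counts as correct only on an exact string match
    (after trimming surrounding whitespace); partial matches score zero.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical recognizer outputs for four generated text lines.
preds = ["OPEN", "CAFE", "EXIT", "SALE"]
refs = ["OPEN", "CAFE", "EXIT", "SALT"]
print(f"SeqAcc: {100 * sequence_accuracy(preds, refs):.1f}")  # prints "SeqAcc: 75.0"
```

Under this reading, a single wrong character makes the whole line count as incorrect, which is why SeqAcc is a stricter measure than character-level accuracy.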
