EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

About

Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still largely unexplored. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual text encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations, as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness of our approach in multilingual text rendering, visual quality, and layout-aware text integration.

Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text Rendering | Multilingual Benchmark English (test) | Character Precision | 99.68 | 7 |
| Text-to-Image Generation | User Study COCO-style benchmarks | Aesthetic Quality (Aes) | 7.0964 | 7 |
| Text Rendering | OneIG English | NED | 95.71 | 6 |
| Visual Text Rendering | GlyphCorrector Multilingual | Text Alignment Score | 83.2642 | 6 |
| Visual Text Rendering | GlyphCorrector Complex | Text Alignment | 88.2371 | 6 |
| Text Rendering | GlyphAcc-Multilingual English | NED | 0.978 | 6 |
| Text Rendering | GlyphAcc-Multilingual Korean | NED | 0.8544 | 6 |
| Text Rendering | GlyphAcc-Multilingual French | NED | 0.9671 | 6 |
| Text Rendering | GlyphAcc Complex | NED | 0.7645 | 6 |
| Text Rendering | GlyphAcc-Multilingual Chinese | NED | 0.9569 | 6 |
Showing 10 of 24 rows
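
Several rows above report NED, which this listing does not define. The sketch below assumes NED stands for normalized edit distance reported as a similarity (1 minus the Levenshtein distance divided by the longer string's length), a common convention in text rendering evaluation. The function names `levenshtein` and `ned_similarity` are hypothetical, and the benchmarks listed here may normalize or scale the score differently (some rows use a 0-100 scale, others 0-1).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def ned_similarity(pred: str, target: str) -> float:
    """Assumed NED convention: 1.0 means the strings match exactly."""
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))


if __name__ == "__main__":
    # Text recognized from a rendered image vs. the requested text.
    print(ned_similarity("EasyTxet", "EasyText"))
```

Because the distance is computed over Python `str` code points, the same function applies to Chinese, Korean, or French targets, although scripts with combining characters may need Unicode normalization first.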
