
TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

About

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
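As a rough illustration of how binary defect judgments at the global, word, and glyph levels might be collapsed into a scalar and then used for GRPO or DPO, here is a minimal Python sketch. The level weights, the function names, and the best-versus-worst pairing rule are assumptions made for illustration. The abstract does not specify the actual aggregation, and the VLM call that produces the judgments is omitted.

```python
from statistics import mean, pstdev

# Hypothetical level weights (assumption; the abstract gives no values).
LEVEL_WEIGHTS = {"global": 0.2, "word": 0.3, "glyph": 0.5}


def hierarchical_reward(defects):
    """Map binary defect flags per level (True = defect found) to a score in [0, 1].

    `defects` is a dict such as {"global": [False], "word": [True, False], ...},
    standing in for the judgments a VLM would return.
    """
    score = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        flags = defects.get(level, [])
        # Fraction of checks at this level that reported a defect.
        defect_rate = sum(flags) / len(flags) if flags else 0.0
        score += weight * (1.0 - defect_rate)
    return score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def preference_pair(samples, rewards):
    """DPO-style pair: (chosen, rejected) = highest- and lowest-reward samples."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    return samples[best], samples[worst]


if __name__ == "__main__":
    judgments = [
        {"global": [False], "word": [False, False], "glyph": [False, False]},
        {"global": [False], "word": [True, False], "glyph": [True, True]},
    ]
    rewards = [hierarchical_reward(j) for j in judgments]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(preference_pair(["image_a", "image_b"], rewards))
```

The same scalar feeds both objectives in this sketch: GRPO consumes the standardized per-group advantages, while DPO only needs the ordering the scalar induces over samples of the same prompt.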

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen • 2026

Related benchmarks

Task                      Dataset                                   Result           Rank
Text Rendering            Visual text scenarios (evaluation set)    NED 88.93        10
Text-to-Image Generation  TextAlign General Generation Benchmark    CLIPScore 31.56  10
Text Rendering            MARIO-Eval 500-sample (external)          NED 0.2686       7
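
The listing does not define NED. Assuming it denotes a normalized edit distance between the OCR-read text and the target string, one common formulation can be sketched as below. The normalization by the longer string's length and the function names are assumptions. The two NED rows above are on different scales (88.93 versus 0.2686), so they may follow different conventions, for example a percentage similarity versus a raw distance.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, target):
    """Edit distance divided by the longer length: 0.0 = identical, 1.0 = fully different."""
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))


if __name__ == "__main__":
    ned = normalized_edit_distance("OPEN 24 HOURS", "OPEN 24 HOUR5")
    print(ned)        # distance form: lower is better
    print(1.0 - ned)  # similarity form: higher is better
```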
