T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

About

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
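The abstract names two ingredients that are easy to illustrate in isolation: a GRPO-style group-relative advantage and an ensemble of generation rewards. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the ensemble is a plain average of per-model scores and that advantages are normalized with the population standard deviation within each group of samples; the function names and the score values are hypothetical.

```python
from statistics import mean, pstdev


def ensemble_reward(scores):
    """Combine the scores several reward models gave one image.

    Assumption: a plain average; the abstract does not state the rule.
    """
    return mean(scores)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores from three reward models for four images
# sampled from the same prompt (one inner list per image).
group_scores = [
    [0.9, 0.8, 0.7],
    [0.4, 0.5, 0.3],
    [0.6, 0.6, 0.6],
    [0.2, 0.1, 0.3],
]

rewards = [ensemble_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

In this sketch each sampled image receives one scalar advantage. Samples scoring above the group mean get a positive value and those below get a negative one, so no separate value model is needed to rank them.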

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Text-to-Image Generation	GenEval	Overall Score	79	704
Text-to-Image Generation	GenEval	Overall Score	79	517
Text-to-Image Generation	T2I-CompBench	Shape Fidelity	59.14	185
World Knowledge Image Generation	WISE	Overall Score	54	110
Text-to-Image Generation	GenEval	Overall Score	79	96
Text-to-Image Generation	GenEval++	Color Accuracy	68	75
Text-to-Image Generation	DPG-Bench (test)	Overall Fidelity	84.76	68
Knowledge-grounded reasoning	WISE	Overall Score	54	68
Text-to-Image Generation	WISE	Cultural Score	48	48
Text-to-Image Generation	T2I-CompBench	Color Fidelity	81.3	46

Showing 10 of 25 rows
