ThinkGen: Generalized Thinking for Visual Generation
About
Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and is limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages an MLLM's CoT reasoning across various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT): the MLLM generates tailored instructions based on user intent, and the DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO) that alternates reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
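To make the GRPO-based training idea more concrete, the following is a minimal sketch of the group-relative advantage that GRPO-style methods generally use: rewards for a group of samples drawn from the same prompt are normalized by the group mean and standard deviation. It is not taken from the ThinkGen code. The function names, the `eps` constant, and the round-by-round alternation in `alternating_schedule` are illustrative assumptions, since the abstract does not specify how often training switches between the MLLM and the DiT.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (generic GRPO-style step).

    `eps` is an assumed small constant that avoids division by zero when
    every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def alternating_schedule(num_rounds):
    """Hypothetical schedule: which module is updated in each round.

    Assumption for illustration only: even rounds update the MLLM and odd
    rounds update the DiT, with the other module kept frozen.
    """
    return ["mllm" if i % 2 == 0 else "dit" for i in range(num_rounds)]


if __name__ == "__main__":
    # Four samples for one prompt, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print(alternating_schedule(4))
```

Samples that score above the group mean receive a positive advantage and are reinforced, while below-average samples are penalized. Because the normalization is computed per group, the same step could in principle be applied to text samples from the MLLM or image samples from the DiT.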
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 89 | 467 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 85.87 | 173 |
| Reasoning Image Editing | RiseBench 1.0 (test) | Temporal Score | 16.4 | 30 |
| Reasoning Generation | WISE 1.0 (test) | Overall Score | 76 | 17 |
| Image Editing | ImgEdit (test) | Add Score | 4.75 | 14 |
| Text-to-Image Generation | CVTG | Accuracy | 84 | 8 |