Explicit Critic Guidance for Aligning Diffusion Models

About

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.
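The abstract names two ingredients: a critic that predicts values on noisy latents at each timestep, and trajectory-level PPO. The sketch below shows one standard way these pieces fit together. It uses generalized advantage estimation (GAE) over a denoising trajectory with a single terminal reward, followed by the usual PPO clipped objective. The abstract does not state the paper's actual estimator, discounting, or loss. The function names `gae_advantages` and `ppo_clip_loss`, the parameters `gamma`, `lam`, and `eps`, and the sparse terminal reward are all assumptions made for illustration.

```python
from typing import List


def gae_advantages(values: List[float], final_reward: float,
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Per-step advantages for one denoising trajectory (hypothetical helper).

    values[t] is the critic's estimate on the noisy latent at step t,
    with t = 0 the noisiest state. Assumption: the reward model scores
    only the final image, so every intermediate reward is zero.
    """
    num_steps = len(values)
    advantages = [0.0] * num_steps
    next_value = 0.0  # nothing follows the terminal state
    running = 0.0
    for t in reversed(range(num_steps)):
        reward = final_reward if t == num_steps - 1 else 0.0
        # One-step temporal-difference error at step t.
        delta = reward + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors (GAE).
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_loss(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss for a single denoising step."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Illustrative critic outputs only, not values from the paper.
    step_values = [0.2, 0.5, 0.8]
    print(gae_advantages(step_values, final_reward=1.0, gamma=1.0, lam=1.0))
```

With `gamma = lam = 1`, each advantage reduces to the final reward minus the critic's value at that step. This is the sense in which a per-timestep critic can assign different credit to different denoising steps, rather than sharing one trajectory-level signal.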

Zhengyang Liang, Qihang Zhang, Ceyuan Yang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | GenEval Score | 97.58 | 442 |
| Text-to-Image Generation | General Prompts Alignment & Preference | CLIP Score | 0.3431 | 8 |
| Text-to-Image Generation | Multi-reward CLIP, HPSv2.1, and GenEval (test) | CLIP Score | 28.96 | 7 |
| Text-to-Image Generation | OCR-based text-rendering | OCR Score | 26.48 | 7 |