USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

About

Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
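The triplet structure described above (content image, style image, stylized content image) can be pictured as a simple record type. The following is a minimal sketch, not the authors' data format: the class name `StyleTriplet`, its field names, the helper `iter_pairs`, and the example file names are all hypothetical, and how USO actually stores or samples its triplets is not stated here.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class StyleTriplet:
    """Hypothetical record holding the paths to the three images of one triplet."""
    content_path: str   # image that carries the subject/content
    style_path: str     # image that carries the reference style
    stylized_path: str  # the content rendered in the reference style


def iter_pairs(triplets: List[StyleTriplet]) -> Iterator[Tuple[Tuple[str, str], str]]:
    """Yield ((content, style), stylized) pairs: two conditioning images, one target."""
    for t in triplets:
        yield (t.content_path, t.style_path), t.stylized_path


# Illustrative usage with made-up file names.
example = StyleTriplet("cat.png", "ukiyoe.png", "cat_ukiyoe.png")
for conditions, target in iter_pairs([example]):
    print(conditions, "->", target)
```

Under this layout, the content and style references are separate inputs and the stylized image is the shared target, which is one way to read the "disentanglement and re-composition" framing of the abstract.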

Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He • 2025

Related benchmarks

Task	Dataset	Result
Subject-driven image generation	DreamBench	DINO Score 74.78	113
Multi-image Reasoning	OmniContext	Single Scene Char Score 8.03	20
Identity-preserving Image Generation	MultiID-Bench 1-people	Sim(GT) 0.401	18
Product poster generation	InnoComposer-Bench 1.0 (test)	IR-Score 0.911	14
Multi-Reference Image Editing	MICo-Bench	Object Score 38.18	14
Outfit Generation	VITON-HD	LPIPS 0.585	13
Outfit Generation	Fashion130K	LPIPS 0.656	12
Subject-driven image generation	SconeEval	Composition Single COM 8.03	11
Subject-consistent image generation	OmniContext	Fidelity (Single, Character) 7.71	10
Image Stylization	Custom Triplet Dataset 21 styles (test)	CLIP Score 69.39	9

Showing 10 of 21 rows
