USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
About
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
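The triplet structure described above (content image, style image, stylized content image) can be sketched as a simple record type. This is a minimal illustration, not the authors' data pipeline: the names `Triplet`, `build_triplets`, and `conditioning_and_target`, the use of file paths, and the choice to treat the stylized image as the training target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One hypothetical training record: three image paths."""
    content_image: str   # image that supplies the subject/content
    style_image: str     # image that supplies the style reference
    stylized_image: str  # the content rendered in the reference style


def build_triplets(records: Iterable[Tuple[str, str, str]]) -> List[Triplet]:
    """Build Triplet records, skipping any row with an empty field."""
    triplets = []
    for content, style, stylized in records:
        if content and style and stylized:
            triplets.append(Triplet(content, style, stylized))
    return triplets


def conditioning_and_target(t: Triplet) -> Tuple[Tuple[str, str], str]:
    """Split a triplet into (conditioning images, target image).

    Assumption: the model is conditioned on the content and style
    images and supervised with the stylized image.
    """
    return (t.content_image, t.style_image), t.stylized_image
```

For example, `build_triplets([("c.png", "s.png", "cs.png")])` yields one `Triplet`, and `conditioning_and_target` returns the pair `("c.png", "s.png")` alongside the target `"cs.png"`. How USO actually feeds these images to its two training objectives is not specified in the abstract.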
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Subject-driven image generation | DreamBench | DINO Score | 74.78 | 113 |
| Multi-image Reasoning | OmniContext | Single Scene Char Score | 8.03 | 20 |
| Identity-preserving Image Generation | MultiID-Bench 1-people | Sim(GT) | 0.401 | 18 |
| Product poster generation | InnoComposer-Bench 1.0 (test) | IR-Score | 0.911 | 14 |
| Multi-Reference Image Editing | MICo-Bench | Object Score | 38.18 | 14 |
| Outfit Generation | VITON-HD | LPIPS | 0.585 | 13 |
| Outfit Generation | Fashion130K | LPIPS | 0.656 | 12 |
| Subject-driven image generation | SconeEval | Composition Single COM | 8.03 | 11 |
| Subject-consistent image generation | OmniContext | Fidelity (Single, Character) | 7.71 | 10 |
| Image Stylization | Custom Triplet Dataset 21 styles (test) | CLIP Score | 69.39 | 9 |