UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement
About
Style transfer must match a target style while preserving content semantics. DiT-based diffusion models often suffer from content-style entanglement, leading to reference-content leakage and unstable generation. We present UniCSG, a unified framework for content-constrained, style-driven generation in both text-guided and reference-guided settings. UniCSG employs staged training: (i) a latent-space semantic disentanglement stage that combines low-frequency preprocessing with conditioning corruption to encourage content-style separation, and (ii) a latent-space frequency-aware detail reconstruction stage that refines details via multi-scale frequency supervision. We further incorporate pixel-space reward learning to align latent objectives with perceptual quality after decoding. Experiments demonstrate improved content faithfulness, style alignment, and robustness in both settings.
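The abstract names two frequency-based ingredients, low-frequency preprocessing and multi-scale frequency supervision, without giving their exact form. The NumPy sketch below is only an illustration of what such operations could look like on a single 2D latent channel. The function names (`low_pass`, `downsample`, `multiscale_frequency_loss`), the `keep_ratio` and `factors` parameters, the box-shaped FFT mask, and the L1 amplitude-spectrum loss are all assumptions made for this example, not the formulation used by UniCSG.

```python
import numpy as np


def low_pass(x, keep_ratio=0.25):
    """Keep a centered low-frequency box of the 2D spectrum (assumed filter shape)."""
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))  # move the DC term to the center
    kh = max(1, int(h * keep_ratio / 2))
    kw = max(1, int(w * keep_ratio / 2))
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - kh:cy + kh + 1, cx - kw:cx + kw + 1] = True
    return np.fft.ifft2(np.fft.ifftshift(spec * mask)).real


def downsample(x, factor):
    """Average-pool a 2D array by an integer factor, cropping any remainder."""
    h, w = x.shape
    h, w = h - h % factor, w - w % factor
    blocks = x[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def multiscale_frequency_loss(pred, target, factors=(1, 2, 4)):
    """Mean L1 distance between amplitude spectra, averaged over scales (assumed loss)."""
    total = 0.0
    for f in factors:
        p = np.abs(np.fft.fft2(downsample(pred, f)))
        t = np.abs(np.fft.fft2(downsample(target, f)))
        total += float(np.mean(np.abs(p - t)))
    return total / len(factors)


# Hypothetical usage on a random 16x16 "latent" channel.
rng = np.random.default_rng(0)
latent = rng.standard_normal((16, 16))
content_hint = low_pass(latent)  # coarse structure only
loss = multiscale_frequency_loss(content_hint, latent)
```

In this sketch the low-pass output keeps coarse layout while discarding fine texture, which is one plausible way to pass content structure to a model without also passing the reference's stylistic detail. The loss compares spectra at several resolutions so that both coarse and fine frequency bands contribute.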
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Style Transfer | CSG-Bench | FID | 87.32 | 20 |
| Reference-Guided Style Transfer | OmniConsistency-Bench | FID | 88.428 | 20 |
| Controllable Style Generation | CSG-Bench Text-guided | Content Preference Rate | 29.6 | 9 |
| Controllable Style Generation | CSG-Bench Reference-guided | Content Preference Rate | 38.3 | 9 |