Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model

About

Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement (PDTR) strategy, which strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for the two tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly more generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following MetaQuery, we connect UniPic2-SD3.5M-Kontext to Qwen2.5-VL-7B via a connector and train them jointly to obtain a unified multimodal model, UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. These results validate the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.

Hongyang Wei, Baixin Xu, Hongbo Liu, Size Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, Chuanxin Tang, Zidong Wang, Yichen Wei, Liang Hu, Boyi Jiang, Wei Li, Ying He, Yang Liu, Xuchen Song, Yahui Zhou • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Image Editing	ImgEdit-Bench	Overall Score	4.06	256
Image Editing	GEdit-Bench	Semantic Consistency	7.63	102
Image Editing	GEdit-Bench-EN (full)	G-Score (O)	7.1	84
Reasoning-informed Image Editing	RISE-Bench	Temporal Score	2.3	72
Understanding Enhances Generation	RealUnify	WK	62	9
Generation Enhances Understanding	RealUnify	MR-II	28	9
Instruction-based Image Editing	KRIS-Bench 1.0 (test)	Attribute Perception	70.78	7

