
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

About

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
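The interleaved reasoning described above, in which the model alternates between image operations and external retrieval before answering, can be pictured as a simple tool-dispatch loop. The following is a minimal sketch under stated assumptions, not the Skywork-R1V4 implementation: the tool names (`crop_image`, `web_search`), the `scripted_policy` stand-in for the model, and the `max_steps` cap are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One executed tool call and the observation it returned."""
    tool: str
    argument: str
    observation: str


# Hypothetical stub tools; a real system would run image ops and web search.
def crop_image(region: str) -> str:
    return f"cropped view of {region}"


def web_search(query: str) -> str:
    return f"search results for '{query}'"


TOOLS: Dict[str, Callable[[str], str]] = {
    "crop_image": crop_image,
    "web_search": web_search,
}


def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: returns the next (tool, argument) pair.

    A real policy would condition on the observations in `history`;
    this one just replays a fixed plan to keep the sketch runnable.
    """
    plan = [
        ("crop_image", "storefront sign"),
        ("web_search", "name on the sign"),
        ("answer", "final answer"),
    ]
    return plan[len(history)]


def run_agent(
    policy: Callable[[List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 16,
) -> Tuple[Optional[str], List[Step]]:
    """Alternate between tool calls until the policy emits an answer."""
    history: List[Step] = []
    for _ in range(max_steps):
        tool, argument = policy(history)
        if tool == "answer":
            return argument, history
        observation = tools[tool](argument)
        history.append(Step(tool, argument, observation))
    return None, history  # step budget exhausted without an answer


answer, trace = run_agent(scripted_policy, TOOLS)
for step in trace:
    print(step.tool, "->", step.observation)
print("answer:", answer)
```

The recorded `trace` is the kind of tool-execution trajectory the abstract refers to: each entry pairs a planned action with what actually came back, which is what makes planning-execution consistency checkable after the fact.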

Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Understanding | V*Bench | Avg@8 EM | 0.88 | 18 |
| Visual Understanding | HR-Bench-8K | Avg@8 Exact Match | 79.8 | 17 |
| Visual Understanding | HR-Bench-4K | Avg@8 Exact Match | 82.8 | 17 |
| Multimodal Understanding | MME-RW-en (test) | Overall Score | 71.4 | 15 |
| Visual Perception | HR-8K (test) | Accuracy | 79.8 | 15 |
| Visual Perception | HR-4K (test) | Accuracy | 82.8 | 15 |
| Visual Perception | VSTAR (test) | Accuracy | 88 | 15 |
| Visual Understanding | MME RealWorld | Pass@1 Exact Match | 71.4 | 13 |
| Visual Understanding | V* Bench, HR-Bench, and MME RealWorld | Average Score | 80.5 | 13 |
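Several rows report an "Avg@8 Exact Match" metric. The page does not define it, so the sketch below rests on an assumption: that Avg@k is the exact-match rate averaged over k sampled responses per question, then averaged over questions. The normalization (whitespace stripping and lowercasing) and the function names are likewise illustrative choices, not taken from the paper.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str) -> bool:
    """Compare answers after a simple, assumed normalization."""
    return prediction.strip().lower() == reference.strip().lower()


def avg_at_k(samples: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Mean exact-match rate over k samples per question, then over questions.

    samples[i] holds the k sampled answers for question i;
    references[i] is the gold answer for question i.
    """
    if len(samples) != len(references):
        raise ValueError("samples and references must have the same length")
    if not samples:
        raise ValueError("at least one question is required")

    per_question = []
    for predictions, reference in zip(samples, references):
        if not predictions:
            raise ValueError("each question needs at least one sample")
        hits = sum(exact_match(p, reference) for p in predictions)
        per_question.append(hits / len(predictions))
    return sum(per_question) / len(per_question)


# Toy example with k = 4 samples per question (Avg@8 would use k = 8).
toy_samples = [["A", "a", "B", "A"], ["C", "D", "D", "D"]]
toy_references = ["A", "D"]
print(avg_at_k(toy_samples, toy_references))
```

Under this reading, the score rewards answers that are correct consistently across samples, unlike a Pass@k metric, which only needs one of the k samples to be right.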
