S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

About

We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.
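The adaptive data routing step described above can be illustrated with a minimal sketch. It assumes each training sample already carries a scalar visual-information-gain score; the `Sample` fields, the `route` function, and the 0.3 threshold are all hypothetical names and values chosen for illustration, not details taken from the paper.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample; the field names are illustrative only."""
    question: str
    reasoning_steps: List[str]
    image_ops: List[str] = field(default_factory=list)  # code snippets run on the image
    visual_gain: float = 0.0  # assumed score in [0, 1]; how it is computed is not specified here
    mode: str = "thinking_with_images"


GAIN_THRESHOLD = 0.3  # assumed cutoff, not a value reported by the authors


def route(sample: Sample, threshold: float = GAIN_THRESHOLD) -> Sample:
    """Keep image operations only when they add enough visual information.

    Samples without image operations, or whose gain falls below the
    threshold, are converted to pure Reasoning-mode data.
    """
    if sample.image_ops and sample.visual_gain >= threshold:
        return replace(sample, mode="thinking_with_images")
    return replace(sample, image_ops=[], mode="reasoning")


if __name__ == "__main__":
    useful = Sample("Read the axis label", ["zoom in", "read"], ["crop(img, box)"], 0.8)
    wasteful = Sample("Add two numbers", ["compute"], ["rotate(img, 0)"], 0.05)
    for s in (useful, wasteful):
        routed = route(s)
        print(routed.mode, len(routed.image_ops))
```

Under these assumptions, a sample whose image operations contribute little is kept as training data but stripped of its visual steps, which is one way a model could learn when image manipulation is unnecessary.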

Qingxiao Li, Lifeng Xu, QingLi Wang, Yudong Bai, Mingwei Ou, Shu Hu, Nan Xu • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Visual Mathematical Reasoning	MathVision	Accuracy	77.7	298
Multimodal Reasoning	MMMU	Accuracy	83.4	220
Multimodal Understanding	MMMU	Accuracy (MMMU)	83.4	73
Visual Reasoning	V*	Accuracy	92.7	72
General Visual Reasoning	MME-RealWorld-Lite	Accuracy	67.1	47
Visual Reasoning	HRBench-4K	Accuracy	91.38	41
High-Resolution Visual Reasoning	HRBench-8K	Accuracy	93.5	28
Visual Reasoning	MME-RealWorld-CN	Accuracy	77.7	14
Physics-Scene Visual Reasoning	Physics	Accuracy	54.35	10
Scientific Multimodal Reasoning	VRSBench	Accuracy	74.32	10

Showing 10 of 16 rows
