
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

About

We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate the redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks (HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*) and outperforms the compared systems on scientific reasoning benchmarks such as Physics and VRSBench.
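To make the adaptive data routing idea concrete, the following is a minimal Python sketch, not the authors' implementation. The `Trajectory` fields, the `visual_gain` score, and the `gain_threshold` value are all hypothetical; the abstract does not say how visual information gain is measured or where the cutoff lies.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Trajectory:
    """One training sample (hypothetical schema, for illustration only)."""
    question: str
    steps: List[str]       # reasoning text, in order
    image_ops: List[str]   # image-processing code snippets, possibly empty
    visual_gain: float     # assumed score in [0, 1]; how it is computed is not given
    mode: str = "thinking_with_images"


def route(sample: Trajectory, gain_threshold: float = 0.5) -> Trajectory:
    """Keep image operations only when they appear to add visual information.

    Samples with no image operations, or with a gain below the (assumed)
    threshold, are converted into pure Reasoning-mode samples.
    """
    if sample.image_ops and sample.visual_gain >= gain_threshold:
        return sample
    return replace(sample, image_ops=[], mode="reasoning")


# Usage: a low-gain crop is dropped, a high-gain crop is kept.
low = route(Trajectory("Read the axis label.", ["..."], ["img.crop(box)"], 0.1))
high = route(Trajectory("Read the tiny legend.", ["..."], ["img.crop(box)"], 0.9))
print(low.mode, high.mode)
```

Under these assumptions, routing is a pure function of each sample, so it can run as a final pass after any upstream filtering of the trajectories.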

Qingxiao Li, Lifeng Xu, QingLi Wang, Yudong Bai, Mingwei Ou, Shu Hu, Nan Xu • 2026

Related benchmarks

| Task | Dataset | Result (Accuracy) | Rank |
| --- | --- | --- | --- |
| Visual Mathematical Reasoning | MathVision | 77.7 | 254 |
| Multimodal Reasoning | MMMU | 83.4 | 208 |
| Visual Reasoning | V* | 92.7 | 52 |
| General Visual Reasoning | MME-RealWorld-Lite | 67.1 | 37 |
| High-Resolution Visual Reasoning | HR-Bench-8K | 93.5 | 28 |
| Visual Reasoning | HRBench 4K | 91.38 | 14 |
| Visual Reasoning | MME-RW Chinese | 77.7 | 14 |
| Physics-Scene Visual Reasoning | Physics | 54.35 | 10 |
| Scientific Multimodal Reasoning | VRSBench | 74.32 | 10 |
| Scientific Multimodal Reasoning | GMAI | 62.13 | 10 |
Showing 10 of 13 rows
