Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

About

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) treating text and vision as separate modalities hinders unified multimodal understanding and generation. We therefore propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs and even surpasses GPT-5 by 10% on Eyeballing Puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities, and we find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings suggest that video generation models are potential unified multimodal understanding and generation models, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.
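
The abstract does not say how self-consistency is applied to generated videos. One common form of the technique is majority voting over answers extracted from several independent generations, sketched below. The function name `majority_vote` and the assumption that each generation has already been reduced to a short answer string are illustrative only and are not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    `answers` is assumed to hold one extracted answer string per
    independent generation (hypothetical input format).
    """
    # Normalize whitespace and drop empty or missing answers.
    cleaned = [a.strip() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    # Pick the answer with the highest count across all samples.
    return Counter(cleaned).most_common(1)[0][0]
```

For example, `majority_vote(["42", "41", "42 "])` treats the first and third samples as the same answer after trimming whitespace and selects that answer.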

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista | Accuracy | 67.6 | 257 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy | 75.7 | 218 |
| Multimodal Math Reasoning | MathVision | Accuracy | 46.7 | 183 |
| Multimodal Reasoning | MMMU | Accuracy | 69.2 | 130 |
| Reasoning | GSM8K | Accuracy | 0.757 | 106 |
| General Reasoning | BBH | BBH General Reasoning Accuracy | 80.6 | 98 |
| Reasoning | MATH 500 | Accuracy (%) | 67 | 90 |
| General Reasoning | Super GPQA | Accuracy | 53.2 | 89 |
| Multimodal Reasoning | MMBench | -- | -- | 78 |
| Math Reasoning | GSM8K | Accuracy (GSM8K) | 98.9 | 49 |
Showing 10 of 23 rows
