
DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning

About

Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement learning, incurring substantial cost. An appealing alternative is parameter-space model merging between reasoning-enhanced LLMs and MLLMs, but we show that naive merging is fragile: its effectiveness varies widely across model families and can significantly degrade performance (e.g., for Qwen-based MLLMs). We propose Directional Reasoning Injection for Fine-Tuning (DRIFT), a lightweight method that transfers reasoning knowledge in the gradient space while preserving multimodal alignment. DRIFT precomputes a reasoning prior from the parameter differences between text-only reasoning experts and multimodal models, and uses it to bias gradients during supervised fine-tuning. This design retains the simplicity of standard SFT pipelines while enabling efficient and stable reasoning transfer. Experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, show that DRIFT consistently outperforms naive merging and standard SFT, and matches or surpasses training-intensive methods with substantially lower data and compute.
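The abstract gives the idea behind DRIFT but not its exact update rule, so the following is only a minimal sketch under stated assumptions. It takes the reasoning prior to be the per-parameter difference (reasoning expert minus multimodal model), and it biases each gradient by subtracting a norm-scaled copy of that prior before a plain SGD step. The function names, the `alpha` strength, the norm scaling, and the toy parameter dictionaries are all hypothetical and are not taken from the paper.

```python
import math

def reasoning_prior(reasoner, mllm):
    """Per-parameter difference (reasoning expert minus multimodal model).

    Only parameters present in both models get a prior, so vision-only
    weights are left alone. (Assumed form of the prior, not the paper's.)
    """
    return {
        name: [r - m for r, m in zip(reasoner[name], mllm[name])]
        for name in mllm
        if name in reasoner
    }

def bias_gradient(grad, prior, alpha=0.1):
    """Return a gradient nudged toward the prior direction (assumed rule)."""
    norm_p = math.sqrt(sum(p * p for p in prior))
    norm_g = math.sqrt(sum(g * g for g in grad))
    if norm_p == 0.0:
        return list(grad)
    # Scale the prior relative to the gradient norm so alpha is unit-free.
    scale = alpha * norm_g / norm_p
    # Descent subtracts the gradient, so subtracting the prior here moves
    # the weights along +prior, i.e. toward the reasoning expert.
    return [g - scale * p for g, p in zip(grad, prior)]

def sgd_step(params, grads, priors, lr=0.01, alpha=0.1):
    """One plain SGD step using prior-biased gradients where a prior exists."""
    updated = {}
    for name, weights in params.items():
        g = grads[name]
        if name in priors:
            g = bias_gradient(g, priors[name], alpha)
        updated[name] = [w - lr * gi for w, gi in zip(weights, g)]
    return updated

# Toy example: one shared language parameter and one vision-only parameter.
reasoner = {"w": [1.0, 0.0]}
mllm = {"w": [0.0, 0.0], "vision": [5.0]}
priors = reasoning_prior(reasoner, mllm)   # computed once, before training
grads = {"w": [0.0, 2.0], "vision": [1.0]}
stepped = sgd_step(mllm, grads, priors, lr=0.1, alpha=0.5)
```

In this sketch the prior is computed once and reused at every step, which matches the "precomputes" wording in the abstract. The shared parameter drifts toward the reasoning expert while still following the task gradient, and the vision-only parameter receives an ordinary SGD update.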

Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu • 2025

Related benchmarks

Task                            Dataset         Result        Rank
Visual Mathematical Reasoning   MathVista       Score 69.9    46
Visual Reasoning                MathVerse       --            40
Visual Reasoning                LogicVista      --            26
Visual Math Reasoning           MathVision      Score 26.6    24
Visual Reasoning                WeMath strict   Score 38.5    12
