
Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

About

Supervised fine-tuning (SFT) on visual instruction data often improves the perceptual capabilities of vision-language models (VLMs) while degrading their reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation stems from disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning, at a cost of only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.
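
The paper's implementation is not reproduced on this page, so the following is only a minimal PyTorch sketch of what an input-adaptive, modality-aware depth-aggregation module with a low-rank bottleneck could look like. All names here are hypothetical; the choice of the top-layer state as the mixing query, the per-modality bias over depths, and the rank are assumptions for illustration, not the authors' design.

```python
import torch
import torch.nn as nn


class IADASketch(nn.Module):
    """Illustrative sketch of input-adaptive depth aggregation (not the paper's code).

    Hidden states from all L layers are mixed per token using softmax weights
    predicted from the top-layer state through a low-rank bottleneck, plus an
    assumed learned per-modality bias over depths (text vs. image tokens).
    """

    def __init__(self, num_layers: int, d_model: int, rank: int = 8, num_modalities: int = 2):
        super().__init__()
        # Low-rank bottleneck: d_model -> rank -> num_layers mixing logits.
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, num_layers, bias=False)
        # Assumed modality-aware component: one bias vector over depths per modality.
        self.modality_bias = nn.Parameter(torch.zeros(num_modalities, num_layers))

    def forward(self, layer_states: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # layer_states: (L, B, T, D) hidden states collected from each layer.
        # modality_ids: (B, T) ints, e.g. 0 for text tokens, 1 for image tokens.
        query = layer_states[-1]                            # (B, T, D) top-layer state
        logits = self.up(self.down(query))                  # (B, T, L) input-adaptive logits
        logits = logits + self.modality_bias[modality_ids]  # (B, T, L) modality-aware shift
        weights = torch.softmax(logits, dim=-1)             # convex weights over depths
        # Weighted sum over the depth axis yields the aggregated representation.
        return torch.einsum("btl,lbtd->btd", weights, layer_states)


# Usage on dummy shapes: 28 layers, batch 2, 16 tokens, hidden size 2048.
states = torch.randn(28, 2, 16, 2048)
modality = torch.randint(0, 2, (2, 16))
out = IADASketch(num_layers=28, d_model=2048)(states, modality)
print(out.shape)  # torch.Size([2, 16, 2048])
```

With d_model = 2048, 28 layers, and rank 8, this sketch adds roughly 17K parameters; the actual IADA parameterization (0.14M additional parameters per the abstract) is evidently richer.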

Yiming Ren, Yujiu Yang, Junjie Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | -- | 1455 |
| Mathematical Reasoning | MathVista | Score | 54 | 385 |
| Optical Character Recognition | OCRBench | -- | -- | 232 |
| Multimodal Reasoning | MMMU | Accuracy | 44.4 | 130 |
| Multimodal Perception | MME | Perception Score | 83.2 | 43 |
| Real-world Visual Understanding | RealworldQA | Score | 0.639 | 29 |
| Chart Understanding | ChartQA | Score | 77.9 | 23 |
| Visual Question Answering | TextVQA | Score | 77.6 | 20 |
| Diagram Reasoning | AI2D | Score | 75.3 | 16 |
| Scientific Reasoning | ScienceQA | Score | 83.4 | 13 |

Showing 10 of 11 rows.
