Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation
About
Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.
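The core idea can be sketched in a few lines: for each token, a low-rank bottleneck maps a query representation to mixing weights over all layer depths, and the output is the weighted sum of per-layer hidden states. The sketch below is a hypothetical NumPy illustration under our own assumptions (final-layer state as the query, softmax mixing, shared weights across modalities), not the paper's actual implementation; names like `IADASketch` are invented for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class IADASketch:
    """Illustrative sketch of input-adaptive depth aggregation.

    A low-rank bottleneck (d_model -> rank -> n_layers) produces
    per-token logits over depths; hidden states from all layers are
    mixed with the resulting softmax weights. A modality-aware variant
    (separate bottlenecks for image vs. text tokens) is omitted here.
    """

    def __init__(self, d_model, n_layers, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        # Extra parameters: d_model*rank + rank*n_layers (small by design).
        self.A = rng.normal(0.0, 0.02, (d_model, rank))   # down-projection
        self.B = rng.normal(0.0, 0.02, (rank, n_layers))  # up-projection to depth logits

    def __call__(self, layer_states):
        # layer_states: (n_layers, n_tokens, d_model) stack of hidden states.
        query = layer_states[-1]             # final-layer state as query (assumption)
        logits = query @ self.A @ self.B     # (n_tokens, n_layers) depth logits
        weights = softmax(logits, axis=-1)   # input-adaptive mixing weights
        # Weighted sum over depth -> (n_tokens, d_model).
        return np.einsum("tl,ltd->td", weights, layer_states)
```

Because the weights depend on the token's own representation, each input can retrieve a different blend of shallow and deep features, which is the property the fixed-aggregation baseline lacks.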
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | -- | 1455 |
| Mathematical Reasoning | MathVista | Score | 54 | 385 |
| Optical Character Recognition | OCRBench | -- | -- | 232 |
| Multimodal Reasoning | MMMU | Accuracy | 44.4 | 130 |
| Multimodal Perception | MME | Perception Score | 83.2 | 43 |
| Real-world Visual Understanding | RealworldQA | Score | 0.639 | 29 |
| Chart Understanding | ChartQA | Score | 77.9 | 23 |
| Visual Question Answering | TextVQA | Score | 77.6 | 20 |
| Diagram Reasoning | AI2D | Score | 75.3 | 16 |
| Scientific Reasoning | ScienceQA | Score | 83.4 | 13 |