MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

About

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
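The comparison of inter-modal and intra-modal divergence described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the key vectors are synthetic Gaussian samples rather than keys extracted from LLaVA or Qwen2.5-VL, and the per-dimension histogram estimator and the helper names `js_divergence` and `key_space_jsd` are hypothetical, not the authors' procedure.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (log base 2, so bounded by 1) of two histograms."""
    # Smooth with eps so empty bins do not produce log(0), then normalize.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m))
    kl_qm = np.sum(q * np.log2(q / m))
    return 0.5 * (kl_pm + kl_qm)


def key_space_jsd(keys_a, keys_b, bins=50):
    """Mean per-dimension JSD between two sets of key vectors of shape (n, d).

    Assumption: each dimension is binned on a shared grid and the
    per-dimension divergences are averaged. This is one simple estimator,
    not necessarily the one used in the paper.
    """
    scores = []
    for dim in range(keys_a.shape[1]):
        a, b = keys_a[:, dim], keys_b[:, dim]
        lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
        edges = np.linspace(lo, hi, bins + 1)
        hist_a, _ = np.histogram(a, bins=edges)
        hist_b, _ = np.histogram(b, bins=edges)
        scores.append(js_divergence(hist_a, hist_b))
    return float(np.mean(scores))


# Synthetic stand-ins for extracted keys (hypothetical data).
rng = np.random.default_rng(0)
d = 16
text_keys_1 = rng.normal(0.0, 1.0, size=(2000, d))
text_keys_2 = rng.normal(0.0, 1.0, size=(2000, d))
visual_keys = rng.normal(3.0, 0.5, size=(2000, d))  # shifted to mimic OOD keys

intra = key_space_jsd(text_keys_1, text_keys_2)
inter = key_space_jsd(text_keys_1, visual_keys)
print(f"intra-modal JSD: {intra:.4f}")
print(f"inter-modal JSD: {inter:.4f}")
```

With real models, the synthetic arrays would be replaced by key projections collected from an attention layer for text tokens and image tokens separately. The gap between the two printed values here reflects only the shift built into the synthetic data.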

Xinhan Zheng, Huyu Wu, Xueting Wang, Duo Su, Haiyun Jiang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	WeMath	Accuracy	70.75	317
Mathematical Reasoning	MathVision	Accuracy	30.19	168
Mathematical Reasoning	DynaMath	Accuracy	41.42	146
Multi-modal Understanding	MMBench EN	Accuracy	93.53	113
OCR Visual Question Answering	TextVQA	Accuracy	73.34	88
Document-oriented Visual Question Answering	DocVQA	Accuracy	92.88	84
Visual Question Answering	VQA-RAD	Overall Accuracy	63.64	67
OCR-based Visual Question Answering	OCRVQA	Mean Accuracy	81.46	63
Visual Question Answering	ST-VQA	Accuracy	84.37	42
Chart Question Answering	ChartQA	Accuracy	80.84	37

Showing 10 of 14 rows
