Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation

About

Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias, the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical and do not uncover its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab-klc/LVLM-Language-Bias.
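For readers unfamiliar with the DPO objective that LBP builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is background only and is not the paper's LBP: the abstract does not specify the penalty term, so none is shown. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    All names here are illustrative assumptions, not the paper's code.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy prefers the chosen response more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. In an LVLM, both log-probabilities are conditioned on the image and the prompt. Per the abstract's analysis, this kind of training often prioritizes textual improvements, which is the behavior LBP is designed to penalize.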

Yangneng Chen, Jing Li • 2026

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	VizWiz	Accuracy	55.1	1863
Multimodal Understanding	MMStar	Accuracy	44.7	511
OCR Evaluation	OCRBench	Score	37.2	350
Hallucination Evaluation	MMHal-Bench	MMHal Score	3.01	309
Visual Question Answering	GQA	Accuracy	63.6	218
Real-world Visual Question Answering	RealworldQA	Accuracy	55.5	183
Image Captioning	TextCaps	CIDEr	105.8	154
Science Question Answering	ScienceQA SQA-I	Accuracy	72.1	149
Multimodal Understanding	SEED-Bench Image	Accuracy	71.5	143
Multimodal Hallucination Evaluation	MMHal-Bench	Average Score	2.91	140

Showing 10 of 22 rows
