
ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

About

While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
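The core idea described above — scaling only the rejected-side implicit reward so the chosen distribution serves as a gradient-stable reference — can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the authors' implementation: the paper describes a dynamic, complexity-aware coefficient, which is simplified here to a constant `alpha`, and the function names and scalar log-probability interface are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO: symmetric implicit rewards for the chosen and rejected
    # responses, measured against a frozen reference policy.
    r_chosen = beta * (logp_chosen - ref_chosen)
    r_rejected = beta * (logp_rejected - ref_rejected)
    return -math.log(sigmoid(r_chosen - r_rejected))

def acpo_style_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1, alpha=0.5):
    # ACPO-style asymmetry (sketch): a scaling coefficient in (0, 1] is
    # applied ONLY to the rejected reward, damping the gradient that flows
    # through the rejected term while leaving the chosen term untouched.
    # In the paper this coefficient is complexity-aware and dynamic; here
    # it is a fixed constant for illustration.
    r_chosen = beta * (logp_chosen - ref_chosen)
    r_rejected = alpha * beta * (logp_rejected - ref_rejected)
    return -math.log(sigmoid(r_chosen - r_rejected))
```

With `alpha=1.0` the sketch reduces exactly to standard DPO; smaller values shrink the rejected-side contribution to the preference margin, which is the asymmetry the abstract credits with preventing the chosen likelihood from being dragged down alongside the rejected one.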

Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu · 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89.32 | 1455 |
| Multimodal Reasoning | MMStar | -- | -- | 143 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.4 | 99 |
| Hallucination Assessment | HallusionBench | -- | -- | 39 |
| Multimodal Reasoning | MMBench CN | Accuracy | 81.5 | 36 |
| Instruction Following | MM-IFEval | Score | 57 | 28 |
| Multimodal Reasoning | MMBench EN | Accuracy | 83 | 24 |
| Generative Hallucination Evaluation | AMBER | Score | 90.79 | 14 |
| OCR Parsing | OCRBench EN v2 | Score | 48.5 | 14 |
| OCR Parsing | OCRBench ZH v2 | Overall Score | 52 | 14 |
