Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
About
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are built on decoder-only LLMs with a causal attention mechanism, which limits the ability of earlier modalities (e.g., images) to incorporate information from later modalities (e.g., text). To address this problem, we propose an MLLM that unlocks causal attention into modality-mutual attention (MMA), enabling image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance on 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
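The difference between causal attention and the mask described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single sequence in which image tokens come before text tokens, that text tokens keep their ordinary causal mask, and that only image rows are "unlocked" to see text columns. The function names `causal_mask` and `modality_mutual_mask` are hypothetical.

```python
import numpy as np

def causal_mask(seq_len):
    """Standard causal mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def modality_mutual_mask(is_image):
    """Hypothetical MMA-style mask (an assumption, not the paper's code).

    is_image[i] is True when token i is an image token. Starting from the
    causal mask, every image token (row) is additionally allowed to attend
    to every text token (column), including text that appears later.
    Text rows are left causal.
    """
    is_image = np.asarray(is_image, dtype=bool)
    mask = causal_mask(is_image.shape[0])
    # Rows that are image tokens, columns that are text tokens.
    unlock = is_image[:, None] & ~is_image[None, :]
    return mask | unlock

# Three image tokens followed by two text tokens.
is_image = [True, True, True, False, False]
mask = modality_mutual_mask(is_image)
print(mask.astype(int))
```

Because the change is confined to the boolean mask, a sketch like this adds no learnable parameters, which is consistent with the abstract's claim; how the mask is wired into a particular attention implementation is left open here.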
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 88.1 | 2019 |
| Multimodal Capability Evaluation | MM-Vet | Score: 40.8 | 393 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy: 38.7 | 216 |
| Multimodal Understanding | SEED-Bench Image | Accuracy: 69.4 | 143 |
| Mathematical Reasoning | MathVista mini | Accuracy: 32.1 | 135 |
| Vision-centric Reasoning | RealworldQA | Accuracy: 62.9 | 66 |
| Multimodal Understanding | MME Perception | -- | 59 |
| Multimodal Understanding | MME Cognition | Score: 362.9 | 45 |
| Multimodal Understanding | LLaVAW | Score: 74.6 | 24 |
| Computer Vision Reasoning | CV-Bench-3D | Accuracy: 71.8 | 11 |