EVLM: An Efficient Vision-Language Model for Visual Understanding
About
In the field of multi-modal language models, most methods are built on an architecture similar to LLaVA. These models feed a single-layer ViT feature directly into the language model as a visual prompt alongside the textual tokens. However, for long sequences of visual signals, such as videos, the self-attention mechanism of the language model incurs significant computational overhead. Additionally, single-layer ViT features make it difficult for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
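To make the three components concrete, below is a minimal, self-contained PyTorch sketch of how they could fit together; it is not the authors' released code. The hidden size, number of heads, the choice of ViT layers (8, 16, 24), the number of experts, and the dense soft routing in `MoEFFN` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Mixture-of-Experts feed-forward block: a router produces per-token
    weights over expert MLPs (dense soft routing for simplicity)."""
    def __init__(self, dim, num_experts=4, hidden=2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)                # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return torch.einsum("btde,bte->btd", outs, weights)

class CrossAttentionBlock(nn.Module):
    """Flamingo-style block: text tokens query visual features through
    cross-attention, so the LM's self-attention cost stays independent
    of the number of visual tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.moe = MoEFFN(dim)

    def forward(self, text, visual):
        h = self.norm(text)
        # query = text tokens, key/value = visual features
        attn_out, _ = self.xattn(h, visual, visual)
        text = text + attn_out
        return text + self.moe(text)

def hierarchical_vit_features(all_layer_feats, layer_ids=(8, 16, 24)):
    """Concatenate features from several ViT layers along the token axis
    so the LM receives both low- and high-level visual signals."""
    return torch.cat([all_layer_feats[i] for i in layer_ids], dim=1)

# Toy usage: 2 samples, 25 ViT layers of 256 patch tokens, hidden size 512.
feats = [torch.randn(2, 256, 512) for _ in range(25)]
visual = hierarchical_vit_features(feats)   # (2, 768, 512)
text = torch.randn(2, 32, 512)              # 32 text tokens
block = CrossAttentionBlock(dim=512)
out = block(text, visual)                   # (2, 32, 512)
```

Because the visual tokens enter only as keys and values of the cross-attention, lengthening the visual sequence (e.g. for video) grows the cross-attention cost linearly rather than quadratically inflating the LM's self-attention, which is the efficiency argument made above.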
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 67.5 | 1117 |
| Visual Question Answering | GQA | Accuracy | 64.4 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 89.7 | 935 |
| Visual Question Answering | VQAv2 | Accuracy | 81.9 | 177 |
| Diagram Understanding | AI2D | Accuracy | 76 | 167 |
| Multi-modal Understanding | MMBench CN | -- | -- | 162 |
| Multi-modal Understanding | MMBench EN | Overall Score | 76.9 | 39 |
| Visual Question Answering | VizWizQA | Accuracy | 47.3 | 21 |