Ovis: Structural Embedding Alignment for Multimodal Large Language Model

About

Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with a pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between the two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- poses challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
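
The contrast between a textual look-up and the "probabilistic combination" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax over per-patch scores, the table sizes, and the names `text_embedding`, `visual_embedding`, `visual_table`, and `patch_logits` are all assumptions made for illustration, and the random values stand in for learned parameters and encoder outputs.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def text_embedding(token_id, table):
    """Textual look-up: a single index selects a single row of the table."""
    return table[token_id]


def visual_embedding(patch_logits, table):
    """Probabilistic look-up (assumed form): weight every row of the visual
    embedding table by the probability the patch assigns to that entry,
    then sum the weighted rows."""
    probs = softmax(patch_logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Hypothetical sizes; a real table would be far larger and learned.
VOCAB, DIM = 8, 4
random.seed(0)
visual_table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

# Stand-in for the scores a vision encoder head would emit for one patch.
patch_logits = [random.gauss(0, 1) for _ in range(VOCAB)]
emb = visual_embedding(patch_logits, visual_table)
print(len(emb))  # one embedding of size DIM per patch
```

When a patch puts all of its probability on one table entry, `visual_embedding` reduces to the same single-row look-up as `text_embedding`, which is the sense in which the two embedding strategies are structurally aligned in this sketch.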

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye • 2024

Related benchmarks

Task                             Dataset        Result              Rank
Object Hallucination Evaluation  POPE           Accuracy 88.6       2019
Multimodal Understanding         MMBench        Accuracy 84.8       847
Science Question Answering       ScienceQA      --                  791
Multimodal Evaluation            MME            --                  727
Multimodal Reasoning             MM-Vet         MM-Vet Score 50.9   517
Mathematical Reasoning           MathVista      Score 65.6          474
Multimodal Understanding         MMMU           Accuracy 57.4       437
Multimodal Understanding         MMStar         Accuracy 64.6       407
Diagram Question Answering       AI2D           AI2D Accuracy 86.6  387
GUI Grounding                    ScreenSpot v2  Avg Accuracy 89.5   371
Showing 10 of 98 rows