Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

About

Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, with the two modalities enhancing each other. Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, Video-LLaVA outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code: https://github.com/PKU-YuanGroup/Video-LLaVA
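
The contrast drawn in the abstract, one unified visual representation instead of separate per-modality feature spaces and projection layers, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that image and video features already live in a single aligned space and that one shared linear projection maps them into the LLM embedding space. The dimensions, the random weights, and the names `make_linear`, `project`, `VISUAL_DIM`, and `LLM_DIM` are all hypothetical and chosen only for illustration.

```python
import random

VISUAL_DIM = 4  # hypothetical size of the pre-aligned visual feature space
LLM_DIM = 6     # hypothetical size of the LLM embedding space


def make_linear(in_dim, out_dim, seed=0):
    """Build a toy weight matrix (out_dim x in_dim) with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(tokens, weight):
    """Apply the same linear map to every token vector in `tokens`."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


# Assumption: both encoders emit features in one shared space, so a single
# projection serves images and videos (no separate per-modality projection).
shared_projection = make_linear(VISUAL_DIM, LLM_DIM)

image_tokens = [[1.0, 0.0, 0.5, 0.2]]                         # toy: one image token
video_tokens = [[0.9, 0.1, 0.4, 0.3], [0.8, 0.2, 0.5, 0.1]]   # toy: two frame tokens
text_tokens = [[0.0] * LLM_DIM]                               # stand-in for an embedded text token

# Visual tokens from both modalities pass through the same projection and are
# concatenated with the text tokens to form the LLM input sequence.
llm_input = (
    project(image_tokens, shared_projection)
    + project(video_tokens, shared_projection)
    + text_tokens
)
print(len(llm_input), len(llm_input[0]))
```

Because a single weight matrix handles both modalities, any update to it during training affects image and video tokens alike, which is one way to read the abstract's claim that the two modalities benefit each other.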

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan • 2023

Related benchmarks

Task	Dataset	Metric	Value
Object Hallucination Evaluation	POPE	Accuracy	86.7	2019
Visual Question Answering	VizWiz	Accuracy	57.6	1820
Visual Question Answering	TextVQA	Accuracy	51.8	1453
Visual Question Answering	VQA v2	Accuracy	81.8	1429
Visual Question Answering	GQA	Accuracy	60.3	1425
Text-based Visual Question Answering	TextVQA	Accuracy	61.3	962
Multimodal Understanding	MMBench	Accuracy	64.2	847
Science Question Answering	ScienceQA	Accuracy	70	791
Multimodal Evaluation	MME	Score	1.84e+3	727
Multimodal Understanding	MM-Vet	MM-Vet Score	32	631

Showing 10 of 365 rows
