LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

About

In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is proved to surpass previous methods on most of video- or image-based benchmarks. Code is available https://github.com/dvlab-research/LLaMA-VID}{https://github.com/dvlab-research/LLaMA-VID

Yanwei Li, Chengyao Wang, Jiaya Jia• 2023

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy79.1	2019
Visual Question Answering	VizWiz	Accuracy54.3	1820
Visual Question Answering	VQA v2	Accuracy80	1429
Visual Question Answering	GQA	Accuracy65	1425
Multimodal Understanding	MMBench	Accuracy55.5	847
Video Understanding	MVBench	Accuracy41.9	563
Visual Question Answering	GQA	Accuracy52.9	524
Multimodal Understanding	SEED-Bench	Accuracy62.3	516
Video Question Answering	MSRVTT-QA	Accuracy58.9	505
Video Question Answering	ActivityNet-QA	Accuracy49.1	418

Showing 10 of 187 rows

...

Other info

Code

Follow for update

@wizwand_team Discord