LLaVA-Video: Video Instruction Tuning With Synthetic Data

About

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88.2	2019
Video Understanding	MVBench	Accuracy 100	563
Video Question Answering	ActivityNet-QA	Accuracy 56.5	418
Video Understanding	VideoMME	Score (Overall) 64.2	357
3D Question Answering	ScanQA (val)	CIDEr 88.7	290
Video Question Answering	ActivityNet-QA (test)	Accuracy 56.5	288
Long Video Understanding	LongVideoBench	Score 100	269
Streaming Video Understanding	StreamingBench	Overall 66.96	259
Spatial Reasoning	VSI-Bench	Avg Score 40.9	255
Video Question Answering	VideoMME	Accuracy 64.4	251

Showing 10 of 349 rows