LLaVA-Video: Video Instruction Tuning With Synthetic Data

About

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 88.2 | 2019 |
| Video Understanding | MVBench | Accuracy | 100 | 563 |
| Video Question Answering | ActivityNet-QA | Accuracy | 56.5 | 418 |
| Video Understanding | VideoMME | Score (Overall) | 64.2 | 357 |
| 3D Question Answering | ScanQA (val) | CIDEr | 88.7 | 290 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 56.5 | 288 |
| Long Video Understanding | LongVideoBench | Score | 100 | 269 |
| Streaming Video Understanding | StreamingBench | Overall | 66.96 | 259 |
| Spatial Reasoning | VSI-Bench | Avg Score | 40.9 | 255 |
| Video Question Answering | VideoMME | Accuracy | 64.4 | 251 |
Showing 10 of 349 rows