
LLaVA-Video: Video Instruction Tuning With Synthetic Data

About

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Accuracy | 56.5 | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 56.5 | 275 |
| Video Understanding | MVBench | Accuracy | 64.1 | 247 |
| Video Question Answering | NExT-QA (test) | Accuracy | 83.2 | 204 |
| Video Understanding | VideoMME | Overall Score | 75.9 | 192 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 58.2 | 139 |
| 3D Question Answering | ScanQA (val) | CIDEr | 88.7 | 133 |
| Long Video Understanding | LongVideoBench | Score | 58.2 | 110 |
| Video Question Answering | NExT-QA | Overall Accuracy | 83.2 | 105 |
| Video Question Answering | NExT-QA Multi-choice | Accuracy | 83.2 | 102 |

Showing 10 of 173 rows.
