
LLaVA-Video: Video Instruction Tuning With Synthetic Data

About

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 88.2 | 1455 |
| Video Understanding | MVBench | Accuracy | 100 | 425 |
| Video Question Answering | ActivityNet-QA | Accuracy | 56.5 | 376 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 56.5 | 288 |
| Long Video Understanding | LongVideoBench | Score | 100 | 248 |
| Video Understanding | VideoMME | Score (Long) | 53.1 | 248 |
| Video Understanding | VideoMME | Overall Score | 75.9 | 222 |
| Video Understanding | MLVU | Score | 67.4 | 221 |
| 3D Question Answering | ScanQA (val) | METEOR | 17.7 | 217 |
| Video Question Answering | VideoMME | Accuracy | 64.4 | 210 |
Showing 10 of 295 rows
...
