LLaVA-Video: Video Instruction Tuning With Synthetic Data
About
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset covers three key tasks: detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
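As a rough illustration of how the three task types could be packaged for instruction tuning, here is a minimal sketch of one record in LLaVA-style conversation JSON. The field names (`video`, `conversations`, `from`, `value`) and all contents are illustrative assumptions, not the released LLaVA-Video-178K schema.

```python
# Hypothetical sketch of a single video instruction-tuning record covering the
# three task types named above; field names and values are illustrative only.
import json

record = {
    "video": "clips/example_0001.mp4",  # hypothetical video path
    "conversations": [
        # Detailed captioning
        {"from": "human", "value": "<video>\nDescribe this video in detail."},
        {"from": "gpt", "value": "A chef slices vegetables, heats a pan, ..."},
        # Open-ended QA
        {"from": "human", "value": "What does the chef add to the pan first?"},
        {"from": "gpt", "value": "Oil, followed by the chopped onions."},
        # Multiple-choice QA
        {"from": "human",
         "value": "What is cooked last? (A) rice (B) onions (C) eggs"},
        {"from": "gpt", "value": "(C) eggs"},
    ],
}

print(json.dumps(record, indent=2))
```

Grouping all three task types as conversation turns over the same clip keeps the format uniform with existing visual instruction tuning data, which is how mixed training corpora are commonly assembled.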
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Accuracy | 56.5 | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 56.5 | 275 |
| Video Understanding | MVBench | Accuracy | 64.1 | 247 |
| Video Question Answering | NExT-QA (test) | Accuracy | 83.2 | 204 |
| Video Understanding | VideoMME | Overall Score | 75.9 | 192 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 58.2 | 139 |
| 3D Question Answering | ScanQA (val) | CIDEr | 88.7 | 133 |
| Long Video Understanding | LongVideoBench | Score | 58.2 | 110 |
| Video Question Answering | NExT-QA | Overall Accuracy | 83.2 | 105 |
| Video Question Answering | NExT-QA Multi-choice | Accuracy | 83.2 | 102 |