JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
About
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge the LLM to a pretrained JAV-DiT generator. This design enables temporally coherent audio-video understanding and generation from multimodal instructions. We design an effective three-stage training pipeline, consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation capabilities on top of existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse, multi-level comprehension and generation scenarios. Experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
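As a rough illustration of the progressive three-stage schedule described above, the following minimal Python sketch runs the stages in order, passing the model state from one stage to the next. It is not the authors' code: only the stage names come from the abstract, while `run_pipeline`, `train_stage`, and `record_stage` are hypothetical names introduced for illustration, and the "state" is a plain list standing in for model weights.

```python
from typing import Callable, List, Sequence

# Stage names are taken from the abstract; everything else is an assumption.
STAGES: List[str] = [
    "multimodal pretraining",
    "audio-video fine-tuning",
    "large-scale instruction-tuning",
]


def run_pipeline(
    state: list,
    stages: Sequence[str],
    train_stage: Callable[[list, str], list],
) -> list:
    """Apply each stage in order, so later stages build on earlier ones."""
    for stage in stages:
        state = train_stage(state, stage)
    return state


def record_stage(state: list, stage: str) -> list:
    """Hypothetical stand-in for real training: it only logs the stage name."""
    return state + [stage]


history = run_pipeline([], STAGES, record_stage)
print(history)
```

In a real setup, `train_stage` would be replaced by an actual training routine; what each stage trains, and on which data, is not specified in this summary.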
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench (test) | Accuracy | 68.4 | 97 |
| Video Question Answering | ActivityNet (test) | Accuracy | 58.1 | 57 |
| Video Perception | Perception (test) | Accuracy | 70.2 | 36 |
| Audio Question Answering | ClothoAQA (test) | Accuracy | 67.3 | 14 |
| Audio-Visual Question Answering | AVQA (test) | Total Accuracy | 93.8 | 13 |
| Audio-Video Understanding | MU-AVQA (test) | Accuracy | 82.1 | 9 |
| Audio-Video Understanding | AVSD (test) | Accuracy | 62.2 | 9 |
| Audio Comprehension | TUT 2017 (test) | Accuracy | 0.821 | 8 |
| Text-to-Audio-Video Generation | JavisBench mini (test) | FVD | 317.5 | 5 |