JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
About
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built on the Diffusion Transformer (DiT) architecture, JavisDiT simultaneously generates high-quality audio and video content from open-ended user prompts in a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, which guide the synchronization between the visual and auditory components. We also propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios. In addition, we devise a robust metric for measuring the synchrony between generated audio-video pairs in real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and data are available at https://javisverse.github.io/JavisDiT-page/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Joint audio-video generation | JavisBench 1.0 (test) | AV-IB | 0.197 | 18 |
| Text-to-Audio-Video Generation | Verse-Bench | MS | 0.18 | 16 |
| Joint Video-Audio Generation | Landscape (test) | FVD | 94.2 | 9 |
| Audio-to-video generation (A2V) | AIST++ (test) | FVD | 86.7 | 6 |
| Video-audio synchrony classification | JavisBench 1.0 (val) | AUROC | 0.6533 | 5 |
| Text-to-Audio-Video Generation | JavisBench mini (test) | FVD | 327.8 | 5 |
| Audio Generation | JavisBench mini OoD 4s (test) | FAD | 8.11 | 3 |
| Audio Generation | AudioCaps InD, 10s (test) | FAD | 5.19 | 3 |
| Audio Generation | AudioCaps InD, 4s (test) | FAD | 6.23 | 2 |
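
To help read the AUROC figure reported for video-audio synchrony classification, the following is a minimal sketch of how AUROC is generally computed from binary labels and classifier scores. It is not the paper's evaluation code: the function name `auroc`, the label convention (1 = synchronized, 0 = not synchronized), and the toy scores are all assumptions made for illustration. AUROC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for a synchronized pair, 0 otherwise (assumed convention).
    scores: higher means the classifier believes the pair is more synchronized.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from JavisBench.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
print(auroc(labels, scores))
```

On this toy data, 7 of the 9 positive-negative pairs are ranked correctly. A score of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of examples, so a rank-based implementation is preferable for large evaluation sets.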