Synchformer: Efficient Synchronization from Sparse Cues
About
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman • 2024
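
To make the phrase "segment-level contrastive pre-training" in the abstract more concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss over paired audio and visual segment features. It is an illustration under stated assumptions, not the authors' implementation: the function name `segment_contrastive_loss`, the temperature value, and the choice to treat every non-matching segment in the batch as a negative are all hypothetical.

```python
import math


def _normalize(vec):
    # Scale a feature vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _mean_diagonal_nll(logits):
    # Average cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Symmetric InfoNCE over N paired segments (hypothetical sketch).

    audio_feats[i] and visual_feats[i] are assumed to come from the same
    temporal segment; all other pairings are treated as negatives.
    """
    audio = [_normalize(v) for v in audio_feats]
    visual = [_normalize(v) for v in visual_feats]

    # Cosine similarity between every audio and every visual segment.
    logits = [
        [sum(a * b for a, b in zip(a_vec, v_vec)) / temperature for v_vec in visual]
        for a_vec in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (_mean_diagonal_nll(logits) + _mean_diagonal_nll(logits_t))


# Toy features: matching segments share a direction, so the loss is small.
aligned = segment_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
# Swapping the visual segments breaks the pairing, so the loss is large.
shuffled = segment_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss drops when audio and visual features from the same segment agree and rises when the pairing is shuffled, which is the signal a feature extractor would be trained on before a separate synchronization module is fitted on top.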
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video-audio synchrony classification | JavisBench 1.0 (val) | AUROC 0.5742 | 5 |
| Audio-Video Synchronization | Sora 2 | Score 0.167 | 3 |
| Audio-Video Synchronization | Veo 3 | Score 0.191 | 3 |