
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

About

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions that extend representations to audio and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.

Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu • 2025
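
To make the idea of pairwise contrastive objectives in the abstract concrete, here is a minimal sketch of a symmetric InfoNCE loss averaged over every pair of views. It is an illustration under stated assumptions, not the authors' implementation: the function names (`info_nce`, `multi_pair_loss`), the temperature value, and the three example views are hypothetical, and the abstract does not say which ten pairs PE-AV actually uses.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(batch_a, batch_b, temperature=0.07):
    """Symmetric InfoNCE between two batches; item i of each batch is a positive pair.

    The temperature is an assumed value, not taken from the paper.
    """
    a = [l2_normalize(v) for v in batch_a]
    b = [l2_normalize(v) for v in batch_b]
    n = len(a)
    # Cosine similarities scaled by temperature: logits[i][j] = <a_i, b_j> / T
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def multi_pair_loss(views, temperature=0.07):
    """Average the symmetric loss over all pairs of views (dict: name -> batch)."""
    per_pair = {
        (m, n): info_nce(views[m], views[n], temperature)
        for m, n in combinations(sorted(views), 2)
    }
    return sum(per_pair.values()) / len(per_pair), per_pair


if __name__ == "__main__":
    # Toy, hand-made embeddings for three hypothetical views of two clips.
    views = {
        "audio": [[1.0, 0.1], [0.0, 1.0]],
        "video": [[0.9, 0.0], [0.1, 1.0]],
        "text": [[1.0, 0.0], [0.0, 0.8]],
    }
    total, per_pair = multi_pair_loss(views)
    print(total, sorted(per_pair))
```

Adding another view (for example a second caption type) only adds entries to the `views` dictionary, and the number of pairwise terms grows accordingly. That is one simple way to read the abstract's claim about scaling cross-modality and caption-type pairs.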

Related benchmarks

Task                     Dataset           Result          Rank
Text-to-Video Retrieval  DiDeMo            R@1 0.516       360
Text-to-Video Retrieval  MSVD              R@1 60.8        218
Text-to-Audio Retrieval  AudioCaps (test)  Recall@1 45.8   145
Video-to-Text Retrieval  DiDeMo            R@1 51.7        108
Text-to-Video Retrieval  VATEX             R@1 95.1        95
Video-to-Text Retrieval  MSVD              R@1 88.4        93
Audio-to-Text Retrieval  Clotho (test)     R@1 32.7        78
Video-to-Text Retrieval  VATEX             Recall@1 94.8   68
Audio Classification     ESC50             Top-1 Acc 96    64
Video Retrieval          UCF101            Top-1 Acc 90.4  63
Showing 10 of 74 rows
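
Most rows above report Recall@1 (R@1). As a rough illustration of how such a number can be computed from a query-by-candidate similarity matrix, here is a minimal sketch. It assumes exactly one correct candidate per query, located at the same index as the query. Real benchmarks may have several valid captions per clip and their own evaluation protocols, so this is not the procedure behind the results listed here, and the function name `recall_at_k` is hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground truth for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy scores: queries 0 and 2 retrieve their match first, query 1 does not.
    scores = [
        [0.9, 0.1, 0.0],
        [0.2, 0.3, 0.8],
        [0.1, 0.2, 0.7],
    ]
    print(recall_at_k(scores, k=1))
    print(recall_at_k(scores, k=2))
```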
