Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
About
We introduce Perception Encoder Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV extends representations to audio and natively supports joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data spans speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
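The exact training objectives are not shown on this page; the sketch below illustrates the general pattern of summing a symmetric InfoNCE loss over multiple modality pairs, assuming L2-normalized embeddings and a fixed temperature. Function names and the pair list are illustrative, not the paper's API.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings (N, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) cosine-similarity logits
    labels = np.arange(len(a))              # matching pairs sit on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the a->b and b->a retrieval directions
    return 0.5 * (xent(logits) + xent(logits.T))

def multi_pair_loss(embeddings, pairs):
    """Sum InfoNCE over a list of (modality, modality) pairs.

    embeddings: dict mapping modality name -> (N, D) array, row i paired
    across all modalities. The pair list here is hypothetical.
    """
    return sum(info_nce(embeddings[m1], embeddings[m2]) for m1, m2 in pairs)
```

A usage example with random embeddings: `multi_pair_loss({"audio": xa, "video": xv, "text": xt}, [("audio", "video"), ("audio", "text"), ("video", "text")])` sums three pairwise terms; scaling the number of such modality and caption-type pairs is what the abstract reports as strengthening alignment.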
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.516 | 360 |
| Text-to-Video Retrieval | MSVD | R@1 | 60.8 | 218 |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 45.8 | 145 |
| Video-to-Text Retrieval | DiDeMo | R@1 | 51.7 | 108 |
| Text-to-Video Retrieval | VATEX | R@1 | 95.1 | 95 |
| Video-to-Text Retrieval | MSVD | R@1 | 88.4 | 93 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 32.7 | 78 |
| Video-to-Text Retrieval | VATEX | Recall@1 | 94.8 | 68 |
| Audio Classification | ESC50 | Top-1 Acc | 96 | 64 |
| Video Retrieval | UCF101 | Top-1 Acc | 90.4 | 63 |