Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

About

Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
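The idea of giving each branch its own visibility pattern can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the keep ratios, the per-token `teacher_scores`, and the "keep the highest-scoring tokens" rule are all assumptions made for illustration, since the abstract does not specify how the teacher organizes visible tokens.

```python
import random

def random_visible(num_tokens, keep_ratio, rng):
    """Reconstruction path: keep a uniformly random subset of token indices."""
    keep = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), keep))

def teacher_guided_visible(teacher_scores, keep_ratio):
    """Contrastive path (assumed rule): keep the tokens the teacher scores highest."""
    keep = max(1, int(len(teacher_scores) * keep_ratio))
    ranked = sorted(range(len(teacher_scores)),
                    key=lambda i: teacher_scores[i], reverse=True)
    return sorted(ranked[:keep])

rng = random.Random(0)
# Hypothetical per-token scores from a teacher model (illustrative values only).
teacher_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]

# Each path draws its own mask, so the contrastive branch no longer
# inherits the random pattern chosen for reconstruction.
recon_visible = random_visible(len(teacher_scores), 0.25, rng)
align_visible = teacher_guided_visible(teacher_scores, 0.5)
```

In this toy setup the two paths see different token subsets and could also use different keep ratios, which is the sense in which the masking regimes are decoupled.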

Linge Wang, Yingying Chen, Bingke Zhu, Lu Zhou, Jinqiao Wang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Classification | VGGSound | Top-1 Acc 52.7 | 37
Audio-to-Video Retrieval | VGGSound (test) | Recall@1 30.3 | 13
Video-to-Audio Retrieval | VGGSound (test) | Recall@1 31.3 | 11
Audio-Visual Event Classification | AudioSet 20K | -- | 11
Audio-to-Vision Retrieval | AudioSet (eval) | Recall@1 37.1 | 9
Vision-to-Audio Retrieval | AudioSet (eval) | Recall@1 37.4 | 9
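
For readers unfamiliar with the retrieval metric reported above, Recall@K is the fraction of queries whose true match appears among the K highest-scoring candidates. The sketch below assumes the usual paired setup in which query i matches candidate i. The similarity values are made up, and the exact evaluation protocol behind these benchmark numbers (candidate pool, similarity function) is not stated on this page.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j] scores query i against candidate j.

    Assumes the correct match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)

# Toy audio-to-video similarity matrix (illustrative values only).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)  # queries 0 and 2 rank their match first
r2 = recall_at_k(sim, k=2)  # query 1 recovers its match within the top 2
```

Transposing the matrix gives the opposite retrieval direction, which is why audio-to-video and video-to-audio scores are reported separately.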