Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

About

Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
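As an illustration of what a hybrid loss of this kind could look like, here is a minimal NumPy sketch that blends a cosine term, an $\mathcal{L}_{1}$ term, and a symmetric contrastive (InfoNCE-style) term over a batch of paired embeddings. The abstract does not give the exact formulation, so the function name `hybrid_loss`, the weights `w_cos`, `w_l1`, `w_con`, the `temperature` value, and the choice of InfoNCE for the contrastive part are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.07):
    """Hypothetical hybrid loss; audio and text are (batch, dim) arrays
    where row i of each forms a matched audio-text pair."""
    a = l2_normalize(audio)
    t = l2_normalize(text)

    # Cosine term: 1 - cosine similarity of each matched pair.
    cos_term = np.mean(1.0 - np.sum(a * t, axis=1))

    # L1 term: mean absolute difference between matched embeddings.
    l1_term = np.mean(np.abs(a - t))

    # Contrastive term (assumed symmetric InfoNCE): each audio should score
    # its own text highest within the batch, and vice versa.
    logits = (a @ t.T) / temperature

    def cross_entropy_on_diagonal(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    con_term = 0.5 * (cross_entropy_on_diagonal(logits)
                      + cross_entropy_on_diagonal(logits.T))

    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term
```

In this sketch the cosine and $\mathcal{L}_{1}$ terms depend only on matched pairs, so they still provide a training signal when the batch is small and the contrastive term sees few negatives. That is one plausible reading of the small-batch stability claim, not a confirmed description of the method.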

Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi, Hitesh Laxmichand Patel, Paul Li, Kyu J. Han, Tao Sheng, Sujith Ravi, Dan Roth • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Audio Retrieval | AudioCaps | Recall@1 | 35.2 | 57 |
| Audio-to-Text Retrieval | Clotho | R@1 | 18.3 | 49 |
| Text-to-Audio Retrieval | Clotho | R@1 | 0.158 | 31 |
| Audio-to-Text Retrieval | AudioCaps | R@1 | 45.1 | 22 |
| Audio-to-Text Retrieval | ESC50 | Recall@1 | 95 | 3 |
| Audio-to-Text Retrieval | FSD50K | R@1 | 69.7 | 3 |
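For readers unfamiliar with the metric, Recall@K (R@K) is the fraction of queries whose correct item appears among the top K retrieved candidates. The sketch below computes it from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query. Benchmarks with several captions per clip need a different ground-truth mapping, and the page does not state how its reported numbers were computed. The function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(similarity, k=1):
    """similarity[i, j] is the score of query i against candidate j.
    Assumes the correct candidate for query i is candidate i."""
    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    # A query counts as a hit if its target index appears in its top k.
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries scored against 3 candidates.
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
])
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

The function returns a fraction in [0, 1]; multiply by 100 to express it as a percentage.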
