Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

About

Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
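The abstract does not give the exact form of the hybrid loss, so the following is only a minimal sketch of one way cosine, $\mathcal{L}_{1}$, and contrastive terms could be blended. The function name `hybrid_loss`, the weights `w_cos`, `w_l1`, `w_con`, the temperature value, and the choice of a symmetric InfoNCE-style contrastive term are all assumptions made for illustration, not the authors' formulation. It uses plain Python lists so it runs without any dependencies.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-12)


def _cross_entropy_diag(sim, temperature):
    """Mean cross-entropy over rows, where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.07):
    """Hypothetical blend of cosine, L1, and contrastive terms.

    audio[i] and text[i] are assumed to be a matching pair of embeddings.
    The weights and temperature are illustrative defaults, not values from the paper.
    """
    n = len(audio)
    pairs = list(zip(audio, text))

    # Pairwise terms: depend only on each matched pair, not on batch size.
    cos_term = sum(1.0 - _cosine(a, t) for a, t in pairs) / n
    l1_term = sum(
        sum(abs(x - y) for x, y in zip(a, t)) / len(a) for a, t in pairs
    ) / n

    # Contrastive term: symmetric over audio-to-text and text-to-audio directions.
    sim = [[_cosine(a, t) for t in text] for a in audio]
    sim_t = [list(col) for col in zip(*sim)]
    con_term = 0.5 * (
        _cross_entropy_diag(sim, temperature)
        + _cross_entropy_diag(sim_t, temperature)
    )

    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term
```

In this sketch the cosine and $\mathcal{L}_{1}$ terms are computed per matched pair, so they still provide a training signal when the batch contains few negatives; with a batch of one, the contrastive term is exactly zero and only the pairwise terms remain. This is one plausible reading of why such a blend could help under small-batch constraints.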

Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi, Hitesh Laxmichand Patel, Paul Li, Kyu J. Han, Tao Sheng, Sujith Ravi, Dan Roth • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Audio Retrieval	AudioCaps	Recall@1 35.2	57
Audio-to-Text Retrieval	Clotho	R@1 18.3	49
Text-to-Audio Retrieval	Clotho	R@1 0.158	31
Audio-to-Text Retrieval	AudioCaps	R@1 45.1	22
Audio-to-Text Retrieval	ESC50	Recall@1 95	3
Audio-to-Text Retrieval	FSD50K	R@1 69.7	3
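
The results above are reported as Recall@1 (R@1). As a rough illustration of how Recall@K is commonly computed for retrieval, the sketch below assumes a square similarity matrix in which query i has exactly one correct candidate, at index i. That one-to-one pairing is a simplification (captioning datasets often provide several captions per clip), and the helper name `recall_at_k` is hypothetical rather than taken from the paper.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose correct candidate appears in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumes the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy example: three queries scored against three candidates.
scores = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own candidate first
    [0.2, 0.3, 0.8],  # query 1 ranks candidate 2 first, its own second
    [0.1, 0.2, 0.7],  # query 2 ranks its own candidate first
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Here `r1` is 2/3 because query 1 misses at rank 1, and `r2` is 1.0 because every correct candidate falls within the top two. Whether a result is written as a fraction or a percentage varies between benchmarks.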
