Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
About
Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio because they rely on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15 dB) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
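To make the hybrid loss idea concrete, the following is a minimal NumPy sketch of one way a cosine term, an $\mathcal{L}_{1}$ term, and a contrastive term could be blended. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function name `hybrid_loss`, the weights `w_cos`, `w_l1`, `w_con`, the `temperature` value, the choice of a symmetric InfoNCE-style contrastive term, and the weighted-sum combination.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.07):
    """Illustrative hybrid loss; not the authors' implementation.

    audio, text: arrays of shape (batch, dim); row i of each is a matched pair.
    The weights and temperature are hypothetical defaults.
    """
    a = l2_normalize(audio)
    t = l2_normalize(text)

    # Cosine term: 1 - cosine similarity of each matched pair.
    cos_term = np.mean(1.0 - np.sum(a * t, axis=1))

    # L1 term: mean absolute difference between matched embeddings.
    l1_term = np.mean(np.abs(a - t))

    # Contrastive term (assumed symmetric InfoNCE): each audio clip should
    # score its own caption highest within the batch, and vice versa.
    logits = a @ t.T / temperature

    def cross_entropy_on_diagonal(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    con_term = 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T)
    )

    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    print("aligned pairs: ", hybrid_loss(emb, emb))
    print("shuffled pairs:", hybrid_loss(emb, emb[::-1].copy()))
```

In this sketch the cosine and $\mathcal{L}_{1}$ terms depend only on each matched pair, not on the other items in the batch, which is one plausible reason such a blend might remain informative when the batch is too small to supply many contrastive negatives.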
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Audio Retrieval | AudioCaps | Recall@1 | 35.2 | 57 |
| Audio-to-Text Retrieval | Clotho | R@1 | 18.3 | 49 |
| Text-to-Audio Retrieval | Clotho | R@1 | 0.158 | 31 |
| Audio-to-Text Retrieval | AudioCaps | R@1 | 45.1 | 22 |
| Audio-to-Text Retrieval | ESC50 | Recall@1 | 95 | 3 |
| Audio-to-Text Retrieval | FSD50K | R@1 | 69.7 | 3 |