
ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

About

The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme to ensure that consistent knowledge contributes to optimization. Experiments on two benchmark datasets show state-of-the-art performance, supporting the efficacy of the proposed ASK framework.
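The mini-batch contrastive objective that the abstract names as the source of the GLB can be sketched as below. This is a generic symmetric InfoNCE-style loss, not the ASK method; the function names and the temperature of 0.07 are assumptions chosen for illustration. Note that every term in the loss compares items from the same batch only, so nothing outside the batch can influence the gradient.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax_row(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Hypothetical helper: audio_embs[i] and text_embs[i] form a matched pair.

    All negatives come from the same mini-batch; this is the locality
    the abstract refers to. The temperature value is an assumption.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    n = len(audio)
    # Cosine similarities between every in-batch audio and text item.
    sim = [[dot(a, t) / temperature for t in text] for a in audio]
    # Audio-to-text direction: each audio should pick its own caption.
    a2t = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same computation on the transposed matrix.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the swapped ones, which is all the objective can see: pairs and negatives drawn from the current batch.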

Siyuan Fu, Xuchen Guo, Mingjun Liu, Hongxiang Li, Boyin Tan, Gongxi Zhu, Xianwei Zhuang, Jinghan Ru, Yuxin Xie, Yuguo Yin • 2025

Related benchmarks

Task                     Dataset        Result     Rank
Audio-to-Text Retrieval  Clotho (test)  R@1 14.1   78
Text-to-Audio Retrieval  Clotho (test)  R@1 11.9   62
Audio Retrieval          AudioCaps      R@1 43.7   42
Audio-to-Text Retrieval  Clotho         R@1 0.195  4
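For readers unfamiliar with the R@1 metric reported above, here is a minimal sketch of recall at k, assuming a query-by-candidate similarity matrix and a set of relevant candidate indices per query. The function name and the toy data are hypothetical; the sketch returns a fraction in [0, 1], and whether a given benchmark reports that fraction or a percentage is not stated in the listing.

```python
def recall_at_k(similarity, relevant, k=1):
    """Fraction of queries with at least one relevant item in the top k.

    similarity[q][j] is the score of candidate j for query q;
    relevant[q] is the set of candidate indices that count as correct.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant[q] for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)

# Toy example: 3 queries, 3 candidates, one relevant candidate each.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
rel = [{0}, {1}, {2}]
print(recall_at_k(sim, rel, k=1))
print(recall_at_k(sim, rel, k=2))
```

Using a set of relevant indices per query covers datasets where one audio clip has several reference captions, so a hit on any of them counts.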
