
ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

About

The dominant paradigm for Audio-Text Retrieval (ATR) relies on dual-encoder architectures optimized via mini-batch contrastive learning. However, restricting optimization to local in-batch samples creates a fundamental limitation we term the Gradient Locality Bottleneck (GLB), which prevents the resolution of acoustic ambiguities and hinders the learning of rare long-tail concepts. While external knowledge injection can break this bottleneck, it often triggers a problem called Representation-Drift Mismatch (RDM), where a static knowledge base becomes misaligned with evolving encoders, degrading guidance into noise. To address these intertwined challenges, we propose the Adaptive Self-improving Knowledge (ASK) framework. ASK breaks the GLB via multi-grained knowledge injection and mitigates RDM through a dynamic refinement strategy that synchronizes the knowledge base with the model. Additionally, an adaptive reliability weighting scheme is employed to filter retrieval noise based on cross-modal consistency. Extensive experiments across multiple benchmarks demonstrate that ASK consistently achieves new state-of-the-art performance across various backbones.
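As a rough illustration of the baseline the abstract refers to, the sketch below computes a symmetric in-batch contrastive loss for a dual encoder. It is not the ASK method and not the authors' code. The function name, the temperature of 0.07, and the toy embeddings are assumptions chosen for illustration. Because every negative comes from the same mini-batch, the loss only ever compares a pair against its batch neighbours, which is the locality the abstract describes.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax_at(row, idx):
    """Log-probability of entry `idx` under a softmax over `row` (numerically stable)."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss (illustrative sketch).

    audio_emb[i] and text_emb[i] are assumed to be a matching pair;
    every other item in the batch acts as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    a = [_normalize(v) for v in audio_emb]
    t = [_normalize(v) for v in text_emb]
    n = len(a)

    # Pairwise cosine similarities between all audio and text items in the batch.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(_log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-audio direction: each column should peak on its diagonal entry.
    t2a = -sum(
        _log_softmax_at([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2t + t2a)


# Toy batch: three orthogonal embeddings, paired correctly and then misaligned.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = eye[1:] + eye[:1]
aligned_loss = symmetric_contrastive_loss(eye, eye)
mismatched_loss = symmetric_contrastive_loss(eye, shifted)
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the misaligned pairing gives a much larger one.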

Siyuan Fu, Xuchen Guo, Mingjun Liu, Hongxiang Li, Boyin Tan, Gongxi Zhu, Xianwei Zhuang, Jinghan Ru, Yuxin Xie, Yuguo Yin • 2025

Related benchmarks

Task                     Dataset        Result      Rank
Audio-to-Text Retrieval  Clotho (test)  R@1: 14.1   85
Text-to-Audio Retrieval  Clotho (test)  R@1: 11.9   69
Audio Retrieval          AudioCaps      R@1: 43.7   50
Audio-to-Text Retrieval  Clotho         R@1: 0.195  4
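For readers unfamiliar with the R@1 metric reported above, here is a minimal sketch of Recall@K computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query. Benchmarks with several captions per clip would need a set of correct indices per query instead. The function name and the toy matrix are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate appears in the top-k results.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: queries 0 and 2 rank their own candidate first, query 1 does not.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_2 = recall_at_k(toy_similarity, k=2)
```

The function returns a fraction between 0 and 1. Leaderboards may report the same quantity either as a fraction or as a percentage, so the scale should be checked before comparing rows.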
