Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning

About

Retrieval-augmented generation can improve audio captioning by incorporating relevant audio-text pairs from a knowledge base. Existing methods typically rely solely on the input audio as a unimodal retrieval query. In contrast, we propose Generation-Assisted Multimodal Querying, which generates a text description of the input audio to enable multimodal querying. This approach aligns the query modality with the audio-text structure of the knowledge base, leading to more effective retrieval. Furthermore, we introduce a novel progressive learning strategy that gradually increases the number of interleaved audio-text pairs to enhance the training process. Our experiments on AudioCaps, Clotho, and Auto-ACD demonstrate that our approach achieves state-of-the-art results across these benchmarks.

Choi Changin, Lim Sungjun, Rhee Wonjong • 2024
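
The two ideas in the abstract, querying the knowledge base with both the input audio and a generated caption, and gradually increasing the number of retrieved pairs during training, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not specify the encoders, how the audio and text similarities are combined, or the shape of the progressive schedule. The names `multimodal_retrieve`, `pairs_for_epoch`, and `text_weight` are hypothetical, embeddings are assumed to be precomputed vectors, and a weighted sum of cosine similarities and a linear ramp are stand-ins rather than the authors' method.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One knowledge-base entry: (audio embedding, text embedding, caption).
KBEntry = Tuple[Vector, Vector, str]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multimodal_retrieve(
    query_audio: Vector,
    query_text: Vector,
    kb: List[KBEntry],
    k: int,
    text_weight: float = 0.5,
) -> List[str]:
    """Return the captions of the k best-matching knowledge-base pairs.

    query_text is assumed to be the embedding of a caption generated for
    the input audio. The weighted sum below is an illustrative fusion
    rule, not one taken from the paper. text_weight=0.0 reduces to
    audio-only (unimodal) querying.
    """
    scored = []
    for audio_emb, text_emb, caption in kb:
        score = (1.0 - text_weight) * cosine(query_audio, audio_emb)
        score += text_weight * cosine(query_text, text_emb)
        scored.append((score, caption))
    # Python's sort is stable, so ties keep knowledge-base order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def pairs_for_epoch(epoch: int, total_epochs: int, max_pairs: int) -> int:
    """Linear ramp (an assumed schedule) from 0 to max_pairs retrieved pairs."""
    if total_epochs <= 1:
        return max_pairs
    epoch = max(0, min(epoch, total_epochs - 1))
    return (epoch * max_pairs) // (total_epochs - 1)
```

With `text_weight=0.0` the function falls back to the audio-only querying that the abstract attributes to existing methods, which makes it easy to compare the two retrieval modes on the same knowledge base.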

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy 95.25	461
Audio Captioning	AudioCaps (test)	CIDEr 84.5	222
Audio Classification	Urbansound8K	Accuracy 78.39	126
Audio Captioning	Clotho 2.1 (test)	SPICE 0.143	75
Audio Classification	GTZAN	Accuracy 68.07	65
Cross-modal retrieval	Clotho (test)	R@1 31	29
Cross-modal retrieval	AudioCaps (test)	R@1 59.1	23
Automated Audio Captioning	Clotho 2.1 (evaluation)	SPIDEr 31.9	12
Audio Captioning	Auto-ACD (test)	CIDEr 70.4	6
Audio Classification	TUT17	Accuracy 38.7	6

