Audio Retrieval with Natural Language Queries: A Benchmark Study

About

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset are publicly available at https://github.com/akoepke/audio-retrieval-benchmark.

A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie • 2021

Related benchmarks

Task	Dataset	Result
Text-to-Audio Retrieval	AudioCaps (test)	Recall@1 39.6	191
Audio-to-Text Retrieval	Clotho	R@1 7	49
Text-to-Audio Retrieval	Clotho	R@1 0.067	31
Cross-modal retrieval	Clotho (test)	R@1 7	29
Audio-to-Text Retrieval	SoundDescs	R@1 31.4	10
Text-to-Audio Retrieval	AudioCaps 1K 1.0 (test)	Recall@1 36.1	10
Text-to-Audio Retrieval	Clotho 1K 1.0 (test)	R@1 6.5	10
Audio-to-Text Retrieval	AudioCaps 1K 1.0 (test)	R@1 39.6	8
Audio-to-Text Retrieval	Clotho 1K 1.0 (test)	R@1 6.3	8
Text-to-Audio Retrieval	SoundDescs	Recall@1 30.7	4
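
The Recall@1 (R@1) figures in the table are the percentage of queries for which the correct item is ranked first among all candidates. The following is a minimal sketch of how Recall@K could be computed from a query-by-candidate similarity matrix. It is not the authors' evaluation code: the function name `recall_at_k`, the toy scores, and the assumption that query i has exactly one correct candidate at index i are all illustrative. Datasets with several captions per audio clip need a different matching rule.

```python
def recall_at_k(similarity, k):
    """Return the percentage of queries whose correct candidate is in the top k.

    Assumption (illustrative): similarity[i][j] scores query i against
    candidate j, and candidate i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct candidate = number of candidates scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with made-up scores: 3 text queries against 3 audio clips.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct clip is ranked first
    [0.8, 0.3, 0.1],  # query 1: correct clip is ranked second
    [0.2, 0.7, 0.6],  # query 2: correct clip is ranked second
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

In this toy case only one of the three queries ranks its match first, so `r1` is one third expressed as a percentage, while `r2` covers all three queries. Swapping the roles of rows and columns gives the opposite retrieval direction (audio-to-text instead of text-to-audio).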

