WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

About

Audio-language (AL) multimodal learning has advanced significantly in recent years. However, existing audio-language datasets are costly and time-consuming to collect, and are therefore limited in size. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the raw descriptions harvested online are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is used to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. Systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that WavCaps will facilitate research in audio-language multimodal learning and demonstrate the potential of using ChatGPT to enhance academic research. Our dataset and code are available at https://github.com/XinhaoMei/WavCaps.

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang • 2023

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy 97.2	461
Audio Captioning	AudioCaps (test)	CIDEr 78.7	222
Text-to-Audio Retrieval	AudioCaps (test)	Recall@1 42.2	191
Audio Classification	Urbansound8K	Accuracy 80.6	126
Musical Instrument Classification	NSynth	Accuracy 74.4	123
Audio Classification	ESC-50 (test)	Accuracy 94.25	111
Audio-to-Text Retrieval	Clotho (test)	R@1 26.9	92
Text-to-Audio Retrieval	Clotho (test)	R@1 21.11	85
Audio Classification	VGG-Sound	--	83
Audio Captioning	Clotho	CIDEr 48.8	82

Showing 10 of 62 rows
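
The retrieval rows above report Recall@1 (R@1), the share of queries for which the correct item is ranked first. As a minimal sketch of how such a score can be computed, the function below takes a query-by-candidate similarity matrix and checks whether the matching candidate lands in the top k. The function name `recall_at_k`, the toy scores, and the assumption that query i has exactly one correct candidate at index i are illustrative only; they are not taken from the WavCaps code, and real benchmarks such as AudioCaps pair each clip with several captions.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Return the fraction of queries whose correct candidate is in the top k.

    Assumption (illustrative): row i holds the scores of query i against all
    candidates, and the single correct candidate for query i is at index i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity matrix must not be empty")

    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

recall_1 = recall_at_k(toy_similarity, k=1)
recall_2 = recall_at_k(toy_similarity, k=2)
print(f"R@1 = {100 * recall_1:.1f}%, R@2 = {100 * recall_2:.1f}%")
```

The function returns a fraction between 0 and 1; the table reports the same quantity as a percentage.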
