Cacophony: An Improved Contrastive Audio-Text Model
About
Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then train a contrastive model with an auxiliary captioning objective with the audio encoder initialized from the MAE model. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.
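As a minimal sketch of the symmetric contrastive (InfoNCE) objective that audio-text models like this one optimize: each batch of paired audio and text embeddings is scored against all in-batch pairings, and the matched pairs on the diagonal are treated as positives. The function name, the temperature value, and the NumPy implementation below are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    temperature: illustrative default; real models often learn this scale.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(a))               # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()             # diagonal = positives

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched audio-text pairs should yield a lower loss than randomly paired batches, which is what drives the embeddings of corresponding audio and text toward each other.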
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 97.0 | 325 |
| Text-to-Audio Retrieval | AudioCaps (test) | R@1 | 41.0 | 145 |
| Audio Captioning | AudioCaps (test) | CIDEr | 72.8 | 140 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 26.5 | 78 |
| Audio Classification | SPC V2 | Accuracy | 92.2 | 65 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 55.3 | 62 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 20.2 | 62 |
| Audio Captioning | Clotho | CIDEr | 34.2 | 60 |
| Audio Classification | GTZAN | Accuracy | 85.0 | 54 |
| Audio Captioning | AudioCaps | CIDEr | 72.8 | 47 |