
Cacophony: An Improved Contrastive Audio-Text Model

About

Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then train a contrastive model with an auxiliary captioning objective with the audio encoder initialized from the MAE model. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.
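The contrastive stage described above is typically a CLIP-style symmetric InfoNCE objective over paired audio and text embeddings. The sketch below is illustrative only: the function name, batch shapes, and temperature value are assumptions for exposition, not details taken from the paper (the auxiliary captioning loss is omitted).

```python
import numpy as np

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss for a batch of paired
    audio/text embeddings; rows i of each matrix are a matched pair."""
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matched pairs sit on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the audio-to-text and text-to-audio retrieval losses.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

At inference time the same similarity matrix supports zero-shot classification: class names are embedded as text prompts and each audio clip is assigned to the nearest text embedding.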

Ge Zhu, Jordan Darefsky, Zhiyao Duan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio Classification | ESC-50 | Accuracy | 97 | 325 |
| Text-to-Audio Retrieval | AudioCaps (test) | R@1 | 41 | 145 |
| Audio Captioning | AudioCaps (test) | CIDEr | 72.8 | 140 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 26.5 | 78 |
| Audio Classification | SPC V2 | Accuracy | 92.2 | 65 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 55.3 | 62 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 20.2 | 62 |
| Audio Captioning | Clotho | CIDEr | 34.2 | 60 |
| Audio Classification | GTZAN | Accuracy | 85 | 54 |
| Audio Captioning | AudioCaps | CIDEr | 72.8 | 47 |

Showing 10 of 15 rows.
