Cacophony: An Improved Contrastive Audio-Text Model
About
Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then train a contrastive model with an auxiliary captioning objective with the audio encoder initialized from the MAE model. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.
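As a minimal sketch of the symmetric contrastive (InfoNCE) objective that audio-text models like this one optimize: each batch of paired audio and text embeddings is scored against all in-batch pairings, and the matched pairs on the diagonal are treated as positives. The function name, the temperature value, and the NumPy implementation below are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    temperature: illustrative default; real models often learn this scale.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(a))               # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()             # diagonal = positives

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched audio-text pairs should yield a lower loss than randomly paired batches, which is what drives the embeddings of corresponding audio and text toward each other.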
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 97.0 | 325 |
| Text-to-Audio Retrieval | AudioCaps (test) | R@1 | 41.0 | 145 |
| Audio Captioning | AudioCaps (test) | CIDEr | 72.8 | 140 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 26.5 | 78 |
| Audio Classification | SPC V2 | Accuracy | 92.2 | 65 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 55.3 | 62 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 20.2 | 62 |
| Audio Captioning | Clotho | CIDEr | 34.2 | 60 |
| Audio Classification | GTZAN | Accuracy | 85.0 | 54 |
| Audio Captioning | AudioCaps | CIDEr | 72.8 | 47 |