Exploring Train and Test-Time Augmentations for Audio-Language Learning

About

In this paper, we aim to unveil the impact of data augmentation in audio-language multi-modal learning, which has not been explored despite its importance. We explore various augmentation methods at both train time and test time and find that proper data augmentation can lead to substantial improvements. Specifically, our proposed audio-language paired augmentation, PairMix, the first multi-modal audio-language augmentation method, outperforms the baselines on both automated audio captioning and audio-text retrieval. To take full advantage of data augmentation, we also present multi-level test-time augmentation (Multi-TTA). Combining the two proposed methods with uni-modal augmentations, we achieve 47.5 SPIDEr on audio captioning, an 18.2% relative increase over the baseline. The proposed methods also improve performance in audio-text retrieval.
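The abstract gives the final SPIDEr score and the relative gain but not the baseline itself. As a rough sanity check, the baseline implied by those two figures can be backed out with a few lines of Python. The function names below are illustrative only, and the result is an estimate derived from rounded numbers, not a value reported on this page.

```python
def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Back out a baseline from a final score and a relative gain in percent.

    Assumes the gain is defined as (score - baseline) / baseline * 100.
    """
    return score / (1.0 + relative_gain_pct / 100.0)


def relative_gain_pct(baseline: float, score: float) -> float:
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0


# Figures quoted in the abstract: 47.5 SPIDEr, 18.2% relative increase.
baseline = implied_baseline(47.5, 18.2)
print(round(baseline, 1))         # 40.2 (implied baseline, an estimate)
print(round(47.5 - baseline, 1))  # 7.3 (implied absolute gain in SPIDEr points)
```

Because both quoted figures are rounded, the baseline actually reported in the paper may differ slightly from this estimate.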

Eungbeom Kim, Jinhee Kim, Yoori Oh, Kyungsu Kim, Minju Park, Jaeheon Sim, Jinwoo Lee, Kyogu Lee • 2022

Related benchmarks

Task	Dataset	Result
Text-to-Audio Retrieval	AudioCaps (test)	Recall@1 34.7	180
Audio Captioning	AudioCaps (test)	CIDEr 76.9	157
Audio Captioning	AudioCaps	CIDEr 76.9	66
Cross-modal retrieval	AudioCaps (test)	R@1 40.2	23
