
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

About

We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
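
The abstract does not spell out the CFG rescaling formula, so the following is only a minimal sketch of one widely used formulation: the guided prediction is rescaled so that its per-sample standard deviation matches that of the conditional prediction, then blended with the plain CFG output. The function name `cfg_rescale`, the parameters `guidance_scale` and `rescale`, and their default values are assumptions for illustration, not taken from the EzAudio code.

```python
import numpy as np

def cfg_rescale(cond, uncond, guidance_scale=5.0, rescale=0.75):
    """Illustrative classifier-free guidance with std-based rescaling.

    cond, uncond: model predictions with and without the text prompt,
    arrays of identical shape (batch, ...). All names and defaults here
    are hypothetical.
    """
    # Plain CFG: push the prediction away from the unconditional output.
    guided = uncond + guidance_scale * (cond - uncond)

    # Per-sample standard deviation over all non-batch axes.
    axes = tuple(range(1, cond.ndim))
    std_cond = cond.std(axis=axes, keepdims=True)
    std_guided = guided.std(axis=axes, keepdims=True)

    # Bring the guided output back to the scale of the conditional one.
    rescaled = guided * (std_cond / (std_guided + 1e-8))

    # Blend: rescale=0 gives plain CFG, rescale=1 gives full rescaling.
    return rescale * rescaled + (1.0 - rescale) * guided
```

With `rescale=0.0` the function reduces to ordinary CFG, which makes it easy to compare the two behaviours at a high guidance scale.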

Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Audio Generation | AudioCaps (test) | KL Divergence: 1.29 | 195
Speech Reconstruction | Seed-ZH | PESQ: 3.857 | 29
Audio Reconstruction | AudioSet (test) | Mel Distance (16kHz): 0.349 | 23
Audio Understanding | X-Ares | ASV2015: 91.44 | 21
Music Understanding | X-Ares | FMA Score: 30.74 | 19
Speech Understanding | X-Ares | CREMA-D Score: 38.8 | 19
Speech Reconstruction | Seed-TTS English | PESQ: 3.668 | 17
Music Reconstruction | MUSDB18 | Mel-16k Score: 0.259 | 16
Audio Understanding | Audio Understanding Evaluation Suite LS100h CD FSC LibCnt LSMF RAV VocS FMA GTZAN MT NSynth Clo DES ESC Urb8 | LS100h Score: 0.00e+0 | 13
Text-to-Audio Generation | TTA-Bench Accuracy | CE Score: 3.388 | 10
Showing 10 of 13 rows
