EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
About
We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores, enhancing prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy that leverages recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
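The CFG rescaling mentioned in point (2) can be illustrated with a minimal sketch. The abstract does not give EzAudio's exact formulation, so this assumes the commonly used form of the technique: compute the usual guided prediction, rescale it so its standard deviation matches that of the conditional prediction, then blend the rescaled and unrescaled results. The function name `cfg_rescale`, the parameter names, and the default values are hypothetical, and flat Python lists stand in for model output tensors.

```python
from statistics import pstdev

def cfg_rescale(cond, uncond, guidance_scale=5.0, rescale_phi=0.75):
    """Classifier-free guidance with std-matching rescale (illustrative only).

    cond / uncond: conditional and unconditional model predictions,
    given here as flat lists of floats (an assumption for simplicity).
    """
    # Standard CFG: push the prediction away from the unconditional output.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

    std_cond = pstdev(cond)
    std_guided = pstdev(guided)
    if std_guided == 0.0:
        # Nothing to rescale; avoid dividing by zero.
        return guided

    # Rescale so the guided prediction has the same spread as the
    # conditional prediction, countering the over-amplification that
    # large guidance scales tend to cause.
    factor = std_cond / std_guided
    rescaled = [g * factor for g in guided]

    # Blend: rescale_phi = 0 gives plain CFG, rescale_phi = 1 gives
    # the fully rescaled prediction.
    return [rescale_phi * r + (1.0 - rescale_phi) * g
            for r, g in zip(rescaled, guided)]

if __name__ == "__main__":
    cond = [0.5, -1.0, 2.0, 0.0]
    uncond = [0.1, -0.2, 0.3, 0.0]
    print(cfg_rescale(cond, uncond))
```

With `rescale_phi` between 0 and 1, the guidance direction is kept while the output magnitude is pulled back toward that of the conditional prediction, which is the intuition behind using a higher guidance scale without losing fidelity.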
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Reconstruction | AudioSet (test) | Mel Distance (16kHz) | 0.349 | 23 |
| Speech Reconstruction | Seed-ZH | PESQ | 3.857 | 21 |
| Audio Understanding | X-Ares | ASV2015 | 91.44 | 21 |
| Music Understanding | X-Ares | FMA Score | 30.74 | 19 |
| Speech Understanding | X-Ares | CREMA-D Score | 38.8 | 19 |
| Text-to-Audio Generation | TTA-Bench Accuracy | CE Score | 3.388 | 10 |
| Speech Reconstruction | Seed-TTS English | PESQ | 3.668 | 9 |
| Text-to-Audio | AudioSet Strong | F1 Event | 10.43 | 9 |
| Music Reconstruction | MUSDB18 | Mel-16k Score | 0.258 | 8 |
| Text-to-Audio | Text-to-Audio (test) | Loudness MAE | 11.03 | 7 |