VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
About
We present VoiceDiT, a multi-modal generative model that produces environment-aware speech and audio from text and visual prompts. Aligning speech with text is crucial for intelligibility, yet achieving this alignment in noisy conditions remains a significant and underexplored challenge. To address it, the VoiceDiT audio generation pipeline combines three key components: (1) a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning; (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions; and (3) a diffusion-based Image-to-Audio Translator that bridges the gap between audio and image, enabling the generation of environmental sound that matches the multi-modal prompts. Extensive experiments show that VoiceDiT outperforms previous models on real-world datasets, with significant improvements in both audio quality and modality integration.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | AudioCaps (test) | KL Divergence | 1.87 | 195 |
| Text-to-Speech | LibriTTS (test) | -- | -- | 16 |
| Environment-aware Text-to-Speech | AudioCaps (test) | WER | 11.68 | 11 |
| Environment-aware Text-to-Speech | Seed-TTS AudioCaps en (test) | WER | 7.08 | 6 |
| Intelligible Audio Generation | AC-Filtered (test) | CLAP Score | 0.22 | 6 |
| Environment-aware Text-to-Speech | Seed-TTS en and AudioCaps augmented (test) | WER | 7.08 | 5 |
| Environment-aware Text-to-Speech | Seed-TTS and AudioCaps en (test) | S-MOS | 3.15 | 4 |
| Environment-aware Text-to-Speech | LibriTTS and AudioCaps (test) | WER | 11.08 | 4 |
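
Several rows above report word error rate (WER), which is commonly obtained by transcribing the generated speech with a speech recognizer and comparing the transcript to the input text. This page does not say which recognizer or text normalization the benchmarks use, so the following is only a minimal sketch of the metric itself: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the normalization (lowercasing and whitespace splitting) are assumptions made for illustration, not the benchmark's actual evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; real evaluations usually also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One word dropped out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The function returns a fraction; if the WER figures in the table are percentages, as their magnitude suggests, the comparable value is this result multiplied by 100.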