VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

About

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.

Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Audio Generation | AudioCaps (test) | KL Divergence 1.87 | 195 |
| Text-to-Speech | LibriTTS (test) | -- | 16 |
| Environment-aware Text-to-Speech | AudioCaps (test) | WER 11.68 | 11 |
| Environment-aware Text-to-Speech | Seed-TTS AudioCaps en (test) | WER 7.08 | 6 |
| Intelligible Audio Generation | AC-Filtered (test) | CLAP Score 0.22 | 6 |
| Environment-aware Text-to-Speech | Seed-TTS en and AudioCaps augmented (test) | WER 7.08 | 5 |
| Environment-aware Text-to-Speech | Seed-TTS and AudioCaps en (test) | S-MOS 3.15 | 4 |
| Environment-aware Text-to-Speech | LibriTTS and AudioCaps (test) | WER 11.08 | 4 |
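
Several of the results above are reported as WER (word error rate). As a rough illustration of what that metric measures, here is a minimal sketch that computes WER as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions made for this example; the page does not specify the ASR system or text normalization used to produce the reported numbers, and it is only an assumption that those numbers are percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which may differ from the evaluated setup.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Lower is better: a WER of 0 means the recognized text matches the reference word for word, and the value can exceed 1 when the hypothesis contains many inserted words.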