VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

About

We present VoiceDiT, a multi-modal generative model that produces environment-aware speech and audio from text and visual prompts. Aligning speech with text is crucial for intelligibility, yet achieving this alignment in noisy conditions remains a significant and underexplored challenge. To address it, the VoiceDiT pipeline combines three key components: (1) a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning; (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions; and (3) a diffusion-based Image-to-Audio Translator that bridges the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experiments show that VoiceDiT outperforms previous models on real-world datasets, with significant improvements in both audio quality and modality integration.

Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Audio Generation	AudioCaps (test)	KL Divergence 1.87	213
Text-to-Speech	LibriTTS (test)	--	16
Environment-aware Text-to-Speech	AudioCaps (test)	WER 11.68	11
Environment-aware Text-to-Speech	Seed-TTS AudioCaps en (test)	WER 7.08	6
Intelligible Audio Generation	AC-Filtered (test)	CLAP Score 0.22	6
Environment-aware Text-to-Speech	Seed-TTS en and AudioCaps augmented (test)	WER 7.08	5
Environment-aware Text-to-Speech	Seed-TTS and AudioCaps en (test)	S-MOS 3.15	4
Environment-aware Text-to-Speech	LibriTTS and AudioCaps (test)	WER 11.08	4
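
Several rows above report WER (word error rate). This page does not say which speech recognizer or text normalization was used to obtain those numbers, so the following is only a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the percentage scaling are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; real evaluations usually apply heavier normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition a lower value is better, and the value can exceed 100 when the hypothesis contains many inserted words.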
