Duplex Diffusion Models Improve Speech-to-Speech Translation

About

Speech-to-speech translation is a typical sequence-to-sequence learning task that naturally has two directions. How can bidirectional supervision signals be leveraged effectively to produce high-fidelity audio in both directions? Existing approaches train either two separate models or a single multitask-learned model, with low efficiency and inferior performance. In this paper, we propose a duplex diffusion model that applies diffusion probabilistic models to both sides of a reversible duplex Conformer, so that either end can simultaneously input and output a distinct language's speech. Our model enables reversible speech translation by simply flipping the input and output ends. Experiments show that our model achieves the first success of reversible speech translation, with significant improvements in ASR-BLEU scores over a list of state-of-the-art baselines.
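The idea of translating in the opposite direction by "flipping the input and output ends" can be illustrated with additive coupling, a generic construction for exactly invertible layers. The sketch below is only an illustration under that assumption: it is not the paper's reversible duplex Conformer, the sub-functions `f` and `g` are hypothetical stand-ins for learned sub-networks, and the short lists stand in for speech feature vectors.

```python
import math

def f(v):
    # Hypothetical sub-network; any deterministic function works here.
    return [math.tanh(0.5 * a) for a in v]

def g(v):
    # Second hypothetical sub-network.
    return [0.25 * a * a for a in v]

def forward(x1, x2):
    # Source end -> target end: each half is shifted by a function of the other.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def reverse(y1, y2):
    # Target end -> source end: undo the two shifts in the opposite order.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

src = ([0.2, -1.0, 3.5], [1.5, 0.0, -0.7])  # toy stand-in for source features
tgt = forward(*src)                          # run the layer in one direction
rec = reverse(*tgt)                          # flip the ends to run it backwards
```

Because `reverse` subtracts exactly what `forward` added, `rec` matches `src` up to floating-point rounding, and no separate model is needed for the reverse direction.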

Xianchao Wu • 2023

Related benchmarks

Task	Dataset	Metric	Result	
Speech-to-speech translation	Fisher Spanish-English (test)	BLEU (Speech Input)	59.1	55
Speech-to-speech translation	Fisher Spanish-English (dev)	BLEU (Speech)	58.9	48
Speech-to-speech translation	Fisher Spanish-English (dev2)	ASR BLEU	59.8	36
Speech-to-speech translation	Europarl-ST En->Es (test)	ASR-BLEU	37.2	10
Speech-to-speech translation	MuST-C En-Es (test)	ASR-BLEU	34.5	10
Speech-to-speech translation	CoVOST-2 Es->En (test)	ASR-BLEU	37.1	10
Speech-to-speech translation	Europarl-ST Es->En (test)	ASR-BLEU	34	10
