Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

About

Speech signals carry information at multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming that speaker information is temporally more stable than content-induced variations. However, this assumption may introduce other temporally stable information, such as environment or emotion, into the speaker embeddings; we call this information style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without requiring style labels. Experimental results validate the proposed method's effectiveness in extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.
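
The temporal-stability assumption mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' model: it assumes, purely for illustration, that each frame vector is the sum of a constant per-utterance offset and zero-mean frame-level noise, and it uses plain time averaging in place of a learned encoder. The names `split_stable_and_varying` and `speaker_offset` are hypothetical.

```python
import random


def split_stable_and_varying(frames):
    """Split frame vectors into a time-averaged part and per-frame residuals.

    frames: list of equal-length lists of floats (one list per time frame).
    Returns (stable, varying), where stable is the mean over time and
    varying holds each frame minus that mean.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    stable = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    varying = [[f[d] - stable[d] for d in range(dim)] for f in frames]
    return stable, varying


# Toy data (assumption): a constant per-utterance component plus
# zero-mean noise standing in for content-induced variation.
random.seed(0)
speaker_offset = [1.0, -2.0, 0.5]
frames = [
    [s + random.gauss(0.0, 1.0) for s in speaker_offset]
    for _ in range(2000)
]

stable, varying = split_stable_and_varying(frames)
```

Under these assumptions, `stable` approximates `speaker_offset`, while `varying` keeps only the fast-changing part. The average absorbs anything that is constant over the utterance, so a fixed environment or emotion would land in `stable` as well, which is the leakage the abstract refers to as style.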

Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Agitation score prediction | Bridge2AI (speaker-independent CV) | Pearson Correlation (ρ): 0.089 | 21 |
| Agitation prediction | Bridge2AI-Voice 5-fold speaker-independent CV v3.0.0 | Pearson Correlation (ρ): 0.089 | 16 |
| Identity Leakage | Bridge2AI 120-speaker | Top-1 Accuracy: 28 | 11 |
| Anger detection | CREMA-D anger detection | AUC-ROC: 0.74 | 7 |