Universal Speech Content Factorization

About

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that USCF features, being training-efficient and timbre-disentangled, can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
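The abstract mentions two linear ingredients: a universal speech-to-content mapping fitted by least squares, and a speaker-specific transformation estimated from a small amount of target speech. The sketch below only illustrates that general idea on synthetic data and is not the authors' algorithm. The helper `fit_linear_map`, the variable names, the dimensions, and the assumption that content codes are available as regression targets are all hypothetical.

```python
import numpy as np


def fit_linear_map(inputs, targets):
    """Return W minimizing ||inputs @ W - targets||^2 (ordinary least squares)."""
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights


rng = np.random.default_rng(0)
n_frames, feat_dim, rank = 500, 16, 4  # hypothetical sizes, not from the paper

# Synthetic stand-ins (assumption): low-rank "content" codes, and speech
# features generated from them by a hidden linear map plus small noise.
content = rng.standard_normal((n_frames, rank))
hidden_map = rng.standard_normal((rank, feat_dim))
features = content @ hidden_map + 0.01 * rng.standard_normal((n_frames, feat_dim))

# Speech-to-content map: features (n, d) -> content (n, r).
to_content = fit_linear_map(features, content)

# Speaker-specific map back to feature space, fitted on only a few frames.
few = 50
to_speaker = fit_linear_map(content[:few], features[:few])

# Round trip: project to the low-rank content space, then back to features.
reconstruction = (features @ to_content) @ to_speaker
rel_err = np.linalg.norm(reconstruction - features) / np.linalg.norm(features)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this toy setting, swapping `to_speaker` for a map fitted on another speaker's frames would play the role of conversion. How USCF actually obtains its content targets and speaker transformations is described in the paper, not here.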

Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner • 2026

Related benchmarks

Task             | Dataset                                             | Result   | Rank
Voice Conversion | LibriSpeech (test-clean and test-other)             | WER 2.31 | 8
Voice Conversion | LibriSpeech (test-clean source, test-other target)  | MOS 3.66 | 7