Aliasing-Free Neural Audio Synthesis

About

In neural audio synthesis, neural vocoders and codecs are models that reconstruct waveforms from acoustic and latent representations, and they are therefore essential to the resulting audio quality. While current models can generate perceptually natural speech, they still struggle with high-fidelity music and singing voice synthesis, because the non-linear activation functions and upsampling layers in existing architectures introduce severe aliasing artifacts. Although various anti-aliasing techniques have been proposed in digital signal processing, their integration into neural vocoders and codecs remains under-explored. To bridge this gap, this paper incorporates differentiable anti-aliasing techniques into the activation and upsampling modules, and thus presents Pupu-Vocoder and Pupu-Codec. We build a test signal benchmark to evaluate the anti-aliased modules, and validate our proposed models on speech, singing voice, music, and audio. Experimental results show that Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech. Demos, code, and checkpoints are available at VocodexElysium.github.io/AliasingFreeNeuralAudioSynthesis/.
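To make the aliasing problem in the abstract concrete, here is a minimal sketch in plain Python. It is not the Pupu-Vocoder or Pupu-Codec implementation. It only illustrates the classic DSP idea of oversampling around a non-linearity, under several assumptions: a cubic function stands in for a neural activation, the test signal is a single 3 kHz sine at a 16 kHz sample rate, and the helper names (`dft`, `idft_real`, `fft_resample`, `amplitude_at`) are hypothetical. Cubing the sine creates a 9 kHz harmonic, which lies above the 8 kHz Nyquist limit and folds back to 7 kHz unless the non-linearity is applied at a higher sample rate and the result is low-pass filtered before downsampling.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short test signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def fft_resample(x, m):
    """Band-limited resampling of one period of a periodic signal to m samples.

    Zero-pads (upsampling) or truncates (downsampling) the spectrum.
    Simplification: the Nyquist bin of the shorter length is dropped.
    """
    n = len(x)
    X = dft(x)
    Y = [0j] * m
    half = (min(n, m) - 1) // 2      # highest bin index that is kept
    for k in range(half + 1):        # DC and positive frequencies
        Y[k] = X[k]
    for k in range(1, half + 1):     # negative frequencies
        Y[m - k] = X[n - k]
    scale = m / n                    # keeps sample amplitudes unchanged
    return [v * scale for v in idft_real(Y)]


def cubic(x):
    """Stand-in non-linearity (an assumption, not the paper's activation)."""
    return [v ** 3 for v in x]


def amplitude_at(x, fs, freq):
    """Amplitude of the sinusoidal component at `freq`, which must fall on a DFT bin."""
    n = len(x)
    k = round(freq * n / fs)
    c = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
    return 2 * abs(c) / n


FS = 16000   # sample rate in Hz (Nyquist limit: 8000 Hz)
N = 160      # 10 ms of signal, so DFT bins are 100 Hz apart
F0 = 3000    # test tone; cubing it creates a harmonic at 9000 Hz

x = [math.sin(2 * math.pi * F0 * t / FS) for t in range(N)]

# Naive path: apply the non-linearity directly at 16 kHz.
naive = cubic(x)

# Anti-aliased path: upsample 2x, apply the non-linearity at 32 kHz,
# then low-pass and downsample back to 16 kHz.
oversampled = fft_resample(x, 2 * N)
anti_aliased = fft_resample(cubic(oversampled), N)

alias_naive = amplitude_at(naive, FS, 7000)        # 9 kHz folded back to 7 kHz
alias_aa = amplitude_at(anti_aliased, FS, 7000)    # should be close to zero

print(f"7 kHz alias, naive:        {alias_naive:.6f}")
print(f"7 kHz alias, anti-aliased: {alias_aa:.6f}")
```

Since sin³θ = (3 sin θ − sin 3θ)/4, the naive path shows an alias of amplitude about 0.25 at 7 kHz, while the oversampled path keeps only the 3 kHz component at amplitude about 0.75. A factor of 2 is enough here only because the cubic produces nothing above the third harmonic. Activations that generate an unbounded series of harmonics reduce aliasing under oversampling but do not remove it entirely.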

Yicheng Gu, Junan Zhang, Chaoren Wang, Jerry Li, Zhizheng Wu, Lauri Juvela • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Analysis-synthesis | Music Academic | FAD 0.017 | 24 |
| Audio Synthesis | Singing Voice MUSHRA (evaluation) | MUSHRA Score 82.64 | 21 |
| Audio Synthesis | Singing Voice Academic setting | MOS Prediction Score 4.17 | 21 |
| Audio Synthesis | Singing Voice Industrial setting | MOS Prediction 4.33 | 21 |
| Analysis-synthesis | Music Industrial | FAD 0.033 | 12 |
| Analysis-synthesis | Audio Industrial | FAD 0.018 | 12 |
| Singing Voice Synthesis | Singing Voice Academic setting | MOS Prediction Score 4.17 | 11 |
| Singing Voice Synthesis | Singing Voice Industrial setting | MOS Prediction 4.33 | 11 |
| Speech Synthesis | Speech Academic Setting | MOS Prediction 3.61 | 11 |
| Speech Synthesis | Speech Industrial Setting | MOS Prediction 4.29 | 11 |