QuarkAudio Technical Report

About

Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. A unified framework that handles multiple tasks while providing robust instruction understanding, robust audio understanding, and high-quality audio generation is therefore a promising direction. Such a framework requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate mechanism and an extension of the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task-specific conditional information as the conditioning sequence of the decoder-only LM and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). In addition, we extend the downstream tasks to universal free-form audio editing guided by natural language instructions, including speech semantic editing and audio event editing. Experimental results show that H-Codec achieves high-quality audio reconstruction at a low frame rate, improving both the efficiency and the performance of downstream audio generation, and that QuarkAudio delivers performance competitive with state-of-the-art task-specific or multi-task systems across multiple tasks.
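The task-unification idea in the abstract (task-specific conditions as a prefix, target audio tokens predicted one at a time) can be illustrated with a minimal Python sketch. This is not the report's implementation: the token ids (`TASK_IDS`, `SEP`, `EOS`, `AUDIO_OFFSET`), the helper names `build_prefix` and `generate`, and the stand-in model `toy_next_token` are all hypothetical and chosen only for illustration. A real system would replace `toy_next_token` with a trained decoder-only LM and obtain the condition tokens from an audio tokenizer such as H-Codec.

```python
from typing import Callable, List, Sequence

# Hypothetical vocabulary layout (an assumption, not taken from the report).
TASK_IDS = {"SR": 0, "TSE": 1, "SS": 2, "VC": 3, "LASS": 4}
SEP = 5           # separates the conditioning prefix from the target tokens
EOS = 6           # end of the generated target sequence
AUDIO_OFFSET = 7  # discrete audio token k is stored as AUDIO_OFFSET + k

NextTokenFn = Callable[[Sequence[int]], int]


def build_prefix(task: str, condition_tokens: Sequence[int]) -> List[int]:
    """Build the conditioning sequence: task id, condition audio tokens, separator."""
    return [TASK_IDS[task]] + [AUDIO_OFFSET + t for t in condition_tokens] + [SEP]


def generate(prefix: Sequence[int], next_token: NextTokenFn,
             max_new_tokens: int) -> List[int]:
    """Autoregressive loop: each new token is predicted from everything before it."""
    seq = list(prefix)
    target: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(seq)
        if tok == EOS:
            break
        seq.append(tok)                    # feed the prediction back as context
        target.append(tok - AUDIO_OFFSET)  # map back to a codec token index
    return target


def toy_next_token(seq: Sequence[int]) -> int:
    """Stand-in for a trained LM: copies the condition tokens, then emits EOS."""
    sep = seq.index(SEP)
    condition = seq[1:sep]
    n_generated = len(seq) - sep - 1
    return condition[n_generated] if n_generated < len(condition) else EOS


prefix = build_prefix("SR", [3, 1, 4])
print(prefix)                                # [0, 10, 8, 11, 5]
print(generate(prefix, toy_next_token, 10))  # [3, 1, 4]
```

Under these assumptions, switching tasks only changes the prefix (the task id and the condition tokens), while the decoding loop stays the same; that is the sense in which a single decoder-only LM can serve several tasks.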

Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Xiaofu Chen, Bin Gong, Zheng Xue, Gang Song • 2025

Related benchmarks

Task                       Dataset                           Result               Rank
Speech Reconstruction      Librispeech (test-clean)          STOI 0.94            49
Speech Separation          Libri2Mix (test)                  --                   45
Audio Reconstruction       AudioSet (eval)                   Mel Distance 1.1125  35
Voice Conversion           VCTK                              WER 3.02             21
Speech Reconstruction      Seed-ZH                           PESQ 2.88            12
Speech Reconstruction      Seed EN                           PESQ 2.77            12
Packet Loss Concealment    ICASSP PLC-challenge 2022 (test)  PLCMOS Score 4.58    9
Target Speaker Extraction  Libri2Mix Clean (test)            DNSMOS SIG 3.62      9
Audio Reconstruction       MUSDB18 HQ (test)                 Mel Loss 0.5035      7
Speech Semantic Editing    Seed-zh (test)                    WER 11.275           3
Showing 10 of 12 rows