
QuarkAudio Technical Report

About

Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high-quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate mechanism and extending the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task-specific conditional information as the conditioning sequence of the decoder-only LM, and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). In addition, we extend downstream tasks to universal free-form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H-Codec achieves high-quality audio reconstruction with a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers competitive or comparable performance to state-of-the-art task-specific or multi-task systems across multiple tasks.
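The conditioning scheme described above (task-specific information as a prefix, followed by autoregressive prediction of discrete target audio tokens) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the special-token ids, the `build_prompt` and `generate` helpers, and the toy `toy_next_token` function that stands in for the decoder-only LM. None of it is taken from the QuarkAudio implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical special-token ids; the report's actual vocabulary is not given here.
TASK_TOKENS = {"SR": 0, "TSE": 1, "SS": 2, "VC": 3, "LASS": 4}
SEP, EOS = 5, 6
AUDIO_OFFSET = 7  # codec token ids are shifted past the special tokens


def build_prompt(task: str, condition_tokens: Sequence[int]) -> List[int]:
    """Prefix = task token + condition audio tokens (shifted) + separator."""
    return [TASK_TOKENS[task]] + [AUDIO_OFFSET + t for t in condition_tokens] + [SEP]


def generate(
    prompt: Sequence[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> List[int]:
    """Greedy AR loop: feed the growing sequence back in until EOS or the limit."""
    seq = list(prompt)
    target: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(seq)
        if tok == EOS:
            break
        seq.append(tok)
        target.append(tok - AUDIO_OFFSET)  # map back to codec token ids
    return target


def toy_next_token(seq: List[int]) -> int:
    """Stand-in for the LM: echoes the condition tokens in order, then emits EOS."""
    sep = seq.index(SEP)
    cond = seq[1:sep]
    n_generated = len(seq) - sep - 1
    return cond[n_generated] if n_generated < len(cond) else EOS


prompt = build_prompt("VC", [10, 11, 12])
tokens = generate(prompt, toy_next_token, max_new_tokens=10)
```

In a real system the returned token ids would be passed to the codec decoder to reconstruct a waveform; that step is omitted because the sketch assumes nothing about H-Codec's interface.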

Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Xiaofu Chen, Bin Gong, Zheng Xue, Gang Song • 2025

Related benchmarks

Task                       Dataset                           Result               Rank
Audio Reconstruction       AudioSet (eval)                   Mel Distance 1.1125  63
Speech Separation          Libri2Mix (test)                  --                   60
Speech Reconstruction      Librispeech (test-clean)          UT MOS 4.08          59
Speech Reconstruction      Seed-ZH                           PESQ 2.88            21
Voice Conversion           VCTK                              WER 3.02             21
Speech Reconstruction      Seed EN                           PESQ 2.77            12
Packet Loss Concealment    ICASSP PLC-challenge 2022 (test)  PLCMOS Score 4.58    9
Target Speaker Extraction  Libri2Mix Clean (test)            DNSMOS SIG 3.62      9
Audio Reconstruction       MUSDB18 HQ (test)                 Mel Loss 0.5035      7
Speech Semantic Editing    Seed-zh (test)                    WER 11.275           3
Showing 10 of 12 rows
