
MOSS-Audio Technical Report

About

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
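To make the time-marker idea concrete, here is a minimal sketch of how timestamp markers might be interleaved into a 12.5 Hz audio-token stream. Only the 12.5 Hz frame rate comes from the abstract. The marker format (`<|t=…s|>`), the 2-second marker interval, and the function names `num_audio_frames` and `insert_time_markers` are illustrative assumptions, not the report's actual tokens or code.

```python
FRAME_RATE_HZ = 12.5  # encoder frame rate stated in the abstract


def num_audio_frames(duration_s: float) -> int:
    """Number of encoder frames for a clip of the given duration."""
    return int(duration_s * FRAME_RATE_HZ)


def insert_time_markers(frames, marker_interval_s: float = 2.0):
    """Interleave timestamp markers into a sequence of audio frames.

    The marker string format and the default interval are hypothetical;
    the report does not specify them in this abstract.
    """
    frames_per_marker = max(1, int(marker_interval_s * FRAME_RATE_HZ))
    stream = []
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            # Emit a marker carrying the elapsed time of the next frame.
            stream.append(f"<|t={i / FRAME_RATE_HZ:.1f}s|>")
        stream.append(frame)
    return stream


# Usage: a 4-second clip yields 50 frames and, under these assumptions,
# two markers (at 0.0 s and 2.0 s).
frames = [f"a{i}" for i in range(num_audio_frames(4.0))]
stream = insert_time_markers(frames)
```

Under this scheme the decoder sees an explicit time reference every 25 frames, which is one plausible way to support timestamped transcription and time-aware question answering without having to infer elapsed time from token position alone.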

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Jun Zhan, Kang Yu, Kexin Huang, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Yang Gao, Yiyang Zhang, Xipeng Qiu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Audio Understanding | MMAU | Accuracy | 77.64 | 54
Audio Understanding | MMAU-Pro | Average Score | 64.92 | 42
Audio Reasoning | MMAR | Average Accuracy | 66.53 | 38
Speech Understanding | MMSU | Accuracy | 75.52 | 16
Automatic Speech Recognition | ASR summary results 12 evaluation dimensions | Health Condition Error Rate | 19.18 | 11
Speech captioning | Speech Captioning (Evaluation Set) | Gender Score | 4.697 | 7
Timestamp ASR | Librispeech (EN) | AAS | 131.6 | 5
Timestamp ASR | AISHELL-1 zh | AAS Score | 76.96 | 4
