Kimi-Audio Technical Report
About
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
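To give a sense of scale for the 12.5Hz tokenizer and the 13-million-hour corpus, the following back-of-the-envelope sketch converts audio duration into a discrete token count. It assumes one token per tokenizer frame, which the abstract does not state explicitly; the helper `tokens_for_duration` is a hypothetical name used only for illustration, not part of the released code.

```python
# Rough token-count arithmetic for a 12.5 Hz audio tokenizer.
# Assumption (not confirmed by the abstract): one discrete token per frame.

FRAME_RATE_HZ = 12.5  # tokenizer rate stated in the abstract


def tokens_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of audio tokens for a clip of the given length."""
    return int(seconds * frame_rate_hz)


# One hour of audio at 12.5 tokens per second.
tokens_per_hour = tokens_for_duration(3600)

# "More than 13 million hours" is treated here as a lower bound.
pretraining_hours = 13_000_000
total_tokens = tokens_per_hour * pretraining_hours

print(tokens_per_hour)        # 45000
print(f"{total_tokens:.3e}")  # 5.850e+11
```

Under these assumptions, the pre-training audio would correspond to at least roughly 585 billion audio tokens. The true figure depends on details the abstract does not give, such as how many tokens are emitted per frame.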
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 1.28 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 2.4 | 1151 |
| Instruction Following | IFEval | -- | 625 |
| Emotion Recognition | IEMOCAP | Accuracy 57.72 | 115 |
| Speech Recognition | Librispeech other (test) | WER 2.42 | 105 |
| Automatic Speech Recognition | LibriSpeech Other | WER 2.42 | 96 |
| Automatic Speech Recognition | Librispeech (test-clean) | WER 2.1 | 84 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | CER 6.38 | 78 |
| Automatic Speech Recognition | WenetSpeech Net (test) | -- | 57 |
| Instruction Following | IFEval (test) | IFEval Score 47.91 | 55 |
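
The WER and CER figures in the table are error rates based on edit distance, presumably reported as percentages. The sketch below shows the standard definition: substitutions, deletions, and insertions divided by the reference length. It is a minimal illustration only. The function names are hypothetical, no text normalization is applied, and it is not the scoring code from the Kimi-Audio evaluation toolkit.

```python
# Minimal word/character error rate sketch using Levenshtein distance.
# Illustrative only: real evaluations usually normalize case, punctuation,
# and numerals before scoring, which this sketch does not do.

def edit_distance(ref: list, hyp: list) -> int:
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, with whitespace-delimited words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```

CER is typically used for Mandarin test sets such as WenetSpeech, where word boundaries are not marked by whitespace, while WER is used for English sets such as LibriSpeech.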