Kimi-Audio Technical Report
About
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 2.57 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.28 | 833 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 2.1 | 84 |
| Automatic Speech Recognition | LibriSpeech Other | WER | 2.42 | 75 |
| Emotion Recognition | IEMOCAP | Accuracy | 57.72 | 71 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | CER | 6.38 | 45 |
| Instruction Following | IFEval (test) | IFEval Score | 47.91 | 45 |
| Automatic Speech Recognition | GigaSpeech (test) | WER | 16.06 | 40 |
| Audio Understanding | MMAU v05.15.25 (test-mini) | Sound Score | 75.68 | 28 |
| Audio Understanding | MMAU (test) | Speech Score | 62.16 | 25 |
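
Most of the speech recognition rows above report word error rate (WER). As a reading aid, the sketch below shows the standard definition of WER: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. It assumes the table values are percentages, and the function name `word_error_rate` is hypothetical. It is not the evaluation toolkit released with Kimi-Audio, and it omits the text normalization that benchmark scoring typically applies before comparison.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using whitespace tokenization (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Character error rate (CER), reported for WenetSpeech, follows the same recipe with characters in place of words. Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.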