MiMo-Audio: Audio Language Models are Few-Shot Learners
About
Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER 3.5 | 84 |
| Automatic Speech Recognition | LibriSpeech Other | WER 16.22 | 75 |
| Automatic Speech Recognition | LibriSpeech Clean | WER 3.5 | 57 |
| Text-to-Speech | Seed-TTS zh (test) | WER 0.0196 | 47 |
| Text-to-Speech | Seed-TTS (eval) | WER 5.37 | 39 |
| Audio Reconstruction | AudioSet (eval) | Mel Distance 0.66 | 35 |
| Audio Understanding | MMAU v05.15.25 (test-mini) | Sound Score 81.68 | 28 |
| Audio Understanding | MMAU v05.15.25 (test) | Sound Score 81.68 | 28 |
| Audio Reconstruction | MusicDB (test) | Mel Distance 0.65 | 28 |
| Speech Reconstruction | AISHELL-2 Chinese | SIM 0.85 | 27 |