Ming-Omni: A Unified Multimodal Model for Perception and Generation
About
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, supporting diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved by integrating an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which together also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
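
The routing idea described above can be illustrated with a toy sketch. It is not the Ming-Omni or Ling implementation: the class name `ModalityRoutedMoE`, the softmax top-k gating, the single-matrix experts, and all sizes are assumptions chosen for brevity. The only idea taken from the abstract is that tokens from every modality share one pool of experts while each modality has its own router.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ModalityRoutedMoE:
    """Hypothetical toy MoE layer: shared experts, one router per modality."""

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        # Shared expert pool; each expert is a single linear map (assumption).
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # One router weight matrix per modality (the modality-specific part).
        self.routers = {m: rand_matrix(num_experts, dim) for m in modalities}

    def route(self, token, modality):
        """Return [(expert_index, gate_weight)] for the top-k experts."""
        probs = softmax(matvec(self.routers[modality], token))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = top[: self.top_k]
        norm = sum(probs[i] for i in top)
        # Renormalise so the selected gates sum to 1.
        return [(i, probs[i] / norm) for i in top]

    def forward(self, token, modality):
        """Gate-weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for idx, gate in self.route(token, modality):
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out


moe = ModalityRoutedMoE(dim=4, num_experts=4, modalities=["text", "image", "audio"])
token = [0.5, -1.0, 0.25, 2.0]
print(moe.route(token, "text"))   # routing decision from the text router
print(moe.route(token, "audio"))  # same token, routed by the audio router
```

Because each modality has its own router weights, the same token vector can be sent to different experts depending on its source modality, while the expert parameters themselves stay shared.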
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 3.54 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.9 | 833 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | CER | 5.96 | 45 |
| Instruction Following | IFEval (test) | IFEval Score | 53.68 | 45 |
| Audio Understanding | MMAR (test) | Performance | 45.4 | 20 |
| Automatic Speech Recognition | Fleurs en (test) | WER | 5.82 | 17 |
| Paralinguistic | MMAU (test) | Performance | 63.52 | 12 |
| Knowledge | OpenBookQA (test) | Accuracy | 69.67 | 11 |
| Knowledge | MMSU (test) | Performance | 47 | 11 |
| Automatic Speech Recognition | WenetSpeech (test-net) | CER | 6.26 | 10 |
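
For readers unfamiliar with the speech recognition metrics in the table, the sketch below shows how word error rate (WER) and character error rate (CER) are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. It assumes the table values are percentages, and it applies no text normalization; the benchmarks' actual scoring scripts may normalize casing, punctuation, or numerals differently. The function names and example sentences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))
# One substituted character over six reference characters.
print(round(cer("今天天气很好", "今天天汽很好"), 2))
```

CER is typically reported for Mandarin datasets such as WenetSpeech because word boundaries are not marked by whitespace, whereas WER is the usual choice for English datasets such as LibriSpeech.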