Baichuan-Omni-1.5 Technical Report
About
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any single modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B tokens of high-quality data spanning text, audio, and vision. Second, we designed an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT-4o-mini and MiniCPM-o 2.6) in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.3 | 1455 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 2.54 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 5.63 | 1151 |
| Multimodal Understanding | MMBench | Accuracy | 85.6 | 637 |
| Multimodal Understanding | MMMU | Accuracy | 53.9 | 437 |
| Video Understanding | MVBench | Accuracy | 63.7 | 425 |
| Video Question Answering | ActivityNet-QA | Accuracy | 62.0 | 376 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER | 2.37 | 97 |
| Multimodal Perception | MME Perception | Perception Score | 1630 | 79 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | CER | 9.86 | 78 |
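For reference, the WER (word error rate) and CER (character error rate) scores in the speech-recognition rows are edit-distance metrics: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of the standard computation (not the benchmarks' official scoring scripts):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table, d[i][j] = distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: the same distance over characters, used for Chinese sets
    such as AISHELL-1 and WenetSpeech where word boundaries are absent."""
    return word_error_rate(" ".join(reference), " ".join(hypothesis))
```

Note that both metrics can exceed 1.0 when the hypothesis contains many insertions; benchmark tables usually report them as percentages (e.g. WER 2.54 means 2.54%).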