Baichuan-Omni-1.5 Technical Report
About
We introduce Baichuan-Omni-1.5, an omni-modal model that combines omni-modal understanding with end-to-end audio generation. To achieve fluent, high-quality interaction across modalities without compromising the capabilities of any single modality, we prioritized three key aspects. First, we established a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B of high-quality data (text, audio, and vision). Second, we designed an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with multimodal large language models (MLLMs). Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Question Answering | MMAR | Sd Score | 41.21 | 17 |
| Long-Form Speech Understanding | AudioMarathon 1.0 (test) | Average Score | 38.4 | 16 |
| Speech Content Extraction | AudioMarathon 1.0 (test) | SER | 12.4 | 15 |
| Speaker Information Modeling | AudioMarathon 1.0 (test) | SD (Score) | 49.2 | 15 |
| Audio Classification | AudioMarathon 1.0 (test) | SED Score | 45.7 | 15 |
| Spoken Intelligence Evaluation | LLM_Voice 1.0 (test) | Remembering Score | 44.8 | 13 |
| Audio Understanding | MMAR | MMAR | 40.7 | 12 |
| Audio-Video Understanding | OmniVideoBench | Latency (0-1s Bin) | 28.92 | 9 |
| Spoken Question Answering | UltraEval-Audio S2S | AlpacaEval Score | 0.5869 | 9 |
| Speech-to-Text Spoken Question Answering | OpenAudioBench S2T (test) | AlpacaEval Score | 77.9 | 7 |