Ola: Pushing the Frontiers of Omni-Modal Language Model
About
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, substantially pushing the frontier of omni-modal language models. We conduct a comprehensive exploration of the architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER (%) | 1.9 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER (%) | 4.2 | 1151 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER (%) | 4.4 | 462 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 1.9 | 340 |
| Audio-visual understanding | DailyOmni | Average Score | 54.1 | 69 |
| Audio-visual understanding | WorldSense | Accuracy | 44.7 | 42 |
| Audio-visual understanding | Daily-Omni | Accuracy | 50.71 | 27 |
| Multimodal Future Prediction | FutureOmni 1.0 (Overall) | Accuracy (Cartoon) | 44.44 | 20 |
| Audio-visual understanding | IntentBench | Accuracy | 60.3 | 20 |
| Automatic Speech Recognition | LRS3 | WER (%) | 4.7 | 18 |
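The speech-recognition rows above are scored by Word Error Rate (WER), the word-level edit distance between the model's transcript and the reference, divided by the reference length. As a minimal sketch (not the benchmark's official scorer, which typically also applies text normalization), WER can be computed like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6, i.e. ~16.7%
print(round(wer("the cat sat on the mat", "the cat sat on mat") * 100, 1))
```

A WER of 1.9 on LibriSpeech test-clean therefore means roughly 1.9 word-level errors per 100 reference words.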