Ola: Pushing the Frontiers of Omni-Modal Language Model
About
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, substantially pushing the frontier of omni-modal language models. We conduct a comprehensive exploration of the architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER (%) | 1.9 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER (%) | 4.2 | 1151 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER (%) | 4.4 | 462 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 1.9 | 340 |
| Audio-visual understanding | DailyOmni | Average Score | 54.1 | 69 |
| Audio-visual understanding | WorldSense | Accuracy | 44.7 | 42 |
| Audio-visual understanding | Daily-Omni | Accuracy | 50.71 | 27 |
| Multimodal Future Prediction | FutureOmni 1.0 (Overall) | Accuracy (Cartoon) | 44.44 | 20 |
| Audio-visual understanding | IntentBench | Accuracy | 60.3 | 20 |
| Automatic Speech Recognition | LRS3 | WER (%) | 4.7 | 18 |
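The speech-recognition rows above are scored by Word Error Rate (WER), the word-level edit distance between the model's transcript and the reference, divided by the reference length. As a minimal sketch (not the benchmark's official scorer, which typically also applies text normalization), WER can be computed like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6, i.e. ~16.7%
print(round(wer("the cat sat on the mat", "the cat sat on mat") * 100, 1))
```

A WER of 1.9 on LibriSpeech test-clean therefore means roughly 1.9 word-level errors per 100 reference words.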