Ola: Pushing the Frontiers of Omni-Modal Language Model

About

Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, substantially pushing the frontiers of omni-modal language models. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.

Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao · 2025

Related benchmarks

Task                          | Dataset                   | Metric             | Result | Rank
Automatic Speech Recognition  | LibriSpeech clean (test)  | WER                | 1.9    | 1156
Automatic Speech Recognition  | LibriSpeech (test-other)  | WER                | 4.2    | 1151
Automatic Speech Recognition  | LibriSpeech (dev-other)   | WER                | 4.4    | 462
Automatic Speech Recognition  | LibriSpeech (dev-clean)   | WER (%)            | 1.9    | 340
Audio-visual understanding    | DailyOmni                 | Average Score      | 54.1   | 69
Audio-visual understanding    | WorldSense                | Accuracy           | 44.7   | 42
Audio-visual understanding    | Daily-Omni                | Accuracy           | 50.71  | 27
Multimodal Future Prediction  | FutureOmni 1.0 (Overall)  | Accuracy (Cartoon) | 44.44  | 20
Audio-visual understanding    | IntentBench               | Accuracy           | 60.3   | 20
Audio Speech Recognition      | LRS3                      | WER                | 4.7    | 18

(Showing 10 of 19 benchmark rows.)
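For reference, the WER values reported for the LibriSpeech and LRS3 rows are word error rates: the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. A minimal sketch of the standard computation (not the authors' evaluation code, which is not shown here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[-1][-1] / len(ref)

# One deleted word out of six reference words -> WER ~= 0.167 (16.7%)
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))
```

Leaderboard numbers such as "WER 1.9" are this quantity expressed as a percentage over the whole test set; in practice libraries like jiwer are typically used rather than hand-rolled dynamic programming.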
