Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

About

We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new state-of-the-art performance on the recent MMAU and MMAR benchmarks. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, on both the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the improvement from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery: fine-tuning without audio on a text-only dataset was effective at improving audio-based performance.

Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass • 2025
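
The abstract names GRPO as the fine-tuning method but does not spell out how it scores sampled answers. As a rough illustration only, the sketch below shows the group-relative advantage that GRPO is generally described as using: several answers are sampled for one question, each gets a reward, and each reward is normalized against the mean and standard deviation of its own group. The exact-match reward, the function names, and the sample answers are assumptions made for this example; they are not taken from the Omni-R1 paper.

```python
from statistics import mean, pstdev


def exact_match_reward(predicted: str, correct: str) -> float:
    """Hypothetical reward: 1.0 if the chosen option matches, else 0.0."""
    return 1.0 if predicted.strip().lower() == correct.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other samples in the same group.

    Assumes the common formulation A_i = (r_i - mean(r)) / (std(r) + eps);
    eps avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one multiple-choice question.
sampled_answers = ["B", "A", "B", "C"]
correct_answer = "B"

rewards = [exact_match_reward(a, correct_answer) for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, reward, advantage in zip(sampled_answers, rewards, advantages):
    print(f"answer={answer} reward={reward:.1f} advantage={advantage:+.3f}")
```

Correct answers in the group receive a positive advantage and incorrect ones a negative advantage, so the policy update favors the better samples without needing a separate value model. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.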

Related benchmarks

Task | Dataset | Metric | Result | Rank
Audio Reasoning | MMAR (test) | Average Score | 61.2 | 57
Audio Understanding | MMAU v05.15.25 (test-mini) | Sound Score | 81.7 | 54
Audio Question Answering | MMAR | Average Score | 63.46 | 47
Audio Reasoning | MMAR | Average Accuracy | 63.4 | 38
Audio Understanding | MMAU (test) | -- | -- | 25
Audio Understanding | MMAU mini original (test) | Accuracy (Sound Domain) | 73.6 | 21
Audio Understanding | MMAU mini (test) | Accuracy | 77 | 20
Audio Reasoning | MMAU mini 1.0 (test) | Sound Score | 81.7 | 15
Multimodal Audio Understanding | MMAU Mini | Sound Score | 81.7 | 13
Audio Perception and Reasoning | MMAR within CAFE framework (overall) | Perception Accuracy | 51.21 | 13