
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

About

Reasoning models have made remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., to allow reasoning over audio-visual data, remains a challenge, in part because high-quality reasoning data is scarce for targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision- and audio-reasoning traces via models specialized to reason over their respective modalities and merge the resulting traces with an LLM merger model. The merged multimodal traces are used in a supervised fine-tuning (SFT) cold start that first adapts the target model to audio-visual reasoning traces; the model is then trained in a second, reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size on benchmarks including OmniBench and DailyOmni for audio-visual reasoning and MMAR for audio-only reasoning, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.
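
The trace-generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Sample`, `MergedTrace`, and `build_cold_start_data`, and the call signatures of the two teachers and the merger, are all assumptions made for illustration. The teachers and the merger are passed in as plain callables so that any model wrapper can be substituted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One audio-visual question (hypothetical schema)."""
    video_path: str
    audio_path: str
    question: str


@dataclass
class MergedTrace:
    """Single-modality traces plus the merged audio-visual trace."""
    sample: Sample
    vision_trace: str
    audio_trace: str
    merged_trace: str


def build_cold_start_data(
    samples: Iterable[Sample],
    vision_teacher: Callable[[str, str], str],  # assumed: (video_path, question) -> trace
    audio_teacher: Callable[[str, str], str],   # assumed: (audio_path, question) -> trace
    merger: Callable[[str, str, str], str],     # assumed: (question, v_trace, a_trace) -> trace
) -> List[MergedTrace]:
    """Generate independent per-modality traces, then merge them."""
    traces = []
    for s in samples:
        # Each teacher reasons over its own modality only.
        v = vision_teacher(s.video_path, s.question)
        a = audio_teacher(s.audio_path, s.question)
        # The merger combines both traces into one audio-visual trace.
        m = merger(s.question, v, a)
        traces.append(MergedTrace(s, v, a, m))
    return traces
```

In this sketch, the returned `merged_trace` strings would serve as targets for the SFT cold start; the subsequent reinforcement learning stage is outside its scope.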

Edson Araujo, Saurabhchand Bhati, M. Jehanzeb Mirza, Brian Kingsbury, Samuel Thomas, Rogerio Feris, James R. Glass, Hilde Kuehne • 2026

Related benchmarks

Task                              Dataset          Result                  Rank
Audio-Visual Question Answering   AVQA             Accuracy 91.1           85
Audio-visual understanding        Daily-Omni       Accuracy 54.4           58
Video Reasoning                   Video-MME        --                      55
Audio Reasoning                   MMAR             Average Accuracy 59.1   38
Audio-Visual Reasoning            OmniBench        Accuracy 57.1           16
Audio-Visual Reasoning            Riva Academic    Accuracy 50.7           9
Audio-Visual Reasoning            Riva (StandUp)   Accuracy 75.3           9
Audio Reasoning                   MMAU             Accuracy 75.4           7
