
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

About

Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial interpretations and refining responses when facing ambiguous multimodal inputs. These results suggest that lightweight reinforcement learning fine-tuning enhances cross-modal reasoning in MLLMs. EchoInk-R1 is the first framework to unify audio, visual, and textual modalities for general open-world reasoning via reinforcement learning. Code and data are publicly released to facilitate further research.
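As a rough illustration of the group-relative step that gives GRPO its name, the sketch below scores a group of sampled responses to one multiple-choice question and normalizes each reward against the group's mean and standard deviation. Everything specific here is an assumption made for illustration and is not taken from the EchoInk-R1 code: the `<answer>…</answer>` tag format, the binary accuracy reward, the function names, and the use of the population standard deviation.

```python
import re
from statistics import mean, pstdev


def accuracy_reward(response: str, correct_choice: str) -> float:
    """Hypothetical reward: 1.0 if the tagged answer letter matches, else 0.0."""
    # Assumed output format: the model wraps its final choice in <answer> tags.
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    return 1.0 if match and match.group(1) == correct_choice else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for one question whose answer is "B".
responses = [
    "The sound is a dog barking. <answer>B</answer>",
    "It looks like a cat. <answer>A</answer>",
    "On reflection, the audio suggests a dog. <answer>B</answer>",
    "I am not sure.",
]
rewards = [accuracy_reward(r, "B") for r in responses]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to supply a baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.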

Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, Pheng-Ann Heng • 2025

Related benchmarks

Task                         Dataset        Result                Rank
Video Understanding          MVBench        --                    425
Video Understanding          Video-MME      Overall Score 60.8    92
Audio-visual understanding   DailyOmni      Average Score 46.2    69
Video Understanding          LVBench        Average Score 37.6    67
Audio-visual understanding   WorldSense     Accuracy 45.7         42
Video Reasoning              Video-Holmes   Score 42.5            34
Audio-visual understanding   IntentBench    Accuracy 63.6         20
Video Understanding          TOMATO         Score 29.9            18
Audio-Video Understanding    AV-Counting    Primary Score 22.7    10
Audio-Video Understanding    AV-Odyssey     Score 31.1            10
Showing 10 of 11 rows
