
Active Perception Agent for Omnimodal Audio-Video Understanding

About

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they still struggle with fine-grained cross-modal understanding and multimodal alignment. To address these limitations, we introduce OmniAgent, to the best of our knowledge the first fully active perception agent that dynamically orchestrates specialized unimodal tools to achieve more fine-grained omnimodal reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, we demonstrate a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and closed-source models by substantial margins of 10%-20% accuracy, without any training.
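The coarse-to-fine audio-guided loop described above can be sketched as follows. This is a toy illustration only: the data structures and tool functions (`detect_audio_events`, `caption_segment`) are hypothetical stand-ins for the specialized unimodal tools the agent orchestrates, not the paper's actual implementation.

```python
def detect_audio_events(audio_track, keywords):
    """Coarse stage: scan audio events for task-relevant cues,
    returning only the temporally localized segments that match."""
    return [(start, end, label) for (start, end, label) in audio_track
            if any(k in label for k in keywords)]

def caption_segment(video_frames, start, end):
    """Fine stage: stand-in for a specialized visual tool invoked
    only on the audio-localized interval (no dense captioning)."""
    return [desc for (t, desc) in video_frames if start <= t <= end]

def omni_agent(video_frames, audio_track, question_keywords):
    # Audio cues localize candidate events first (coarse); visual tools
    # are then applied only to those intervals (fine), concentrating
    # perceptual attention on task-relevant cues.
    evidence = []
    for start, end, label in detect_audio_events(audio_track, question_keywords):
        evidence.append({
            "interval": (start, end),
            "audio": label,
            "visual": caption_segment(video_frames, start, end),
        })
    return evidence

# Toy media: (timestamp, description) frames and (start, end, label) audio events.
frames = [(0, "empty street"), (5, "dog enters"), (6, "dog jumps"), (12, "car passes")]
audio = [(4, 7, "dog barking"), (11, 13, "engine noise")]

evidence = omni_agent(frames, audio, question_keywords=["dog"])
```

In this sketch, only the 4-7s interval flagged by the barking audio is sent to the visual tool; the reasoning model would then answer from the collected evidence, optionally planning further tool calls if the evidence is insufficient.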

Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Audio-visual understanding | DailyOmni | Average Score: 82.71 | 49 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy: 61.2 | 18 |
| Audio-Video Understanding | OmniVideoBench | Latency (0-1s Bin): 66.08 | 9 |
