
Active Perception Agent for Omnimodal Audio-Video Understanding

About

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they still struggle with fine-grained cross-modal understanding and multimodal alignment. To address these limitations, we introduce OmniAgent, to the best of our knowledge the first fully active perception agent that dynamically orchestrates specialized unimodal tools to achieve more fine-grained omnimodal reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, we demonstrate a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and closed-source models by substantial margins of 10%-20% accuracy, without any training.
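The coarse-to-fine audio-guided loop described above can be sketched as follows. This is a toy illustration only: the data structures and tool functions (`detect_audio_events`, `caption_segment`) are hypothetical stand-ins for the specialized unimodal tools the agent orchestrates, not the paper's actual implementation.

```python
def detect_audio_events(audio_track, keywords):
    """Coarse stage: scan audio events for task-relevant cues,
    returning only the temporally localized segments that match."""
    return [(start, end, label) for (start, end, label) in audio_track
            if any(k in label for k in keywords)]

def caption_segment(video_frames, start, end):
    """Fine stage: stand-in for a specialized visual tool invoked
    only on the audio-localized interval (no dense captioning)."""
    return [desc for (t, desc) in video_frames if start <= t <= end]

def omni_agent(video_frames, audio_track, question_keywords):
    # Audio cues localize candidate events first (coarse); visual tools
    # are then applied only to those intervals (fine), concentrating
    # perceptual attention on task-relevant cues.
    evidence = []
    for start, end, label in detect_audio_events(audio_track, question_keywords):
        evidence.append({
            "interval": (start, end),
            "audio": label,
            "visual": caption_segment(video_frames, start, end),
        })
    return evidence

# Toy media: (timestamp, description) frames and (start, end, label) audio events.
frames = [(0, "empty street"), (5, "dog enters"), (6, "dog jumps"), (12, "car passes")]
audio = [(4, 7, "dog barking"), (11, 13, "engine noise")]

evidence = omni_agent(frames, audio, question_keywords=["dog"])
```

In this sketch, only the 4-7s interval flagged by the barking audio is sent to the visual tool; the reasoning model would then answer from the collected evidence, optionally planning further tool calls if the evidence is insufficient.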

Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Audio-visual understanding | DailyOmni | Average Score: 82.71 | 49 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy: 61.2 | 18 |
| Audio-Video Understanding | OmniVideoBench | Latency (0-1s Bin): 66.08 | 9 |
