
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

About

Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
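The GRPO training mentioned above scores a group of sampled answers per question and normalizes each reward against the group's statistics, with no value network. The sketch below illustrates this for a counting reward; the exact-match-plus-closeness reward function is an illustrative assumption, not the paper's actual reward design.

```python
# Minimal sketch of GRPO-style group-relative advantages for a counting
# reward. The reward shape (exact match plus partial credit for near
# misses) is a hypothetical stand-in for the paper's reward design.

def counting_reward(pred: int, target: int) -> float:
    """1.0 for an exact count, linearly decaying credit for near misses."""
    if pred == target:
        return 1.0
    return max(0.0, 1.0 - abs(pred - target) / max(target, 1))

def grpo_advantages(preds: list[int], target: int) -> list[float]:
    """Normalize each sample's reward against its group's mean and std,
    as in Group Relative Policy Optimization (no learned critic)."""
    rewards = [counting_reward(p, target) for p in preds]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:  # all samples tied: no learning signal for this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Samples that count closer to the ground truth than their group's average receive a positive advantage and are reinforced; a group where every sample agrees yields zero advantage everywhere.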

Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | -- | 425 |
| Video Understanding | Video-MME | Overall Score: 56.8 | 92 |
| Audio-visual understanding | DailyOmni | Average Score: 53.8 | 69 |
| Video Understanding | LVBench | Average Score: 33.7 | 67 |
| Audio-visual understanding | WorldSense | Accuracy: 44.6 | 42 |
| Video Reasoning | Video-Holmes | Score: 39.6 | 34 |
| Audio-visual understanding | IntentBench | Accuracy: 59.5 | 20 |
| Video Understanding | TOMATO | Score: 24.9 | 18 |
| Audio-Video Understanding | OmniBench | Score: 48.3 | 10 |
| Audio-Video Understanding | AV-Counting | Primary Score: 23 | 10 |

Showing 10 of 11 rows.
