AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
About
Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
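GRPO optimizes the policy against a scalar reward computed per sampled response. As a minimal sketch of what a counting reward could look like (this is an illustration, not the paper's actual reward; the function name `counting_reward` and the partial-credit scheme are our assumptions), one might score exact matches fully and decay the reward with relative error:

```python
import re

def counting_reward(response: str, ground_truth: int) -> float:
    """Hypothetical GRPO-style reward for a counting answer:
    1.0 for an exact count, partial credit decaying with relative error."""
    # Take the last integer in the response as the predicted count.
    matches = re.findall(r"\d+", response)
    if not matches:
        return 0.0  # no numeric answer -> zero reward
    pred = int(matches[-1])
    if pred == ground_truth:
        return 1.0
    # Partial credit: shrink the reward by the relative error, floored at 0.
    rel_err = abs(pred - ground_truth) / max(ground_truth, 1)
    return max(0.0, 1.0 - rel_err)
```

In a GRPO setup, a reward like this would be computed for each of the sampled responses in a group, and the group-normalized advantages would drive the policy update.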
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | -- | -- | 425 |
| Video Understanding | Video-MME | Overall Score | 56.8 | 92 |
| Audio-Visual Understanding | DailyOmni | Average Score | 53.8 | 69 |
| Video Understanding | LVBench | Average Score | 33.7 | 67 |
| Audio-Visual Understanding | WorldSense | Accuracy | 44.6 | 42 |
| Video Reasoning | Video-Holmes | Score | 39.6 | 34 |
| Audio-Visual Understanding | IntentBench | Accuracy | 59.5 | 20 |
| Video Understanding | TOMATO | Score | 24.9 | 18 |
| Audio-Visual Understanding | OmniBench | Score | 48.3 | 10 |
| Audio-Visual Understanding | AV-Counting | Primary Score | 23 | 10 |