
Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

About

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance: the method outperforms existing approaches in distinguishing fine-grained actions and handling cross-modal hallucination, supporting its robustness and generalization.
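
The abstract names three reward terms (tool-usage efficiency, sub-motion relevance, structural coherence) but gives no formula or weights. As a rough illustration only, the sketch below combines three such terms with a flat weighted sum. This is a swapped-in simplification, not the paper's hierarchical reward: the function name `combined_reward`, the `RewardWeights` values, the tool-call budget, and the scoring of each term are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not report any values.
    tool: float = 0.2
    relevance: float = 0.5
    structure: float = 0.3


def combined_reward(
    tool_calls: int,
    tool_budget: int,
    relevance: float,
    well_formed: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Flat weighted sum of three reward terms (illustrative assumption)."""
    if tool_budget <= 0:
        raise ValueError("tool_budget must be positive")

    # Tool-usage efficiency (assumed form): full credit within the budget,
    # shrinking proportionally once the number of calls exceeds it.
    if tool_calls <= tool_budget:
        efficiency = 1.0
    else:
        efficiency = tool_budget / tool_calls

    # Sub-motion relevance is assumed to arrive as a score, clamped to [0, 1].
    relevance = max(0.0, min(1.0, relevance))

    # Structural coherence (assumed form): binary check on the output format.
    structure = 1.0 if well_formed else 0.0

    return (
        weights.tool * efficiency
        + weights.relevance * relevance
        + weights.structure * structure
    )


if __name__ == "__main__":
    # One tool call within a budget of two, moderate relevance, valid format.
    print(combined_reward(tool_calls=1, tool_budget=2, relevance=0.5, well_formed=True))
```

A weighted sum keeps the trade-off between the three terms explicit and easy to tune. A genuinely hierarchical design might instead gate the later terms on the earlier ones, which this sketch does not attempt.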

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF-101 | Top-1 Acc | 99.7 | 225 |
| Action Recognition | Kinetics-600 | Top-1 Acc | 98.2 | 97 |
| Action Recognition | UCF-101 | Accuracy | 99.7 | 60 |
| Action Recognition | HMDB-51 | Accuracy | 92.5 | 55 |
| Action Recognition | HMDB-51 | Base Accuracy | 92.3 | 51 |
| Action Recognition | UCF-101 | Base Accuracy | 99.6 | 44 |
| Action Recognition | Kinetics-400 | Base Accuracy | 96.3 | 42 |
| Action Recognition | K400 | Top-1 Accuracy | 96.7 | 39 |
| Action Recognition | Something-Something v2 | Base Score | 19.2 | 26 |
| Action Recognition | HMDB51 (full) | Top-1 Accuracy | 90.1 | 15 |
