Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

About

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching and dynamically invokes domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance: Video-STAR outperforms existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating its robustness and generalization.

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li • 2025
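
The abstract names three reward terms (tool-usage efficiency, sub-motion relevance, structural coherence) but gives no formula. The sketch below is only one plausible way such terms could be combined, not the paper's formulation. Everything in it is an assumption made for illustration: the function and class names, the weights, the choice to gate on well-formed output, and the idea that fewer tool calls means higher efficiency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not state any values.
    structure: float = 0.3
    tool_efficiency: float = 0.2
    submotion_relevance: float = 0.5


def hierarchical_reward(well_formed, tool_calls, max_tool_calls, relevance,
                        weights=RewardWeights()):
    """Toy reward in [0, 1] combining the three terms named in the abstract.

    Assumed hierarchy: a response whose reasoning structure is malformed
    earns nothing, so the other terms only count once structure is satisfied.
    """
    if max_tool_calls <= 0:
        raise ValueError("max_tool_calls must be positive")
    if not well_formed:
        return 0.0
    # Assumption: using fewer of the allowed tool calls is more efficient.
    used = min(max(tool_calls, 0), max_tool_calls)
    efficiency = 1.0 - used / max_tool_calls
    # Clamp the (hypothetical) sub-motion relevance score into [0, 1].
    relevance = min(max(relevance, 0.0), 1.0)
    return (weights.structure
            + weights.tool_efficiency * efficiency
            + weights.submotion_relevance * relevance)
```

Under these assumed weights, a well-formed response that uses no tools and has perfect sub-motion relevance scores the maximum, while a malformed response scores zero regardless of the other terms.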

Related benchmarks

Task	Dataset	Result
Action Recognition	UCF-101	Top-1 Acc 99.7	225
Action Recognition	Kinetics-600	Top-1 Acc 98.2	97
Action Recognition	UCF-101	Accuracy 99.7	60
Action Recognition	HMDB-51	Accuracy 92.5	55
Action Recognition	HMDB-51	Base Accuracy 92.3	51
Action Recognition	UCF-101	Base Accuracy 99.6	44
Action Recognition	Kinetics-400	Base Accuracy 96.3	42
Action Recognition	K400	Top-1 Accuracy 96.7	39
Action Recognition	Something-Something v2	Base Score 19.2	26
Action Recognition	HMDB51 (full)	Top-1 Accuracy 90.1	15
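
Several of the results above are top-1 accuracies. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the percentage of samples whose highest-scoring class matches the ground-truth label. The function name and the toy scores are illustrative and are not taken from the paper or its evaluation code.

```python
def top1_accuracy(scores, labels):
    """Return the percentage of samples whose arg-max class equals the label.

    scores: list of per-class score lists, one per sample.
    labels: list of integer ground-truth class indices.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest-scoring class for this sample.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example with two classes and four clips (made-up numbers).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_labels))  # 75.0
```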

