Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
About
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance: our method outperforms existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, which validates its robustness and generalization.
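To make the idea of a hierarchical reward more concrete, the following is a minimal illustrative sketch, not the paper's actual formulation. The abstract names only the three ingredients (tool-usage efficiency, sub-motion relevance, structural coherence); the `Rollout` fields, the gating order, the tool budget, and all weights below are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled reasoning trace. All fields are hypothetical."""
    well_formed: bool          # trace follows the expected reasoning structure
    tool_calls: int            # number of external tool invocations
    matched_sub_motions: int   # predicted sub-motions that match the target action
    total_sub_motions: int     # sub-motions listed for the target action


def hierarchical_reward(
    r: Rollout,
    max_tool_calls: int = 3,   # assumed tool budget
    w_structure: float = 0.2,  # assumed weights, not taken from the paper
    w_tool: float = 0.3,
    w_motion: float = 0.5,
) -> float:
    # Level 1: structural coherence gates every other term.
    if not r.well_formed:
        return 0.0

    # Level 2a: tool-usage efficiency. Full credit within the budget,
    # proportionally less once the budget is exceeded.
    if r.tool_calls <= max_tool_calls:
        tool_score = 1.0
    else:
        tool_score = max_tool_calls / r.tool_calls

    # Level 2b: sub-motion relevance as the fraction of matched sub-motions.
    if r.total_sub_motions > 0:
        motion_score = r.matched_sub_motions / r.total_sub_motions
    else:
        motion_score = 0.0

    return w_structure + w_tool * tool_score + w_motion * motion_score


if __name__ == "__main__":
    print(hierarchical_reward(Rollout(True, 2, 3, 4)))
    print(hierarchical_reward(Rollout(False, 1, 4, 4)))
```

In this sketch the structure check acts as a gate, so a malformed trace earns nothing regardless of how well it matches sub-motions; whether the original method gates or simply sums its terms is not stated in the abstract.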
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF-101 | Top-1 Acc | 99.7 | 225 |
| Action Recognition | Kinetics-600 | Top-1 Acc | 98.2 | 97 |
| Action Recognition | UCF-101 | Accuracy | 99.7 | 60 |
| Action Recognition | HMDB-51 | Accuracy | 92.5 | 55 |
| Action Recognition | HMDB-51 | Base Accuracy | 92.3 | 51 |
| Action Recognition | UCF-101 | Base Accuracy | 99.6 | 44 |
| Action Recognition | Kinetics-400 | Base Accuracy | 96.3 | 42 |
| Action Recognition | K400 | Top-1 Accuracy | 96.7 | 39 |
| Action Recognition | Something-Something v2 | Base Score | 19.2 | 26 |
| Action Recognition | HMDB51 (full) | Top-1 Accuracy | 90.1 | 15 |