Scaling Open-Vocabulary Action Detection
About
In this work, we focus on scaling open-vocabulary action detection. Existing approaches to action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) the parameter-heavy adaptations needed to convert a pretrained vision-language contrastive model for detection, which risk overfitting the additional non-pretrained parameters to base action classes. First, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions. Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior work in open-vocabulary action detection and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever using them for training, reporting novel results to serve as baselines for future work. Our code is available at https://siatheindochinese.github.io/sia_act_page/.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Detection | UCF-101-24 (test) | F1 Score (IoU=0.5) | 88.5 | 15 |
| Action Detection | JHMDB (test) | F@0.5 | 57.1 | 11 |
| Action Detection | JHMDB closed-set | F@0.5 | 88.5 | 7 |
| Action Detection | MultiSports (test) | F1 Score @ IoU 0.5 | 1.3 | 6 |
| Action Detection | UCF-MAMA (test) | F1 Score (IoU=0.5) | 0.6 | 6 |
| Action Detection | JHMDB (75-25 Split) | Novel@0.5 | 83.2 | 3 |
| Action Detection | MultiSports closed-set | F1 Score @ IoU 0.5 | 28.8 | 3 |
| Action Detection | UCF-101-24 (75-25 split) | Novel@0.5 | 97.1 | 2 |
| Action Detection | UCF-101-24 (50-50 Split) | Novel@0.5 | 75.1 | 2 |
| Action Detection | JHMDB (50-50 Split) | Novel@0.5 | 61 | 2 |
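
Most rows above report a score at an IoU threshold of 0.5. As a rough illustration of what such a number measures, the following is a minimal sketch of a box-level F1 score at a fixed IoU threshold. It is an assumption-laden toy, not the evaluation protocol of the paper or of any listed benchmark: the function names `iou` and `f1_at_iou`, the greedy one-to-one matching, and the requirement that labels match exactly are all hypothetical choices made for illustration. Real protocols may operate on frames or tubes and may rank predictions by confidence.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(
    preds: List[Tuple[Box, str]],
    gts: List[Tuple[Box, str]],
    thr: float = 0.5,
) -> float:
    """F1 over (box, label) pairs; hypothetical greedy matching.

    A prediction is a true positive if it overlaps a not-yet-matched
    ground-truth box of the same label with IoU >= thr.
    """
    matched = set()
    tp = 0
    for p_box, p_label in preds:
        best_j, best_iou = -1, thr
        for j, (g_box, g_label) in enumerate(gts):
            if j in matched or g_label != p_label:
                continue
            overlap = iou(p_box, g_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching ground truth
    fn = len(gts) - tp    # ground-truth boxes never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


preds = [((0, 0, 10, 10), "run"), ((20, 20, 30, 30), "jump")]
gts = [((0, 0, 10, 9), "run"), ((50, 50, 60, 60), "jump")]
print(f1_at_iou(preds, gts))
```

In this toy input, one of the two predictions overlaps its ground-truth box well enough to count, so precision and recall are both one half.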