
Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

About

The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we identify limitations in existing OV-TAL evaluation schemes and propose a new benchmark for thorough assessment. Finally, we showcase the TAL performance of the large multimodal model Gemini-1.5 on our new benchmark. Code is released at https://github.com/HYUNJS/STOV-TAL.
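The two-stage pipeline above can be sketched in code. This is a minimal toy illustration, not the authors' implementation: all function names (`train_localizer`, `generate_pseudo_labels`, `self_train`) and the stand-in "model" (which merely remembers the mean segment length) are hypothetical, chosen only to show the data flow of stage (1) pseudo-labeling and stage (2) retraining on the combined set.

```python
def train_localizer(annotations):
    """Toy stand-in for fitting a class-agnostic action localizer.

    `annotations` maps video id -> list of (start, end) action segments.
    The "model" here just remembers the mean segment length it saw.
    """
    lengths = [e - s for segs in annotations.values() for (s, e) in segs]
    return {"mean_len": sum(lengths) / len(lengths)}

def generate_pseudo_labels(localizer, unlabeled, score_thresh=0.5):
    """Stage 1: run the localizer over unlabeled videos and keep only
    confident class-agnostic segments as pseudo-labels.

    `unlabeled` maps video id -> video duration (seconds).
    """
    pseudo = {}
    for vid, duration in unlabeled.items():
        # Toy "prediction": one segment of the learned mean length,
        # clipped to the video, with a fixed stand-in confidence score.
        seg = (0.0, min(localizer["mean_len"], duration))
        score = 0.8
        if score >= score_thresh:
            pseudo[vid] = [seg]
    return pseudo

def self_train(human_annotations, unlabeled):
    """Stage 2: retrain on human labels plus the pseudo-labeled videos."""
    base = train_localizer(human_annotations)         # trained on labeled TAL data
    pseudo = generate_pseudo_labels(base, unlabeled)  # pseudo-label web videos
    combined = {**human_annotations, **pseudo}        # enlarged training set
    return train_localizer(combined)
```

In the paper's actual setting, the localizer is a learned model and the unlabeled pool is web-scale YouTube video; the sketch only mirrors the generate-then-retrain loop.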

Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim • 2024

Related benchmarks

Task                         | Dataset                                  | Metric      | Result | Rank
Temporal Action Localization | THUMOS14 v1.0 (50%-50%)                  | mAP (Avg)   | 48.8   | 17
Temporal Action Localization | ActivityNet 1.3 (50%-50%)                | Avg mAP     | 29.6   | 17
Temporal Action Detection    | THUMOS14 (50% Seen / 50% Unseen)         | mAP@0.3     | 56.3   | 11
Temporal Action Detection    | ActivityNet v1.3 (50% Seen / 50% Unseen) | mAP@0.50    | 48.4   | 11
Temporal Action Detection    | ActivityNet v1.3 (75% Seen / 25% Unseen) | mAP@IoU=0.5 | 52     | 11
Temporal Action Detection    | THUMOS14 (75% Seen / 25% Unseen)         | mAP@0.3     | 59.5   | 11
