OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

About

Due to the resource-intensive nature of training vision-language models on expansive video data, most studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, which frames video recognition as solving for the optimal matching flow between frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
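The "optimal matching flow" idea can be illustrated with a small entropy-regularised optimal transport sketch. This is not the authors' implementation: the abstract does not name a solver, so the Sinkhorn iteration, the uniform marginals, the cosine-similarity cost, and all function names and parameters (`cosine`, `sinkhorn_plan`, `ot_score`, `eps`, `n_iters`) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Approximate transport plan for a cost matrix (assumed uniform marginals)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each frame (assumption: uniform)
    b = [1.0 / m] * m  # mass on each descriptor (assumption: uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_score(frames, descriptors, eps=0.1, n_iters=200):
    """Similarity between a video and one class, weighted by the transport plan."""
    sim = [[cosine(f, d) for d in descriptors] for f in frames]
    cost = [[1.0 - s for s in row] for row in sim]  # high similarity = low cost
    plan = sinkhorn_plan(cost, eps, n_iters)
    return sum(
        plan[i][j] * sim[i][j]
        for i in range(len(frames))
        for j in range(len(descriptors))
    )


# Toy data: two frame embeddings and two hypothetical classes,
# each described by two descriptor embeddings.
frames = [[1.0, 0.0], [0.0, 1.0]]
descriptors_a = [[1.0, 0.0], [0.0, 1.0]]    # aligned with the frames
descriptors_b = [[-1.0, 0.0], [0.0, -1.0]]  # opposed to the frames

score_a = ot_score(frames, descriptors_a)
score_b = ot_score(frames, descriptors_b)
plan_a = sinkhorn_plan([[0.0, 1.0], [1.0, 0.0]])
```

Under these assumptions, a video would be assigned to the class whose descriptor set yields the highest `ot_score`; in the toy data, the class with aligned descriptors scores higher than the opposed one.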

Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen • 2023

Related benchmarks

Task	Dataset	Result
Action Recognition	UCF101 (test)	--	376
Action Recognition	HMDB51 (test)	--	249
Action Recognition	UCF-101	Top-1 Acc 93.9	225
Video Recognition	HMDB51	Accuracy 67.3	145
Video Action Classification	Something-Something v2	Top-1 Acc 12.6	145
Action Recognition	SSV2	Top-1 Acc 12.2	142
Video Recognition	UCF101	Accuracy 93.9	111
Action Recognition	Kinetics-600	Top-1 Acc 75.1	97
Action Recognition	Kinetics-600 (test)	Top-1 Accuracy 75.1	84
Video Classification	Kinetics-600 (val)	--	84

