TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition
About
Recognizing the transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, but in recent works it performs much worse than instance discrimination approaches (InstDisc). However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Because they rely on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, and it consistently outperforms the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top-1 Acc), and improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
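The contrast between hard-label classification and relative (ranking-based) recognition can be illustrated with a small sketch. The abstract does not state the exact loss, so the pairwise hinge form below, the `margin` parameter, the function names, and the score layout (`scores[i][k]` is the score of transformation `k` on the clip that received transformation `i`, with all clips drawn from the same video) are assumptions made for illustration only, not the authors' implementation.

```python
import math
from typing import List


def cross_entropy_loss(scores: List[List[float]]) -> float:
    """Hard-label baseline: clip i must pick transformation i via softmax."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def margin_ranking_loss(scores: List[List[float]], margin: float = 1.0) -> float:
    """Assumed ranking form: for each transformation k, the clip that really
    received it (clip k) should outscore every other clip j by `margin`."""
    n = len(scores)
    total, pairs = 0.0, 0
    for k in range(n):
        for j in range(n):
            if j == k:
                continue
            total += max(0.0, margin - (scores[k][k] - scores[j][k]))
            pairs += 1
    return total / pairs


# Hypothetical scores for two clips of one video (two transformations).
# Clip 0 would be misclassified on its own (its largest score is for
# transformation 1), yet the relative order across clips is correct.
scores = [[0.2, 0.9],
          [-1.5, 2.0]]
print("cross-entropy:", cross_entropy_loss(scores))
print("ranking:", margin_ranking_loss(scores))
```

In this toy case the classification loss penalizes clip 0 even though, compared with the other clip of the same video, it is clearly the better match for transformation 0; the ranking loss is zero because only the relative order matters.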
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 (3 splits) | Accuracy | 90.7 | 155 |
| Video Action Recognition | HMDB-51 (3 splits) | Accuracy | 64.2 | 116 |
| Video Retrieval | UCF101 (1) | Top-1 Accuracy | 54 | 92 |
| Video Retrieval | HMDB51 (first split) | Top-1 Accuracy | 25.5 | 49 |
| Action Classification | HMDB51 1.0 (fine-tuned) | Accuracy | 60.1 | 16 |
| Action Classification | UCF101 1.0 (fine-tuned) | Accuracy | 87.8 | 16 |