MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
About
Recently, MLP-Like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-Like architecture on video domain has not been explored, due to complex spatial-temporal modeling with large computation burden. To fill this gap, we present an efficient self-attention free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, i.e., MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s can effectively capture core semantics in each frame, by progressive token interaction along both height and width dimensions. Alternatively, MorphFC_t can adaptively learn long-term dependency over frames, by temporal token aggregation on each spatial location. With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance. Finally, we evaluate our MorphMLP on a number of popular video benchmarks. Compared with the recent state-of-the-art models, MorphMLP significantly reduces computation but with better accuracy, e.g., MorphMLP-S only uses 50% GFLOPs of VideoSwin-T but achieves 0.9% top-1 improvement on Kinetics400, under ImageNet1K pretraining. MorphMLP-B only uses 43% GFLOPs of MViT-B but achieves 2.4% top-1 improvement on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, our method adapted to the image domain outperforms previous SOTA MLP-Like architectures. Code is available at https://github.com/MTLab/MorphMLP.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU45.9 | 2731 | |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy83.4 | 1866 | |
| Action Recognition | Something-something v1 (val) | Top-1 Acc57.4 | 257 | |
| Video Classification | Kinetics 400 (val) | Top-1 Acc80.8 | 204 | |
| Action Recognition | Something-Something v2 (test val) | Top-1 Accuracy70.1 | 187 |