
Learning Local and Global Temporal Contexts for Video Semantic Segmentation

About

Contextual information plays a core role in video semantic segmentation (VSS). This paper groups the contexts for VSS into two categories: local temporal contexts (LTC), which are the contexts from neighboring frames, and global temporal contexts (GTC), which represent the contexts from the whole video. LTC includes static and motional contexts, corresponding to static and moving content in neighboring frames, respectively. Both static and motional contexts have been studied before, but no prior work learns them simultaneously, even though the two are highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts static and motional contexts, and CFM mines useful information from nearby frames to enhance target features. To exploit further temporal context, we propose CFFM++, which additionally learns GTC from the whole video. Specifically, we uniformly sample a number of frames from the video and extract global contextual prototypes by k-means. The information within those prototypes is mined by CFM to refine target features. Experimental results on popular benchmarks demonstrate that CFFM and CFFM++ perform favorably against state-of-the-art methods. Our code is available at https://github.com/GuoleiSun/VSS-CFFM
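The GTC step described in the abstract (uniform frame sampling, k-means prototypes, then refinement of the target features) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the toy feature shapes, the number of prototypes and the k-means iteration count are all assumptions made for illustration, and the refinement step uses generic scaled dot-product attention as a stand-in for CFM, whose actual design is given in the paper and the linked repository.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples):
    """Pick frame indices spread evenly over the video (hypothetical helper)."""
    num_samples = min(num_samples, num_frames)
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def kmeans_prototypes(features, k, iters=10, seed=0):
    """Plain Lloyd's k-means over (N, C) features; returns (k, C) centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct feature vectors (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every centroid: (N, k).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def refine_with_prototypes(target, prototypes):
    """Assumed stand-in for CFM: attention from target tokens to prototypes,
    added back to the target as a residual."""
    scores = target @ prototypes.T / np.sqrt(target.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return target + weights @ prototypes

# Toy video: 40 frames, 16 feature tokens per frame, 8 channels (made-up sizes).
video_feats = np.random.default_rng(0).normal(size=(40, 16, 8))
idx = sample_frame_indices(num_frames=40, num_samples=5)
sampled = video_feats[idx].reshape(-1, 8)      # pool tokens of the sampled frames
protos = kmeans_prototypes(sampled, k=4)       # global contextual prototypes
refined = refine_with_prototypes(video_feats[-1], protos)  # refine last frame
```

Clustering compresses the sampled frames into a handful of prototype vectors, so the cost of attending to whole-video context depends on the number of prototypes rather than on the video length.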

Guolei Sun, Yun Liu, Henghui Ding, Min Wu, Luc Van Gool • 2022

Related benchmarks

Task                          Dataset                                  Result (mIoU)   Rank
Video Semantic Segmentation   VSPW (val)                               50.1            121
Video Semantic Segmentation   Cityscapes (val)                         75.7            103
Video Semantic Segmentation   VSPW                                     50.1            52
Video Semantic Segmentation   CamVid                                   62.3            41
Video Semantic Segmentation   NYU V2                                   46.7            27
Semantic Video Segmentation   Cityscapes (test)                        75.7            24
Video Semantic Segmentation   VSPW 17 (test)                           42              20
Video Semantic Segmentation   VSPW W2F protocol (10% warm-up ratio)    49.6            9
Video Semantic Segmentation   VSPW W2F protocol (25% warm-up ratio)    49.7            9
Video Semantic Segmentation   VSPW W2F protocol (50% warm-up ratio)    49.7            9
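All results above are reported as mIoU (mean intersection over union). As a reminder of what the metric measures, here is a minimal sketch that computes it from a confusion matrix. The function name and the `ignore_index=255` convention are assumptions for illustration; benchmark toolkits typically accumulate the confusion matrix over the whole evaluation set rather than a single prediction, and the exact protocol of each benchmark may differ.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index          # drop unlabeled pixels (assumed convention)
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0                     # skip classes absent from both
    return (inter[present] / union[present]).mean()

# Tiny example: 6 pixels, 3 classes, the last pixel is unlabeled.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.4f}")  # per-class IoU is 1/2, 2/3 and 1, so mIoU = 13/18
```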
