Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

About

Language-queried video actor segmentation aims to predict the pixel-level mask of the actor that performs the actions described by a natural language query in the target frames. Existing methods adopt 3D CNNs over the video clip as a general encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are well suited to recognizing which actor is performing the queried actions, they also inevitably introduce misaligned spatial information from adjacent frames, which confuses the features of the target frame and yields inaccurate segmentation. Therefore, we propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors. In the decoder, a Language-Guided Feature Selection (LGFS) module is proposed to flexibly integrate spatial and temporal features from the two encoders. We also propose a Cross-Modal Adaptive Modulation (CMAM) module to dynamically recombine spatial- and temporal-relevant linguistic features for multimodal feature interaction in each stage of the two encoders. Our method achieves new state-of-the-art performance on two popular benchmarks with less computational overhead than previous approaches.

Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, Fei Wang • 2021
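
The abstract says the decoder uses language to integrate spatial and temporal features but does not spell out the mechanism. As a rough illustration only, one generic way to let a sentence embedding choose between two feature sources is a per-channel sigmoid gate. Everything in the sketch below (the names `language_gate` and `fuse`, the weight matrix, the gating formula) is a hypothetical stand-in and should not be read as the authors' LGFS design.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def language_gate(sentence: List[float], weights: List[List[float]]) -> List[float]:
    """Hypothetical gate: one weight per feature channel, computed from the
    sentence embedding by a linear projection followed by a sigmoid."""
    return [
        sigmoid(sum(w * s for w, s in zip(row, sentence)))
        for row in weights
    ]


def fuse(spatial: List[float], temporal: List[float], gate: List[float]) -> List[float]:
    """Blend the two feature vectors channel by channel.
    A gate near 1 keeps the spatial feature, a gate near 0 keeps the temporal one."""
    return [g * s + (1.0 - g) * t for s, t, g in zip(spatial, temporal, gate)]


# Toy example: a 2-dimensional sentence embedding and 2 feature channels.
sentence = [1.0, 0.0]
weights = [[0.0, 0.0], [100.0, 0.0]]  # assumed values, chosen only for illustration
gate = language_gate(sentence, weights)
fused = fuse([2.0, 4.0], [0.0, 0.0], gate)
```

In this toy setup the first channel gets a neutral gate and averages both sources, while the second channel is driven almost entirely by the spatial feature. A real model would apply such a gate at every pixel of a feature map and learn the projection weights.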

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | -- | 200 |
| Video segmentation from a sentence | A2D Sentences (test) | Overall IoU 66.2 | 122 |
| Referring Video Object Segmentation | JHMDB Sentences (test) | Overall IoU 0.598 | 83 |
| Referring Video Segmentation | JHMDB Sentences (test) | mAP (0.5:0.95) 33.5 | 35 |
| Referring Video Object Segmentation | A2D Sentences v1.0 (test) | IoU Overall 66.2 | 26 |
| Segmentation from a sentence | J-HMDB Sentences (test) | P@0.5 0.783 | 20 |
| Referring Video Object Segmentation | A2D-S (test) | oIoU 66.2 | 17 |
| Referring Video Segmentation | JHMDB Sentences | Precision @ 0.5 78.3 | 16 |
| Actor and Action Segmentation | A2D-S (val) | oIoU 66.2 | 10 |
| Actor and Action Segmentation | JHMDB-S (val) | oIoU 59.8 | 9 |
Showing 10 of 12 rows
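
The metrics listed here follow the usual conventions for this task: Overall IoU (oIoU) divides the total intersection area by the total union area accumulated over all test samples, and Precision@K (P@K) is the fraction of samples whose individual IoU exceeds the threshold K. The sketch below is a minimal pure-Python illustration of those two definitions on binary masks. The function names and the toy masks are assumptions made for this example, not code from the paper or from any benchmark toolkit.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def overall_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Total intersection over total union, accumulated across all samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def precision_at(pairs: List[Tuple[Mask, Mask]], threshold: float) -> float:
    """Fraction of samples whose per-sample IoU is strictly above the threshold."""
    hits = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        iou = inter / union if union else 0.0
        if iou > threshold:
            hits += 1
    return hits / len(pairs) if pairs else 0.0


# Two toy (prediction, ground truth) pairs with per-sample IoU 0.5 and 0.75.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    ([[1, 1], [1, 1]], [[1, 1], [1, 0]]),
]
oiou = overall_iou(pairs)        # (1 + 3) / (2 + 4)
p_at_05 = precision_at(pairs, 0.5)
```

Both functions return fractions in the range 0 to 1. The rows above appear to mix fractions (0.598, 0.783) with percentages (59.8, 78.3) for the same quantities, so multiply by 100 when comparing against the percentage-style entries.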
