Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation
About
Referring video segmentation aims to segment the video object described by a natural-language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features, and insert a vision-language mutual guidance (VLMG) module into the encoder at multiple stages to promote the hierarchical and progressive fusion of multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes the multi-granularity linguistic context into account and, with the help of VLMG, realizes deep interleaving between the modalities.

To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module that strengthens temporal coherence: it uses language-guided spatial-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively. Extensive experiments on four datasets verify the effectiveness of the proposed model.
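The internals of VLMG are not spelled out here, but the description (mutual guidance between a CNN visual stream and a transformer linguistic stream, inserted at multiple encoder stages) suggests a pair of cross-attention passes per stage. Below is a minimal PyTorch sketch under that assumption; the class name `VLMGBlock` and all dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VLMGBlock(nn.Module):
    """Illustrative vision-language mutual guidance block (assumption, not the
    paper's code): each modality attends to the other, and the refined
    features are passed on to the next stage of its own stream."""

    def __init__(self, vis_dim: int, lang_dim: int, embed_dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, embed_dim, kernel_size=1)
        self.lang_proj = nn.Linear(lang_dim, embed_dim)
        # cross-attention in both directions realizes the "mutual guidance"
        self.lang_to_vis = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis: (B, C, H, W) CNN feature map; lang: (B, L, D) token features
        b, _, h, w = vis.shape
        v = self.vis_proj(vis).flatten(2).transpose(1, 2)  # (B, HW, E)
        l = self.lang_proj(lang)                           # (B, L, E)
        # vision queries attend to language tokens (language guides vision)
        v_out, _ = self.lang_to_vis(query=v, key=l, value=l)
        # language queries attend to vision positions (vision guides language)
        l_out, _ = self.vis_to_lang(query=l, key=v, value=v)
        # residual updates; note both streams now live in embed_dim channels
        v_new = (v + v_out).transpose(1, 2).reshape(b, -1, h, w)
        l_new = l + l_out
        return v_new, l_new
```

Stacking one such block after each encoder stage is what gives the "deeply interleaved" fusion: each stream's next stage consumes features already conditioned on the other modality.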
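Likewise, a "position-specific dynamic filter" can be read as a per-pixel kernel predicted from the language-guided spatial-temporal features and applied locally to the current frame. The sketch below shows a single-scale version under that assumption (`LMDF`, `kernel_size`, and the residual update are illustrative); a multi-scale variant would repeat this with several kernel sizes and merge the outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMDF(nn.Module):
    """Illustrative language-guided dynamic filtering (assumption, not the
    paper's code): a k x k filter is predicted for every spatial position
    from the guidance features and convolved over the current frame."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # predicts one k*k filter per spatial position from the guidance
        self.filter_head = nn.Conv2d(dim, kernel_size * kernel_size, kernel_size=1)

    def forward(self, cur_feat: torch.Tensor, guidance: torch.Tensor):
        # cur_feat: (B, C, H, W) features of the current frame
        # guidance: (B, C, H, W) language-guided spatial-temporal features
        b, c, h, w = cur_feat.shape
        filters = self.filter_head(guidance)        # (B, k*k, H, W)
        filters = torch.softmax(filters, dim=1)     # normalize each local kernel
        # gather local k x k neighborhoods of the current frame
        patches = F.unfold(cur_feat, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        # weight each neighbor with its position-specific filter
        weights = filters.view(b, 1, self.k * self.k, h * w)
        out = (patches * weights).sum(dim=2).view(b, c, h, w)
        return out + cur_feat  # residual update of the current frame
```

Because the filter weights depend on both the language expression and the surrounding frames, the update can emphasize the referred object while staying temporally coherent.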
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Referring Video Object Segmentation | Ref-DAVIS 2017 (val) | J&F | 50.02 | 178 |
| Video segmentation from a sentence | A2D Sentences (test) | Overall IoU | 71.4 | 122 |
| Referring Video Segmentation | Refer-Youtube-VOS (val) | J Index | 48.44 | 44 |
| Referring Video Segmentation | JHMDB Sentences | Precision @ 0.5 | 87.4 | 16 |