
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation

About

Video object segmentation with referring expressions (language-guided VOS) takes a linguistic phrase and a video and generates binary masks for the object to which the phrase refers. Our work argues that existing benchmarks used for this task are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial referring expressions (REs), with the non-trivial REs annotated with seven RE semantic categories. We leverage this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for language-guided VOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.

Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, Xavier Giro-i-Nieto • 2020

Related benchmarks

Task | Dataset | Result | Rank
Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 36.17 | 200
Referring Image Segmentation | RefCOCO (val) | -- | 197
Referring Image Segmentation | RefCOCO (test A) | -- | 178
Video segmentation from a sentence | A2D Sentences (test) | Overall IoU 67.2 | 122
Referring Image Segmentation | RefCOCO (test-B) | -- | 119
Referring Image Segmentation | RefCOCO+ (val) | -- | 117
Referring Image Segmentation | RefCOCO+ (test-A) | -- | 89
Language-guided Video Object Segmentation | DAVIS 1st frame Referring Expressions 2017 (val) | J&F Score 44.5 | 6
Language-guided Video Object Segmentation | A2D (test) | Precision @0.5 57.8 | 5
Language-guided Video Object Segmentation | DAVIS full video Referring Expressions 2017 (val) | J&F Score 45.1 | 5
Showing 10 of 11 rows
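
The mask-overlap metrics named in the table (mIoU, Overall IoU, Precision @0.5) can be illustrated with a small sketch. It is not taken from the RefVOS code; it assumes the definitions commonly used for these benchmarks: mean IoU averages the per-sample IoU, overall IoU divides total intersection by total union across all samples, and Precision @0.5 is the fraction of samples whose IoU exceeds 0.5. The function names, the strict `>` comparison, and the handling of empty masks are assumptions made for illustration. The J&F score additionally averages in a contour accuracy term and is not covered here.

```python
def intersection_and_union(pred, target):
    """Count intersection and union pixels of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def summarize(pairs, threshold=0.5):
    """Compute mean IoU, overall IoU and precision at a threshold over (pred, target) pairs."""
    ious = []
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    return {
        "mean_iou": sum(ious) / len(ious),
        "overall_iou": total_inter / total_union if total_union else 1.0,
        # Assumption: a sample counts as correct when its IoU is strictly above the threshold.
        "precision_at_threshold": sum(iou > threshold for iou in ious) / len(ious),
    }


# Two toy 2x2 examples (hypothetical data, not benchmark results).
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [1, 0]]),  # IoU = 1/3
    ([[1, 1], [1, 1]], [[1, 1], [1, 0]]),  # IoU = 3/4
]
metrics = summarize(pairs)
print(metrics)
```

Mean IoU weights every sample equally, whereas overall IoU weights samples by mask size, so large objects dominate it. This is one reason the two numbers in a benchmark table are not directly comparable.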
