PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

About

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.

Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha • 2023
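
The abstract notes that predicted polygons can later be converted into segmentation masks. As a rough illustration of that conversion step only, the sketch below rasterizes one polygon with floating-point vertices into a binary mask using a scanline even-odd fill. It is not taken from the PolyFormer code; the function name `polygon_to_mask`, the pixel-center sampling convention, and the list-of-lists mask layout are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates, floating point


def polygon_to_mask(vertices: List[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize a single polygon into a binary mask (hypothetical helper).

    A pixel is set to 1 when its center (col + 0.5, row + 0.5) lies inside
    the polygon according to the even-odd rule.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    if n < 3:
        return mask  # fewer than three vertices cannot enclose any area

    for row in range(height):
        py = row + 0.5
        # Collect the x-coordinates where this scanline crosses polygon edges.
        crossings = []
        for i in range(n):
            x0, y0 = vertices[i]
            x1, y1 = vertices[(i + 1) % n]
            if (y0 <= py) != (y1 <= py):  # edge straddles the scanline
                t = (py - y0) / (y1 - y0)
                crossings.append(x0 + t * (x1 - x0))
        crossings.sort()
        # Fill between successive pairs of crossings (even-odd rule).
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(width):
                if left <= col + 0.5 < right:
                    mask[row][col] = 1
    return mask


# Usage: a 3x3 square inside a 6x6 image.
square = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
mask = polygon_to_mask(square, height=6, width=6)
filled = sum(sum(r) for r in mask)
```

A target described by several polygons could be handled by rasterizing each one and combining the masks; in practice an image library's polygon fill would likely be faster than this pure-Python loop.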

Related benchmarks

Task                                Dataset            Result           Rank
Referring Expression Comprehension  RefCOCO+ (val)     Accuracy 84.98   345
Referring Expression Comprehension  RefCOCO (val)      Accuracy 90.38   335
Referring Expression Comprehension  RefCOCO (testA)    Accuracy 0.9289  333
Referring Expression Comprehension  RefCOCOg (test)    Accuracy 85.91   291
Referring Expression Comprehension  RefCOCOg (val)     Accuracy 85.83   291
Referring Expression Comprehension  RefCOCO+ (testB)   Accuracy 77.97   235
Referring Expression Segmentation   RefCOCO (testA)    cIoU 78.5        217
Referring Expression Comprehension  RefCOCO+ (testA)   Accuracy 89.77   207
Referring Expression Segmentation   RefCOCO+ (val)     cIoU 72.2        201
Referring Image Segmentation        RefCOCO+ (test-B)  mIoU 66.73       200
