
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

About

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.

Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha • 2023
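
The abstract mentions two mechanical steps that a short example can make concrete: converting predicted polygon vertices into a segmentation mask, and the rounding error that coordinate quantization introduces. The sketch below is illustrative only and is not the authors' code. The helper names (`polygon_to_mask`, `quantize_coord`), the even-odd scanline rasterizer, and the uniform binning scheme are all assumptions chosen for clarity.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel units, floating point


def polygon_to_mask(vertices: Sequence[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize one closed polygon into a binary mask (even-odd rule).

    Hypothetical helper: a pixel is foreground when its center
    (col + 0.5, row + 0.5) lies inside the polygon.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5
        # x positions where the horizontal line y = py crosses a polygon edge
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
            if (y0 <= py) != (y1 <= py):  # edge straddles the scanline
                crossings.append(x0 + (py - y0) * (x1 - x0) / (y1 - y0))
        crossings.sort()
        # Pixel centers between each consecutive pair of crossings are inside
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(width):
                if left <= col + 0.5 < right:
                    mask[row][col] = 1
    return mask


def quantize_coord(x: float, size: float, num_bins: int) -> float:
    """Snap a coordinate in [0, size] to the nearest of num_bins uniform levels.

    Assumed generic scheme, used only to show the rounding error that a
    regression-based (floating-point) output avoids.
    """
    step = size / (num_bins - 1)
    return round(x / step) * step


# A 3x3 axis-aligned square inside a 6x6 image
square = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
mask = polygon_to_mask(square, height=6, width=6)
area = sum(sum(row) for row in mask)  # 9 foreground pixels

# Quantization error is bounded by half a bin width
x = 123.4
error = abs(quantize_coord(x, size=512.0, num_bins=64) - x)
```

Under this assumed binning scheme the worst-case error per coordinate is `size / (2 * (num_bins - 1))`, which is the kind of localization error that predicting floating-point coordinates directly is meant to remove.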

Related benchmarks

| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 84.98 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 90.38 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 92.89 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 85.91 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 85.83 | 300 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 76.94 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 78.5 | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 66.73 | 252 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 77.97 | 244 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 75.71 | 230 |
Showing 10 of 75 rows
...
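
The segmentation rows report two different IoU variants, which are not directly comparable. The sketch below assumes the definitions commonly used in referring segmentation benchmarks: mIoU averages the per-sample IoU, while cIoU (cumulative IoU, sometimes called overall IoU) divides the total intersection by the total union across all samples. The page itself does not define these metrics, so treat the function names and the empty-mask convention as assumptions.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def _inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels between two same-sized masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def mean_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """mIoU: average of per-sample IoU, so every sample counts equally."""
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = _inter_union(pred, gt)
        # Assumption: two empty masks count as a perfect match
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cumulative_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """cIoU: total intersection over total union, so large objects weigh more."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = _inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
m = mean_iou(preds, gts)        # per-sample IoUs are 0.5 and 1.0
c = cumulative_iou(preds, gts)  # 2 intersecting pixels over 3 union pixels
```

Because cIoU pools pixels before dividing, it favors large objects, whereas mIoU treats every referring expression equally. That difference is one reason the two columns can disagree for the same model and split.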
