PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

About

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
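Two ideas in the abstract can be made concrete with a short sketch: converting a predicted polygon into a segmentation mask, and the error introduced when coordinates are quantized to a fixed grid rather than regressed as floating-point values. The code below is an illustrative assumption, not the PolyFormer implementation. The function names `polygon_to_mask` and `quantize`, the even-odd fill rule sampled at pixel centers, and the uniform binning scheme are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel units


def polygon_to_mask(vertices: Sequence[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize a closed polygon into a binary mask (even-odd rule).

    A pixel (row, col) is foreground when its center
    (col + 0.5, row + 0.5) lies inside the polygon.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5
        for col in range(width):
            px = col + 0.5
            inside = False
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Does this edge cross the horizontal line through the pixel center?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


def quantize(vertices: Sequence[Point], num_bins: int, size: float) -> List[Point]:
    """Snap each coordinate to one of `num_bins` uniform levels over [0, size].

    This mimics a classification-style coordinate vocabulary (an assumption
    for illustration); the rounding error is at most half a bin width.
    """
    step = size / (num_bins - 1)
    return [(round(x / step) * step, round(y / step) * step) for x, y in vertices]


# A 3x2 rectangle inside a 4-row, 6-column image.
rect = [(1.0, 1.0), (4.0, 1.0), (4.0, 3.0), (1.0, 3.0)]
mask = polygon_to_mask(rect, height=4, width=6)
foreground = sum(sum(row) for row in mask)

# Floating-point vertices lose precision once snapped to a coarse grid.
snapped = quantize([(1.3, 2.7)], num_bins=5, size=4.0)
```

In practice a library rasterizer would replace the per-pixel loop; the pure-Python version is used here only to keep the example self-contained.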

Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha • 2023

Related benchmarks

Task	Dataset	Result
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy 84.98	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy 90.38	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy 92.89	346
Referring Expression Segmentation	RefCOCO (testA)	cIoU 78.5	315
Referring Expression Comprehension	RefCOCOg (test)	Accuracy 85.91	300
Referring Expression Comprehension	RefCOCOg (val)	Accuracy 85.83	300
Referring Expression Segmentation	RefCOCO+ (testA)	cIoU 75.71	288
Referring Image Segmentation	RefCOCO (val)	mIoU 76.94	274
Referring Expression Segmentation	RefCOCO+ (val)	cIoU 72.2	272
Referring Image Segmentation	RefCOCO+ (test-B)	mIoU 66.73	267

Showing 10 of 81 rows
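The table reports both mIoU and cIoU without defining them. The sketch below shows how the two are commonly computed in the referring segmentation literature: mIoU averages the per-sample IoU, while cIoU (cumulative IoU, sometimes called overall IoU) divides the total intersection by the total union over the whole dataset, so large objects weigh more. These definitions are stated here as an assumption about the convention behind the table, and the function names and the handling of empty masks are hypothetical choices for illustration.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count foreground pixels in the intersection and union of two masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """mIoU: average of the per-sample IoU values."""
    ious: List[float] = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cumulative_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """cIoU: total intersection over total union across all samples."""
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# One large object segmented perfectly, one small object missed entirely.
pairs = [
    ([[1, 1, 1, 1]], [[1, 1, 1, 1]]),  # IoU = 1.0
    ([[1, 0]], [[0, 1]]),              # IoU = 0.0
]
m_iou = mean_iou(pairs)        # (1.0 + 0.0) / 2
c_iou = cumulative_iou(pairs)  # 4 / 6
```

The toy example shows why the two numbers can diverge: the missed small object halves mIoU but costs cIoU only a third, because cIoU is weighted by pixel count.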
