PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
About
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
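
The abstract notes that predicted polygons can later be converted into segmentation masks. The following is a minimal sketch of one way to do that conversion, not the authors' implementation: it rasterizes floating-point polygon vertices with an even-odd point-in-polygon test at each pixel center. The function names `point_in_polygon` and `polygons_to_mask`, and the pixel-center convention, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates, floating point


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd rule: count edge crossings of a ray cast to the right of (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # The edge crosses the horizontal line through y only if its
        # endpoints lie on opposite sides of that line.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygons_to_mask(
    polygons: Sequence[Sequence[Point]], height: int, width: int
) -> List[List[int]]:
    """Rasterize one or more polygons into a binary mask (union of polygons)."""
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Assumption: a pixel is foreground if its center is inside a polygon.
            cx, cy = col + 0.5, row + 0.5
            if any(point_in_polygon(cx, cy, poly) for poly in polygons):
                mask[row][col] = 1
    return mask


# Hypothetical example: one square polygon on a 6x6 image.
square = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
example_mask = polygons_to_mask([square], height=6, width=6)
```

This pure-Python loop is only meant to make the idea concrete. In practice, a library rasterizer would likely be used for speed.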
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 84.98 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 90.38 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 0.9289 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 85.91 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 85.83 | 300 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 76.94 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 78.5 | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 66.73 | 252 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 77.97 | 244 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 75.71 | 230 |
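
The table reports both mIoU and cIoU for segmentation. As a rough sketch of how the two differ, assuming the common definitions (mIoU averages the per-sample IoU, while cIoU divides the total intersection by the total union over all samples), the snippet below computes both on binary masks. The function names and the choice to score an empty prediction against an empty ground truth as 1.0 are assumptions for illustration, not taken from the benchmark's evaluation code.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """mIoU: average of per-sample IoU, so every sample counts equally."""
    ious = []
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cumulative_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """cIoU: total intersection over total union, so large objects weigh more."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Hypothetical toy data: a small object predicted poorly, a large one perfectly.
samples = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # IoU = 4/4
]
m_iou = mean_iou(samples)        # (0.5 + 1.0) / 2
c_iou = cumulative_iou(samples)  # (1 + 4) / (2 + 4)
```

Both functions return fractions in [0, 1]; multiplying by 100 gives values on the percentage scale used by most rows in the table.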