PEM: Prototype-based Efficient MaskFormer for Image Segmentation

About

Recent transformer-based architectures have shown impressive results in the field of image segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple segmentation tasks, such as semantic and panoptic, under a single unified framework. To achieve such impressive performance, these architectures employ intensive operations and require substantial computational resources, which are often not available, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM uses a novel prototype-based cross-attention which leverages the redundancy of visual features to restrict the computation and improve the efficiency without harming the performance. In addition, PEM introduces an efficient multi-scale feature pyramid network, capable of extracting features with high semantic content efficiently, thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks, semantic and panoptic segmentation, evaluated on two different datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, or even better than, computationally expensive baselines.
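The abstract does not spell out how the prototype-based cross-attention works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each query picks its single most similar visual feature as a "prototype" (dot-product similarity, argmax selection) and then interacts with that prototype through an element-wise product. The function names `select_prototypes` and `prototype_cross_attention` are hypothetical, and the learned projections a real layer would contain are omitted.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_prototypes(queries: List[Vector], features: List[Vector]) -> List[int]:
    """For each query, return the index of its most similar feature.

    Assumption: similarity is the dot product and exactly one
    prototype is kept per query.
    """
    indices = []
    for q in queries:
        scores = [dot(q, f) for f in features]
        indices.append(max(range(len(features)), key=scores.__getitem__))
    return indices


def prototype_cross_attention(
    queries: List[Vector], features: List[Vector]
) -> List[Vector]:
    """Combine each query with its selected prototype element-wise.

    Assumption: the query/prototype interaction is an element-wise
    product; learned projections are left out of this sketch.
    """
    chosen = select_prototypes(queries, features)
    return [
        [qi * pi for qi, pi in zip(q, features[i])]
        for q, i in zip(queries, chosen)
    ]


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 2.0]]
    features = [[1.0, 0.0], [0.0, 1.0], [3.0, -1.0]]
    print(select_prototypes(queries, features))
    print(prototype_cross_attention(queries, features))
```

Under these assumptions, the similarity scores are still computed against every feature, but the step that follows touches only one feature per query instead of a weighted sum over all of them, which is where a saving from feature redundancy could come from.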

Niccolò Cavagnero, Gabriele Rosi, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, Fabio Cermelli • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 45	3069
Instance Segmentation	COCO 2017 (val)	--	1275
Semantic segmentation	ADE20K	mIoU 45.5	1028
Semantic segmentation	Cityscapes	mIoU 79.9	668
Semantic segmentation	ADE20K	mIoU 45.5	559
Panoptic Segmentation	COCO 2017 (val)	PQ 46.38	185
Semantic segmentation	COCO 2017 (val)	mIoU 55.95	66
Panoptic Segmentation	Cityscapes	PQ 61.07	48
Semantic segmentation	Cityscapes (val)	mIoU 79	27
Semantic segmentation	Cityscapes (val)	mIoU 79	18

