Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

About

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
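The equivalence the abstract relies on can be illustrated with a small sketch. It is not the authors' implementation. It assumes that mask scoring is a plain dot product between queries and patch tokens and that upsampling is a fixed linear interpolation, here simplified to a 1D token sequence. All names (`matmul`, `linear_upsample_matrix`, `queries`, `tokens`) and all values are hypothetical. Under those assumptions, scoring first and interpolating the logits gives the same result as interpolating the features and then scoring, while the interpolation touches one value per query instead of one value per channel.

```python
# Illustrative sketch only: names, shapes, and the 1D token layout are assumptions,
# not taken from the TokenMask paper or code.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def linear_upsample_matrix(n_in, n_out):
    """Build an (n_out x n_in) matrix of 1D linear interpolation weights."""
    rows = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)   # position on the input grid
        lo = min(int(pos), n_in - 2)          # left neighbour index
        frac = pos - lo                       # weight of the right neighbour
        row = [0.0] * n_in
        row[lo] = 1.0 - frac
        row[lo + 1] = frac
        rows.append(row)
    return rows

# Toy data: 2 queries and 4 patch tokens, each with 3 channels.
queries = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
tokens = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, -1.0, 1.0], [0.0, 0.5, 0.5]]
U = linear_upsample_matrix(n_in=4, n_out=7)

# Feature-space route: upsample the token features, then score against queries.
upsampled_tokens = matmul(U, tokens)                                  # 7 x 3
logits_feature_space = matmul(queries, transpose(upsampled_tokens))   # 2 x 7

# Token-space route: score at token resolution, then upsample the logits.
token_logits = matmul(queries, transpose(tokens))                     # 2 x 4
logits_token_space = matmul(token_logits, transpose(U))               # 2 x 7

# Both routes agree up to floating-point rounding because every step is linear.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(logits_feature_space, logits_token_space)
    for a, b in zip(row_a, row_b)
)
print(max_diff)
```

The agreement holds only because both the scoring and the interpolation are linear. A mask head that applies learned or nonlinear upsampling to the features would not commute in this way.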

Calvin Galagain, Martyna Poreba, François Goulette • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | ADE20K (val) | mIoU 53.1 | 3069 |
| Semantic segmentation | Cityscapes | mIoU 82.9 | 494 |
| Panoptic segmentation | ADE20K (val) | PQ 45.1 | 99 |
| Instance segmentation | COCO Panoptic | mAP 37.4 | 8 |
| Panoptic segmentation | COCO Panoptic | PQ 50.9 | 8 |