StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models
About
Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose **StructSAM**, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
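The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration of the energy-scoring, flatness-screening, and merge-unmerge steps, not the actual StructSAM implementation: the function names, the grid size, the flatness threshold, and the mean-pooling merge rule are all illustrative assumptions.

```python
import numpy as np

def token_energy(feats):
    """Lightweight token-energy score from first-order feature gradients.
    feats: (H, W, C) grid of encoder token features."""
    gy = np.abs(np.diff(feats, axis=0, prepend=feats[:1, :]))
    gx = np.abs(np.diff(feats, axis=1, prepend=feats[:, :1]))
    return (gy + gx).mean(axis=-1)  # (H, W) per-token energy

def merge_flat_cells(feats, grid=2, flat_thresh=0.1):
    """Grid-based flatness screening plus merging toward low-energy
    destinations. Returns the merged grid and a recovery list so the
    full resolution can be restored (unmerged) afterwards."""
    H, W, C = feats.shape
    energy = token_energy(feats)
    merged = feats.copy()
    recovery = []  # ((i, j), original cell features) for token recovery
    for i in range(0, H - grid + 1, grid):
        for j in range(0, W - grid + 1, grid):
            cell_e = energy[i:i+grid, j:j+grid]
            # Flatness screening: keep boundary/prompt regions untouched
            if cell_e.max() > flat_thresh:
                continue
            cell = feats[i:i+grid, j:j+grid]
            # Destination = lowest-energy token in the cell
            di, dj = np.unravel_index(cell_e.argmin(), cell_e.shape)
            pooled = cell.reshape(-1, C).mean(axis=0)
            recovery.append(((i, j), cell.copy()))
            merged[i:i+grid, j:j+grid] = pooled  # all tokens map to dest
            merged[i+di, j+dj] = pooled
    return merged, recovery

def unmerge(merged, recovery, alpha=1.0):
    """Explicit token recovery: blend original per-token detail back in."""
    out = merged.copy()
    for (i, j), cell in recovery:
        g = cell.shape[0]
        out[i:i+g, j:j+g] = alpha * cell + (1 - alpha) * out[i:i+g, j:j+g]
    return out
```

In a real encoder, the merged cells would be represented by their destination tokens inside self-attention (reducing the sequence length), and `unmerge` would run before the mask decoder so the dense, prompt-conditioned features keep their original resolution.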
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Abnormalities Segmentation | INbreast | Dice 74.81 | 16 |
| Segmentation | ThinObject5K (test) | mIoU 75.8 | 10 |
| Segmentation | DIS5K | mIoU 61.01 | 10 |
| Segmentation | COIFT | mIoU 90.73 | 10 |
| Segmentation | HRSOD | mIoU 88.39 | 10 |