
ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

About

This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) in the first network layer, it merges similar tokens within a small local window, and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in these two situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves throughput, by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
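The two-stage idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked code for that). The function names, the similarity threshold `tau`, the 2x2 window, the "all pairs in the window must be similar" rule, and the greedy grouping used for the global stage are all assumptions made for illustration. The sketch also applies both stages back to back on the same features, whereas the method places transformer layers between them.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a (n, d) and b (m, d)."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T


def local_merge(tokens, height, width, window=2, tau=0.9):
    """Stage 1 (assumed rule): replace a window by its mean token when every
    pair of tokens inside the window has cosine similarity >= tau."""
    grid = tokens.reshape(height, width, -1)
    out = []
    for i in range(0, height, window):
        for j in range(0, width, window):
            block = grid[i:i + window, j:j + window].reshape(-1, grid.shape[-1])
            if cosine_similarity(block, block).min() >= tau:
                out.append(block.mean(axis=0, keepdims=True))  # one merged token
            else:
                out.append(block)  # keep all tokens of this window
    return np.concatenate(out, axis=0)


def global_merge(tokens, tau=0.9):
    """Stage 2 (assumed rule): greedy grouping over all tokens. A token joins
    the most similar existing group if the similarity to that group's mean is
    >= tau; otherwise it starts a new group. Returns the group means."""
    sums, counts = [], []
    for tok in tokens:
        if sums:
            means = np.stack(sums) / np.array(counts)[:, None]
            sims = cosine_similarity(tok[None, :], means)[0]
            best = int(sims.argmax())
            if sims[best] >= tau:
                sums[best] = sums[best] + tok
                counts[best] += 1
                continue
        sums.append(tok.astype(float))
        counts.append(1)
    return np.stack(sums) / np.array(counts)[:, None]


# Toy 4x4 token grid with 3-dimensional features: the left half is uniform,
# the right half alternates between two orthogonal features.
grid = np.zeros((4, 4, 3))
grid[:, :2, 0] = 1.0
grid[:, 2, 1] = 1.0
grid[:, 3, 2] = 1.0
tokens = grid.reshape(16, 3)

after_local = local_merge(tokens, height=4, width=4)
after_global = global_merge(after_local)
print(tokens.shape, after_local.shape, after_global.shape)
```

In this toy input, only the uniform windows are merged locally, and the global stage then groups the remaining tokens regardless of their position. Raising or lowering `tau` changes how many tokens survive, which is one simple way to trade accuracy against speed in a sketch like this.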

Narges Norouzi, Svetlana Orlova, Daan de Geus, Gijs Dubbelman • 2024

Related benchmarks

Task                        Dataset                 Result            Rank
Semantic segmentation       ADE20K (val)            mIoU 52.7         2888
Video Object Segmentation   DAVIS 2017 (val)        --                1193
Semantic segmentation       Cityscapes              mIoU 75.24        658
Semantic segmentation       Cityscapes (val)        --                572
Semantic segmentation       Cityscapes              mIoU 76.9         218
Semantic segmentation       Pascal Context          mIoU 52.97        217
Semantic segmentation       Pascal Context (test)   mIoU 58           191
Video Object Segmentation   SA-V (val)              J&F Score 55.7    114
Video Object Segmentation   SA-V (test)             J&F 56.5          110
Semantic segmentation       COCOStuff 164k (val)    --                47
Showing 10 of 20 rows
