
Token Merging: Your ViT But Faster

About

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
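The merging step described above can be sketched in a few lines. Below is a minimal NumPy illustration of ToMe-style bipartite soft matching: tokens are split alternately into two sets, each token in the first set proposes an edge to its most similar partner in the second, and the `r` highest-scoring pairs are merged by averaging. This is a simplified sketch, not the authors' implementation; the function name, the alternating split, and plain mean-merging (the paper weights merges by token size) are illustrative assumptions.

```python
import numpy as np

def bipartite_soft_matching(tokens, r):
    """Merge away the r most similar tokens via bipartite soft matching.

    tokens: (N, d) array of token features.
    r: number of tokens to remove.
    Returns an array of N - r tokens.
    """
    # Split tokens alternately into two disjoint sets A and B.
    a, b = tokens[::2].copy(), tokens[1::2].copy()

    # Cosine similarity between every token in A and every token in B.
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = an @ bn.T                      # shape (|A|, |B|)

    # Each A-token proposes a single edge to its most similar B-token.
    best = scores.argmax(axis=1)
    best_score = scores.max(axis=1)

    # Keep only the r highest-scoring edges; those A-tokens are merged away.
    merged_idx = np.argsort(-best_score)[:r]
    unmerged_idx = np.setdiff1d(np.arange(len(a)), merged_idx)

    # Merge each chosen A-token into its B partner by averaging features.
    for i in merged_idx:
        b[best[i]] = (b[best[i]] + a[i]) / 2

    # Surviving tokens: unmerged A-tokens plus the (updated) B-tokens.
    return np.concatenate([a[unmerged_idx], b], axis=0)
```

Because each A-token proposes at most one edge, the matching needs no iterative assignment and stays as cheap as pruning, while the averaged features retain information from both merged tokens.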

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy 76 | 1165 |
| Video Object Segmentation | DAVIS 2017 (val) | -- | 1130 |
| Visual Question Answering | TextVQA | Accuracy 80.22 | 1117 |
| Visual Question Answering | VizWiz | Accuracy 55.9 | 1043 |
| Visual Question Answering | GQA | Accuracy 64.49 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy 72.4 | 935 |
| Multimodal Evaluation | MME | Score 1470 | 557 |
| Image Classification | ImageNet-1k (val) | -- | 512 |
| Text-based Visual Question Answering | TextVQA | Accuracy 45.3 | 496 |
| Image Classification | DTD | Accuracy 69.9 | 487 |
Showing 10 of 84 rows

Other info

Code
