
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

About

The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2× faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%.
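The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fusion rule, so everything below is an assumption made for illustration: the function names, the convex combination weighted by `alpha`, the EMA `decay`, and top-k selection by a fixed `keep_ratio` are hypothetical and may differ from what ReGATE actually does.

```python
from typing import List


def update_ema(ema: List[float], student_loss: List[float], decay: float = 0.9) -> List[float]:
    """Update a per-token exponential moving average of the student's loss.

    Assumption: the student's "difficulty estimate" is its own per-token loss.
    """
    return [decay * e + (1.0 - decay) * s for e, s in zip(ema, student_loss)]


def fuse_scores(teacher_loss: List[float], ema_difficulty: List[float], alpha: float = 0.5) -> List[float]:
    """Fuse teacher guidance losses with the student's EMA difficulty.

    Assumption: a simple convex combination; the paper's rule may differ.
    """
    return [alpha * t + (1.0 - alpha) * d for t, d in zip(teacher_loss, ema_difficulty)]


def select_tokens(scores: List[float], keep_ratio: float = 0.5) -> List[int]:
    """Return the positions of the highest-scoring tokens, in original order.

    Assumption: a fixed keep ratio with top-k selection; at least one token is kept.
    """
    k = max(1, round(keep_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy example with four tokens (all values are made up).
teacher_loss = [0.2, 1.5, 0.1, 0.9]   # per-token loss from the frozen teacher
student_loss = [1.0, 0.4, 0.2, 2.0]   # per-token loss from the student this step
ema = update_ema([0.0] * 4, student_loss, decay=0.9)
scores = fuse_scores(teacher_loss, ema, alpha=0.5)
kept = select_tokens(scores, keep_ratio=0.5)
print(kept)  # [1, 3]
```

In this toy setup, only the tokens at the kept positions would be passed through the student's forward pass for the step; the remaining tokens are skipped, which is where the compute saving would come from.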

Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 93.1 | 2019 |
| Visual Question Answering | VizWiz | Accuracy 60.5 | 1820 |
| Video Understanding | MVBench | Accuracy 53.6 | 563 |
| Multimodal Perception and Cognition | MME | -- | 270 |
| Long Video Understanding | LongVideoBench | Score 58 | 269 |
| Multimodal Understanding | SEED | Accuracy 76.6 | 216 |
| Long Video Understanding | MLVU | -- | 205 |
| Video Understanding | EgoSchema | -- | 185 |
| Video Understanding | MLVU | Accuracy 54.5 | 114 |
| Multimodal Understanding | SEED-I Image | Accuracy 0.766 | 75 |
Showing 10 of 19 rows
