Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

About

Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overhead poses a major challenge for practical deployment. Some recent works accelerate VLMs by pruning redundant visual tokens, guided by the attention maps of the VLM's early layers. Despite their success, these token pruning methods still suffer from two major shortcomings: (i) a considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address these limitations, we present TwigVLM, a simple and general architecture that grows a lightweight module, named a twig, on an early layer of the base VLM. Compared with most existing VLM acceleration methods, which rely purely on visual token pruning, TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves a 154% speedup in generating long responses, delivering significantly better accuracy and speed than state-of-the-art VLM acceleration methods. Moreover, we extend TwigVLM to an improved variant, TwigVLM++, by introducing a novel multi-head twig architecture with a specialized pruning head. TwigVLM++ improves pruning quality via a two-stage training paradigm that combines a distillation learning stage with a pruning-oriented reinforcement learning stage, and further accelerates inference via a tree-based SSD strategy.
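To make the pruning ratio concrete, the following is a minimal sketch of score-guided top-k visual token pruning. It is not the authors' TTP implementation: the function name `prune_visual_tokens`, the stand-in scores, and the rounding rule are all assumptions made for illustration. It also assumes the commonly cited figure of 576 visual tokens per image for LLaVA-1.5 (a 24 x 24 patch grid), which the abstract itself does not state.

```python
def prune_visual_tokens(scores, keep_ratio):
    """Return indices of the visual tokens to keep, in their original order.

    scores: one importance score per visual token (hypothetical; in the
            abstract these would come from attention in the twig).
    keep_ratio: fraction of tokens to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top-scoring tokens, restored to their original positions.
    return sorted(ranked[:num_keep])


# Assumption: 576 visual tokens per image (24 x 24 patches).
num_tokens = 576
# Stand-in scores for illustration only; real scores come from the model.
scores = [((i * 37) % 101) / 100 for i in range(num_tokens)]

# Pruning 88.9% of the tokens keeps roughly 11.1% of them.
kept = prune_visual_tokens(scores, keep_ratio=1 - 0.889)
print(len(kept))
```

Under these assumptions, pruning 88.9% of 576 tokens leaves 64 tokens for the later layers to process. Returning the kept indices in their original order is a design choice of this sketch, so that positional information stays consistent after pruning.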

Zhenwei Shao, Mingyang Wang, Weijun Zhang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Jun Yu • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 82.7	2019
Visual Question Answering	VQA v2	Accuracy 75.6	1429
Visual Question Answering	GQA	Accuracy 63.4	1425
Text-based Visual Question Answering	TextVQA	Accuracy 58.6	962
Multimodal Understanding	MMBench	Accuracy 67.6	847
Visual Question Answering	GQA	Accuracy 61.2	524
Optical Character Recognition	OCRBench	Score 825	433
Multimodal Understanding	MMStar	--	407
Visual Question Answering	VQA v2	Accuracy 81.2	333
Multimodal Perception and Cognition	MME	Overall Score 1.87e+3	270

Showing 10 of 38 rows
