
Efficient Token Pruning for LLaDA-V

About

Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step, aligning with LLaDA-V's delayed attention aggregation to maintain output quality; because pruning is applied at the first step, the computational savings carry over to all subsequent denoising steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% of task performance.
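The core mechanism described above (FastV-style pruning: rank visual tokens by the attention they receive at a designated layer, then keep only the top fraction) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, tensor shapes, and the choice of averaging attention over heads and query positions are assumptions.

```python
import numpy as np

def prune_visual_tokens(attn, visual_idx, keep_ratio=0.35):
    """FastV-style structured token pruning (illustrative sketch).

    Ranks visual tokens by the average attention they receive and keeps
    the top `keep_ratio` fraction, discarding the rest.

    attn       : ndarray of shape (heads, seq_len, seq_len), the attention
                 map at the pruning layer (row = query, column = key).
    visual_idx : sequence of positions in the token stream that are
                 visual tokens.
    Returns the kept visual-token positions, in their original order.
    """
    visual_idx = np.asarray(visual_idx)
    # Average attention each token receives, over heads and query positions,
    # then restrict to the visual tokens.
    received = attn.mean(axis=(0, 1))[visual_idx]          # (n_visual,)
    n_keep = max(1, int(round(keep_ratio * len(visual_idx))))
    top = np.argsort(received)[::-1][:n_keep]              # highest-attended first
    return np.sort(visual_idx[top])                        # restore stream order

# Toy usage: 6 tokens, positions 2-4 are visual; all attention flows to
# token 3, so with keep_ratio ~ 1/3 only token 3 survives.
attn = np.zeros((2, 6, 6))
attn[:, :, 3] = 1.0
kept = prune_visual_tokens(attn, visual_idx=[2, 3, 4], keep_ratio=0.34)
# → kept is [3]
```

In the paper's setting this pruning would be applied once, at a middle-to-late layer of the first denoising step, so the shortened sequence is reused across all remaining steps.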

Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang • 2026

Related benchmarks

Task                      | Dataset     | Metric       | Result  | Rank
Multimodal Evaluation     | MME         | Score        | 1.91e+3 | 557
Multimodal Understanding  | MMMU        | Accuracy     | 49.89   | 275
Visual Question Answering | ChartQA     | Accuracy     | 77      | 239
Multimodal Understanding  | MMStar      | Accuracy     | 58.67   | 197
Visual Question Answering | AI2D        | Accuracy     | 77.75   | 174
Visual Question Answering | RealworldQA | Accuracy     | 64.84   | 98
Multimodal Understanding  | MMMU-Pro    | Vis Accuracy | 17.57   | 20
