
Efficient Token Pruning for LLaDA-V

About

Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step, aligning with LLaDA-V's delayed attention aggregation to maintain output quality; because pruning is applied at the first step, the computational savings carry over to all subsequent denoising steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% of task performance.
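The core mechanism described above (FastV-style pruning: rank visual tokens by the attention they receive at a designated layer, then keep only the top fraction) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, tensor shapes, and the choice of averaging attention over heads and query positions are assumptions.

```python
import numpy as np

def prune_visual_tokens(attn, visual_idx, keep_ratio=0.35):
    """FastV-style structured token pruning (illustrative sketch).

    Ranks visual tokens by the average attention they receive and keeps
    the top `keep_ratio` fraction, discarding the rest.

    attn       : ndarray of shape (heads, seq_len, seq_len), the attention
                 map at the pruning layer (row = query, column = key).
    visual_idx : sequence of positions in the token stream that are
                 visual tokens.
    Returns the kept visual-token positions, in their original order.
    """
    visual_idx = np.asarray(visual_idx)
    # Average attention each token receives, over heads and query positions,
    # then restrict to the visual tokens.
    received = attn.mean(axis=(0, 1))[visual_idx]          # (n_visual,)
    n_keep = max(1, int(round(keep_ratio * len(visual_idx))))
    top = np.argsort(received)[::-1][:n_keep]              # highest-attended first
    return np.sort(visual_idx[top])                        # restore stream order

# Toy usage: 6 tokens, positions 2-4 are visual; all attention flows to
# token 3, so with keep_ratio ~ 1/3 only token 3 survives.
attn = np.zeros((2, 6, 6))
attn[:, :, 3] = 1.0
kept = prune_visual_tokens(attn, visual_idx=[2, 3, 4], keep_ratio=0.34)
# → kept is [3]
```

In the paper's setting this pruning would be applied once, at a middle-to-late layer of the first denoising step, so the shortened sequence is reused across all remaining steps.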

Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang • 2026

Related benchmarks

Task                      | Dataset     | Metric       | Result  | Rank
Multimodal Evaluation     | MME         | Score        | 1.91e+3 | 557
Multimodal Understanding  | MMMU        | Accuracy     | 49.89   | 275
Visual Question Answering | ChartQA     | Accuracy     | 77      | 239
Multimodal Understanding  | MMStar      | Accuracy     | 58.67   | 197
Visual Question Answering | AI2D        | Accuracy     | 77.75   | 174
Visual Question Answering | RealworldQA | Accuracy     | 64.84   | 98
Multimodal Understanding  | MMMU-Pro    | Vis Accuracy | 17.57   | 20
