MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

About

The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by retaining $N$ elements out of every $M$ consecutive weights, thereby enabling hardware-friendly acceleration and reduced memory use. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which learns a prior categorical distribution for every $M$ consecutive weights and then uses this distribution to generate the (N:M)-sparsity through $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the extremely large combinatorial space, we propose a novel update method that uses a moving average tracker of loss residuals instead of the vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
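To make the sampling step concrete, here is a minimal sketch of drawing an (N:M) mask by N-way sampling without replacement from a categorical distribution over each group of M consecutive weights. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_nm_mask` and `sparsify` are hypothetical, the distribution is assumed to be parameterized by per-position logits passed through a softmax, and the training update (policy gradients with the moving average tracker) is not shown.

```python
import math
import random


def sample_nm_mask(logits, n, rng):
    """Draw a 0/1 mask that keeps exactly n of len(logits) positions.

    Positions are drawn one at a time, without replacement, from the
    softmax distribution restricted to the positions not yet chosen.
    The logit parameterization is an assumption made for this sketch.
    """
    m = len(logits)
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= len(logits)")
    remaining = list(range(m))
    mask = [0] * m
    for _ in range(n):
        # Softmax weights over the remaining positions (shifted for stability).
        peak = max(logits[i] for i in remaining)
        weights = [math.exp(logits[i] - peak) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        mask[pick] = 1
        remaining.remove(pick)
    return mask


def sparsify(weights, group_logits, n, m, rng):
    """Apply an independently sampled N:M mask to each group of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    pruned = []
    for g in range(len(weights) // m):
        mask = sample_nm_mask(group_logits[g], n, rng)
        group = weights[g * m:(g + 1) * m]
        pruned.extend(w * keep for w, keep in zip(group, mask))
    return pruned


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.3, -1.2, 0.8, 0.05, 1.1, -0.4, 0.2, 0.9]
    group_logits = [[0.0, 2.0, 1.0, -1.0], [1.5, 0.0, -0.5, 1.0]]
    print(sparsify(weights, group_logits, n=2, m=4, rng=rng))
```

In this sketch each group stores only M logits, so the parameter count grows linearly with the number of weights rather than with the number of possible masks per group.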

Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Language Modeling	WikiText-2	Perplexity (PPL)	13.73	2862
Commonsense Reasoning	WinoGrande	Accuracy	68.43	1581
Question Answering	ARC Challenge	Accuracy (ARC)	36.89	631
Physical Interaction Question Answering	PIQA	Accuracy	74.72	462
Mathematical Reasoning	MathQA	Accuracy	26.76	354
Question Answering	OpenBookQA	Accuracy	29.8	319
Word Sense Disambiguation	WiC	Avg Accuracy	49.84	261
Logical Reasoning	LogiQA	LogiQA Accuracy	22.89	251
Question Answering	ARC Easy	Accuracy	69.51	246
Natural Language Inference	CB	Accuracy	57.14	129

Showing 10 of 32 rows
