Accelerating Sparse Transformer Inference on GPU

About

Large language models (LLMs) are widely used thanks to their strong understanding capabilities. Because the Transformer is the core component of LLMs, accelerating it through parallelization has become an active research topic. Mask layers introduce sparsity into the Transformer and reduce computation. However, prior work rarely focuses on optimizing the performance of sparse Transformers, and current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework that optimizes Sparse Transformer inference on GPU through flexible masking and Operator Fusion. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or block-wise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through a two-stage search. Experimental results show that, compared to the state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun • 2025
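
The abstract's point that mask layers reduce computation can be illustrated with a small sketch. The code below is not STOF's GPU kernel or storage format; it is a plain-Python illustration, under the assumption of a causal mask in which every query row keeps at least one key, of how a row-wise sparse layout (a list of kept key indices per query row) avoids computing masked scores while matching a dense masked reference. All function names (`causal_mask`, `dense_masked_attention`, `rowwise_sparse_attention`) are hypothetical.

```python
import math


def causal_mask(n):
    """Hypothetical example mask: query i may attend to keys 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def _softmax_weighted_sum(scores, values):
    """Numerically stable softmax over scores, applied to the value rows."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * row[c] for w, row in zip(weights, values)) / total
            for c in range(width)]


def dense_masked_attention(q, k, v, mask):
    """Reference: compute all n*n scores, then set masked ones to -inf."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = [
            sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j in range(n)
        ]
        out.append(_softmax_weighted_sum(scores, v))
    return out


def rowwise_sparse_attention(q, k, v, mask):
    """Row-wise sketch: only kept (row, key) pairs are ever scored.

    Assumes each row of the mask keeps at least one key.
    Returns the output rows and the number of scores computed.
    """
    n, d = len(q), len(q[0])
    # Compressed per-row storage of the mask: kept key indices for each query.
    kept = [[j for j in range(n) if mask[i][j]] for i in range(n)]
    out, n_scores = [], 0
    for i, cols in enumerate(kept):
        scores = [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
                  for j in cols]
        n_scores += len(scores)
        out.append(_softmax_weighted_sum(scores, [v[j] for j in cols]))
    return out, n_scores


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = causal_mask(4)

dense = dense_masked_attention(q, k, v, mask)
sparse, n_scores = rowwise_sparse_attention(q, k, v, mask)
print(n_scores, "scores computed instead of", len(q) * len(k))
```

With a 4x4 causal mask the sparse path scores 10 query-key pairs where the dense path scores 16. A block-wise variant would apply the same idea at tile granularity, skipping tiles whose mask entries are all zero; how STOF chooses between the two is described in the paper, not here.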

Related benchmarks

Task	Dataset	Result
End-to-end inference tuning	BERT base	Tuning Time (s): 23.3	9
End-to-end inference tuning	BERT large	Tuning Time (s): 22.6	9
End-to-end inference tuning	GPT	Tuning Time (s): 23.8	9
End-to-end inference tuning	LLAMA	Tuning Time (s): 29.5	9
End-to-end inference tuning	T5	Tuning Time (s): 43.1	9
End-to-end inference tuning	ViT	Tuning Time (s): 93.9	9
