Accelerating Sparse Transformer Inference on GPU

About

Large language models (LLMs) are widely used because of their strong language understanding capabilities. Accelerating the Transformer, the core component of LLMs, through parallelization has become an active research topic. Mask layers introduce sparsity into the Transformer and thereby reduce the amount of computation. However, previous work rarely focuses on performance optimization of the sparse Transformer. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework that optimizes Sparse Transformer inference on GPU through flexible masking and Operator Fusion. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or block-wise kernels, each with its own storage format, according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through a two-stage search. Experimental results show that, compared to the state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun • 2025
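
The abstract's observation that mask layers introduce sparsity, and that the computation can be organized row-wise or block-wise, can be illustrated with a small pure-Python sketch. This is not STOF's implementation: the sliding-window mask, the function names (`sliding_window_mask`, `masked_attention_row`, `nonempty_block_fraction`), and the tile size are all assumptions chosen for illustration. The first helper computes attention for one query row while visiting only unmasked keys. The second counts how many tiles of the mask contain any work at all, which is the quantity a block-wise kernel could exploit by skipping empty tiles.

```python
import math


def sliding_window_mask(seq_len, window):
    """Hypothetical mask: query i may attend to key j when |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def masked_attention_row(q_row, keys, values, mask_row):
    """Softmax attention for one query row, touching only unmasked keys.

    Assumes mask_row has at least one True entry.
    """
    cols = [j for j, keep in enumerate(mask_row) if keep]
    scale = 1.0 / math.sqrt(len(q_row))
    scores = [scale * sum(a * b for a, b in zip(q_row, keys[j])) for j in cols]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[j][d] for w, j in zip(weights, cols)) / total
        for d in range(dim)
    ]


def nonempty_block_fraction(mask, block):
    """Fraction of block x block tiles holding at least one unmasked entry."""
    n = len(mask)
    tiles = 0
    nonempty = 0
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            tiles += 1
            if any(
                mask[i][j]
                for i in range(bi, min(bi + block, n))
                for j in range(bj, min(bj + block, n))
            ):
                nonempty += 1
    return nonempty / tiles


if __name__ == "__main__":
    mask = sliding_window_mask(seq_len=8, window=1)
    # Share of 2x2 tiles that a block-wise kernel would still have to process.
    print(nonempty_block_fraction(mask, block=2))
```

In this toy setting, a mask whose unmasked entries cluster into a few tiles favors block-wise processing, while a mask with scattered entries leaves most tiles non-empty and favors a row-wise traversal. How STOF actually makes that choice is determined by its analytical model, which this sketch does not reproduce.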

Related benchmarks

| Task | Dataset | Result: Tuning Time (s) | Rank |
| --- | --- | --- | --- |
| End-to-end inference tuning | BERT base | 23.3 | 9 |
| End-to-end inference tuning | BERT large | 22.6 | 9 |
| End-to-end inference tuning | GPT | 23.8 | 9 |
| End-to-end inference tuning | LLAMA | 29.5 | 9 |
| End-to-end inference tuning | T5 | 43.1 | 9 |
| End-to-end inference tuning | ViT | 93.9 | 9 |
