
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

About

SVD-based low-rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD-compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase-specific kernels with dense-KV decode, packed MLP execution, and per-layer CUDA-graph replay to reorganize the low-rank serving path into a thin runtime. Across representative decoder-serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end-to-end speedup, and it attains 1.48x average decode and 1.44x average end-to-end speedup across multiple popular SVD compression families. These results suggest that practical low-rank acceleration requires runtime co-design, not compression algorithms alone. Our code is available at: https://github.com/Zishan-Shao/FlashSVD.
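To make the "nominal FLOPs" point concrete, here is a minimal sketch of the standard counting argument for a single linear layer replaced by a rank-r factorization, y = U (V x). It is not taken from the FlashSVD code: the function names and the 4096 x 4096 layer with rank 1024 are illustrative assumptions, and the count covers only parameters and multiply-add FLOPs.

```python
# Illustrative only: nominal cost of a dense linear layer versus a rank-r
# factorization. All names and shapes are assumptions for this sketch and
# do not come from the FlashSVD repository.

def dense_cost(d_out, d_in):
    """Parameters and per-token FLOPs for y = W x, with W of shape (d_out, d_in)."""
    params = d_out * d_in
    flops = 2 * params  # one multiply and one add per weight
    return params, flops


def factorized_cost(d_out, d_in, rank):
    """Same quantities for y = U (V x), with U: (d_out, rank) and V: (rank, d_in)."""
    params = rank * (d_out + d_in)
    flops = 2 * params
    return params, flops


def break_even_rank(d_out, d_in):
    """Largest rank for which the factorized layer has strictly fewer parameters."""
    # Solve rank * (d_out + d_in) < d_out * d_in for integer rank.
    return (d_out * d_in - 1) // (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 1024  # hypothetical layer shape and rank
    dense_params, dense_flops = dense_cost(d_out, d_in)
    fact_params, fact_flops = factorized_cost(d_out, d_in, rank)
    print("dense params:", dense_params)
    print("factorized params:", fact_params)
    print("nominal FLOP ratio:", fact_flops / dense_flops)
    print("break-even rank:", break_even_rank(d_out, d_in))
```

The ratio printed here is only the nominal saving. The factorized layer runs two matrix multiplies where the dense layer runs one, so any per-kernel overhead is counted twice, which is the kind of runtime cost that a FLOP count does not capture.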

Wenhao Wu, Zishan Shao, Kangning Cui, Jinhee Kim, Yixiao Wang, Hancheng Ye, Danyang Zhuo, Yiran Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| LLM Inference | LLaMA-7B v1 (serving) | Decode Latency (ms/token): 12.16 | 16 |
| Natural Language Inference | MNLI | Latency (ms): 22.52 | 4 |
| Paraphrase Identification | QQP | Latency (ms): 22.69 | 4 |
| Semantic Textual Similarity | STS-B | Latency (ms): 22.63 | 4 |
