
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

About

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
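The core reparameterization described above can be illustrated with a small sketch. The snippet below shows a greedy binary-coded quantization (BCQ) of a weight matrix into binary matrices with group-wise scaling factors, and a multiplication-free-style evaluation where the binary matrices contribute only adds/subtracts. This is a minimal NumPy illustration under assumed details (greedy residual fitting, row-wise groups, helper names `quantize_bcq` and `shiftadd_matvec` are hypothetical), not the paper's exact algorithm; real ShiftAddLLM kernels further round scaling factors to powers of two so the scaling becomes bit shifts, and replace the binary matmul with table lookups (queries) and adds.

```python
import numpy as np

def quantize_bcq(W, n_bits=3, group_size=8):
    """Greedy BCQ sketch: W ~ sum_i alpha_i * B_i, with B_i in {-1,+1}
    and scaling factors alpha_i shared within each row group.
    Hypothetical helper, not the paper's exact procedure."""
    rows, cols = W.shape
    n_groups = rows // group_size
    Bs, alphas = [], []
    R = W.copy()  # residual to be fitted bit by bit
    for _ in range(n_bits):
        B = np.sign(R)
        B[B == 0] = 1
        # Optimal shared scale per group: mean |residual| over the group.
        alpha = np.abs(R).reshape(n_groups, group_size, cols).mean(axis=1)
        alpha_full = np.repeat(alpha, group_size, axis=0)
        Bs.append(B)
        alphas.append(alpha_full)
        R = R - alpha_full * B  # subtract this bit's contribution
    return Bs, alphas

def shiftadd_matvec(Bs, alphas, x):
    """Evaluate x @ W_hat using the reparameterized form. Since each
    B is {-1,+1}, the binary part needs only adds/subtracts; in a real
    kernel the alpha scaling is a bit shift (alpha rounded to 2^k) and
    the binary accumulation is done via lookup tables and adds."""
    y = 0.0
    for B, alpha in zip(Bs, alphas):
        y = y + x @ (alpha * B)
    return y
```

Running `quantize_bcq` on a weight matrix and summing `alpha * B` over the bits yields an approximation whose error shrinks as `n_bits` grows, which is the trade-off the paper's bit allocation strategy exploits across layers of varying sensitivity.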

Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin • 2024

Related benchmarks

Task                           Dataset               Metric            Result   Rank
Language Modeling              WikiText2             Perplexity        21.53    2839
Language Modeling              WikiText-2 (test)     Perplexity        3.64     1949
Language Modeling              WikiText-2            Perplexity        3.64     1624
Language Modeling              WikiText2 v1 (test)   Perplexity        4.96     383
Inference Latency              OPT model family      Latency (ms)      6.3      79
Inference Latency              A100 GPU              Latency (ms)      26.7     48
Energy Consumption Estimation  OPT-125M              Energy (J)        11.41    8
Energy Consumption Estimation  OPT-350M              Energy (J)        20.17    8
Energy Consumption Estimation  OPT-1.3B              Energy (J)        55.8     8
Energy Consumption Estimation  OPT-2.7B              Energy (J)        95.67    8

(Showing 10 of 15 rows.)

Other info

Code: https://github.com/GATECH-EIC/ShiftAddLLM
