Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

About

Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present \textbf{LoRA-Drop}, a plug-and-play inference framework that accelerates decoding by applying a \emph{temporal compute schedule} to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic \emph{refresh} steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across \textbf{LLaMA2-7B}, \textbf{LLaMA3-8B}, \textbf{Qwen2.5-7B}, and \textbf{Qwen2.5-14B}, LoRA-Drop achieves up to \textbf{2.6$\times$ faster decoding} and \textbf{45--55\% KV-cache reduction} while staying within \textbf{0.5 percentage points (pp)} of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent \emph{safe zone} of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Codes are available at https://github.com/hosseinbv/LoRA-Drop.git.

Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon• 2026

Related benchmarks

TaskDatasetResultRank
Commonsense ReasoningHellaSwag
Accuracy77
1891
Commonsense ReasoningWinoGrande
Accuracy73.6
1085
Code GenerationHumanEval
Pass@140.7
1036
Multi-task Language UnderstandingMMLU--
876
Commonsense ReasoningPIQA
Accuracy81.1
751
ReasoningBBH
Accuracy39
672
Mathematical ReasoningMATH
Accuracy16.5
535
Question AnsweringARC Easy
Normalized Acc80.2
389
Language ModelingLAMBADA
Accuracy74.6
268
Natural Language InferenceXNLI
Accuracy69
111
Showing 10 of 14 rows

Other info

Follow for update