
Attention Drift: What Autoregressive Speculative Decoding Models Learn

About

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and on long-context inputs. We identify a previously unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, its attention progressively shifts from the prompt onto its own recently generated tokens. We observe this in both EAGLE3 drafters and MTP heads, suggesting that drift is a property of drafter designs. We trace it to the un-normalized residual path between chain steps: the drafter's hidden-state magnitude grows monotonically with chain depth, and the drafter exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than with a standalone autoregressive predictor. To limit this growth, we propose two architectural changes: post-norm on the drafter hidden states and a per-hidden-state RMSNorm applied after capturing the target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to 2× under template perturbation, 1.18× on long-context tasks, and 1.10× on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize to longer drafting sequences.
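To make the norm-growth argument concrete, here is a minimal toy sketch in Python. It is not the authors' architecture: `toy_sublayer` is a hypothetical stand-in for a drafter block, the RMSNorm has no learned gain, and the vector size and step count are arbitrary. It only illustrates why a residual path that is never re-normalized between chain steps lets the hidden-state norm grow with depth, while a post-norm keeps it fixed.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def toy_sublayer(x):
    """Hypothetical stand-in for a drafter block; a real block is learned."""
    return [0.5 * v for v in x]


def chain_norms(depth, dim=64, post_norm=False):
    """Return the hidden-state L2 norm after each step of a toy draft chain."""
    # Deterministic, normalized starting state (assumed, for illustration only).
    h = rms_norm([math.sin(i + 1) for i in range(dim)])
    norms = []
    for _ in range(depth):
        if post_norm:
            # Post-norm: normalize the sum of residual and sublayer output.
            h = rms_norm([a + b for a, b in zip(h, toy_sublayer(h))])
        else:
            # Pre-norm: only the sublayer input is normalized, so the
            # residual stream itself is carried forward un-normalized.
            update = toy_sublayer(rms_norm(h))
            h = [a + b for a, b in zip(h, update)]
        norms.append(l2_norm(h))
    return norms


pre = chain_norms(6, post_norm=False)
post = chain_norms(6, post_norm=True)
print("pre-norm chain :", [round(n, 2) for n in pre])
print("post-norm chain:", [round(n, 2) for n in post])
```

In this toy setting the pre-norm chain's norm increases at every step, whereas the post-norm chain stays at roughly sqrt(dim). Whether a real drafter behaves this way depends on its learned weights; the sketch shows only the structural mechanism described in the abstract.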

Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | -- | 543 |
| Multi-turn conversation | MT-Bench | -- | 76 |
| Instruction Following | Alpaca | Average Accepted Length: 3.57 | 51 |
| Code Generation | LiveCodeBench | Mean Acceptance Length (τ): 4.71 | 22 |
| Scientific Question Answering | GPQA | Avg Response Length: 4.78 | 13 |
| Multi-turn Chat Evaluation | MT-Bench | Acceptance Length: 3.61 | 8 |
| Instruction Following | Alpaca | Acceptance Length: 3.59 | 6 |
| Code Generation | HumanEval | Acceptance Length: 5.21 | 4 |
| Code Generation | HumanEval | SGLang Acceptance Length: 4.49 | 4 |
| Coding | LongBench Repobench (test) | Avg Accepted Draft Tokens/Round: 2.26 | 4 |
Showing 10 of 21 rows
