
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

About

Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
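The visibility policy and the memory figure described above can be illustrated with a small back-of-the-envelope sketch. It is not the authors' implementation. It assumes the public Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), 2-byte cache elements, a cutoff at the first 24 layers (75% of depth), and a single BoS token as the only anchor. The function names and the way "active KV memory" is counted are hypothetical choices made for illustration.

```python
# Assumed Llama-3.1-8B shape; bytes_per_elem=2 assumes an fp16/bf16 cache.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2


def has_kv(layer: int, kind: str, cutoff: int) -> bool:
    """Hypothetical visibility rule: does a token keep a KV entry at this layer?

    kind is "anchor" (e.g. BoS), "prompt" (non-anchor prefill token),
    or "decode" (generated token). Layers are 0-indexed.
    """
    if kind in ("anchor", "decode"):
        return True           # anchors and decode tokens stay full-depth
    return layer < cutoff     # non-anchor prompt tokens: lower layers only


def active_kv_bytes(n_prompt: int, n_decode: int, cutoff: int,
                    n_anchor: int = 1) -> int:
    """Total KV-cache bytes under the sketched policy."""
    # One K and one V vector per KV head, per token, per layer.
    per_token_layer = 2 * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    n_anchor = min(n_anchor, n_prompt)
    entries = 0
    for layer in range(N_LAYERS):
        entries += n_anchor + n_decode
        if has_kv(layer, "prompt", cutoff):
            entries += n_prompt - n_anchor
    return entries * per_token_layer


if __name__ == "__main__":
    ctx = 128 * 1024
    full = active_kv_bytes(ctx, 0, cutoff=N_LAYERS)
    speed = active_kv_bytes(ctx, 0, cutoff=24)
    print(f"full depth: {full / 2**30:.2f} GiB")
    print(f"cutoff 24:  {speed / 2**30:.2f} GiB")
    print(f"reduction:  {100 * (1 - speed / full):.1f}%")
```

Under these assumptions a 128K-token prompt occupies 16 GiB at full depth and roughly 12 GiB with the 24-layer cutoff, a reduction of about 25%, which is in line with the figure quoted in the abstract. A real measurement may count memory differently (for example, allocator overhead or a different cache dtype).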

Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Question Answering | HotpotQA | EM | 59.5 | 173 |
| Question Answering | HotpotQA | F1 | 75.5 | 132 |
| Question Answering | NQ | EM | 50.2 | 45 |
| Prefill | Stage-aware Prefill | TTFT (ms) | 63.29 | 32 |
| Prefill KV-cache memory measurement | TULU-3 (dev) | Active KV-cache Memory (GiB) | 0.109 | 32 |
| Stage-aware Prefill | TULU-3 (dev) | Total FLOPs (teraFLOPs) | 13.13 | 32 |
| Long-context retrieval | S-NIAH | Exact Match Accuracy | 99.6 | 12 |
| Mathematical Reasoning | MathBench | Accuracy | 53 | 11 |
| General Capability | OLMES benchmarks | Average Score | 51.3 | 9 |
| Inference Efficiency | 128K-context | TTFT | 101 | 8 |
