Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction

About

Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.
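The combination step described above can be sketched in a few lines. The following is a minimal single-head, single-query NumPy illustration, not the paper's implementation: it assumes an elu(x)+1 feature map for the linear-attention estimate, and all function names (`prefill_mid_summary`, `completed_attention`) are illustrative. Exact unnormalized softmax terms over the anchor and retrieved tokens are added to the feature-map estimate of the mid-region numerator and denominator, and a single normalization is applied at the end.

```python
import numpy as np

def phi(x):
    # Positive feature map phi(x) = elu(x) + 1, a common choice in linear attention.
    return np.where(x > 0.0, x + 1.0, np.exp(x))

def prefill_mid_summary(K_mid, V_mid):
    # Fixed-size summary of the mid-region KV pairs, built once at prefill:
    # S accumulates phi(k_j) v_j^T (d x d), z accumulates phi(k_j) (d,).
    s = K_mid.shape[-1] ** 0.25          # split the 1/sqrt(d) scale across q and k
    F = phi(K_mid / s)                   # (m, d)
    return F.T @ V_mid, F.sum(axis=0)    # S: (d, d), z: (d,)

def completed_attention(q, K_exact, V_exact, S, z):
    # Exact softmax terms over sink/tail anchors + retrieved Top-K, plus the
    # linear-attention estimate of the mid-region, combined in the unnormalized
    # domain with one final normalization.
    d = q.shape[-1]
    w = np.exp(K_exact @ q / np.sqrt(d))  # exact unnormalized weights
    num = w @ V_exact                     # exact numerator
    den = w.sum()                         # exact denominator
    fq = phi(q / d ** 0.25)
    return (num + fq @ S) / (den + fq @ z)
```

As a sanity check on the design, when the mid-region is empty the summary is all zeros and `completed_attention` reduces exactly to softmax attention over the retained tokens; with a non-empty mid-region, the feature-map term recovers an estimate of the softmax mass that selection-only Top-K would silently drop.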

Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-context language modeling | RULER (16K context) | Accuracy (RULER 16K) | 83 | 72 |
| Long-context reasoning | BABILong 16K | Accuracy | 28.3 | 72 |
| Long-context language modeling evaluation | RULER (8K context) | Average Accuracy (RULER 8K) | 84.8 | 72 |
| Long-context reasoning | BABILong 8K | Accuracy | 33 | 65 |
| Long-context language modeling | RULER | Accuracy | 88.4 | 51 |
| Long-context reasoning | BABILong 4K | Accuracy (BABILong 4K) | 34.3 | 51 |
