Residual-Mass Accounting for Partial-KV Decoding
About
We study a controlled partial-KV decoding setting in which exact unnormalized softmax contributions are computed for sink/tail anchors and a retrieved token set, while the remaining prefill tokens are represented by a residual estimate. We focus on the accounting rule after the query-dependent exact support has been selected, and use exhaustive Top-K only as an oracle selector, not as a deployable retrieval system. The proposed rule leaves the backbone language model and the exact-branch KV tensors unchanged. It builds fixed-size summary states $(S,u)$ from learned positive feature maps $\phi$, subtracts retrieved-token feature contributions to keep the exact and residual sets non-overlapping, and merges the estimated residual numerator and denominator with the exact branch under one normalization. At a 1% exact-support budget, our residual-completion method improves over the selection-only Top-K baseline on RULER and BABILong across frozen 1B and 3B Llama-3.2-Instruct backbones at all reported context lengths. In the 0.5-4% exact-support budget sweeps, this trend largely persists. On LongBench, summarization results are mostly favorable, while multi-document QA is mixed. Attention-output diagnostics support retrieved-token subtraction as the partition-consistent accounting rule, while indicating that the main remaining error is imperfect learned-$\phi$ approximation of the unretrieved residual mass.
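The accounting rule described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `phi` is a hypothetical fixed positive feature map standing in for the learned $\phi$, the summary $(S,u)$ is assumed to be built over all prefill tokens (so every token in the exact set is subtracted), the usual $1/\sqrt{d}$ scaling is omitted, and all function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    # Hypothetical positive feature map; the paper learns phi instead.
    return [math.exp(t) for t in x] + [math.exp(-t) for t in x]

def build_summary(keys, values):
    # S = sum_i phi(k_i) v_i^T (m x d matrix), u = sum_i phi(k_i) (length m).
    m, d = len(phi(keys[0])), len(values[0])
    S = [[0.0] * d for _ in range(m)]
    u = [0.0] * m
    for k, v in zip(keys, values):
        f = phi(k)
        for a in range(m):
            u[a] += f[a]
            for b in range(d):
                S[a][b] += f[a] * v[b]
    return S, u

def residual_summary(S, u, keys, values, exact_idx):
    # Subtract the feature contributions of exact-branch tokens so the
    # residual covers only tokens outside the exact set.
    S_res = [row[:] for row in S]
    u_res = u[:]
    for i in exact_idx:
        f = phi(keys[i])
        for a in range(len(u_res)):
            u_res[a] -= f[a]
            for b in range(len(values[i])):
                S_res[a][b] -= f[a] * values[i][b]
    return S_res, u_res

def partial_kv_attention(q, keys, values, exact_idx, S, u):
    d = len(values[0])
    # Exact branch: unnormalized softmax numerator and denominator.
    num, den = [0.0] * d, 0.0
    for i in exact_idx:
        w = math.exp(dot(q, keys[i]))
        den += w
        for b in range(d):
            num[b] += w * values[i][b]
    # Residual branch: estimated numerator and denominator from (S, u).
    S_res, u_res = residual_summary(S, u, keys, values, exact_idx)
    fq = phi(q)
    den += dot(fq, u_res)
    for b in range(d):
        num[b] += sum(fq[a] * S_res[a][b] for a in range(len(fq)))
    # One shared normalization over both branches.
    return [n / den for n in num]

def exact_attention(q, keys, values):
    # Full softmax attention, used only as a reference.
    w = [math.exp(dot(q, k)) for k in keys]
    z = sum(w)
    return [sum(wi * v[b] for wi, v in zip(w, values)) / z
            for b in range(len(values[0]))]

# Toy data (illustrative values only).
keys = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.6]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
q = [0.3, -0.1]
S, u = build_summary(keys, values)
approx = partial_kv_attention(q, keys, values, [0, 3], S, u)
```

Two properties of the sketch are worth noting. When the exact set covers every token, the residual vanishes and the output matches full softmax attention up to floating-point error. And the subtracted residual equals a summary built directly over the complement of the exact set, which is the non-overlap property the abstract refers to as partition-consistent accounting. Any remaining gap to exact attention then comes from how well $\phi(q)^\top\phi(k)$ approximates $\exp(q^\top k)$ on the unretrieved tokens.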
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-context language modeling | RULER | -- | -- | 75 |
| Long-context language modeling | RULER 16k context | Accuracy (RULER 16K) | 83 | 72 |
| Long-context Reasoning | BABILong 16k | Accuracy | 28.3 | 72 |
| Long-context language modeling evaluation | RULER Context Length = 8K | Average Accuracy (RULER 8K) | 84.8 | 72 |
| Long-context Reasoning | BABILong 8K | Accuracy | 33 | 65 |
| Long-context Reasoning | BABILong 4K | Accuracy (BABILong 4k) | 34.3 | 51 |