Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction

About

Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.
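The combination step described above can be sketched in a few lines. The following is a minimal single-head, single-query NumPy illustration, not the paper's implementation: it assumes an elu(x)+1 feature map for the linear-attention estimate, and all function names (`prefill_mid_summary`, `completed_attention`) are illustrative. Exact unnormalized softmax terms over the anchor and retrieved tokens are added to the feature-map estimate of the mid-region numerator and denominator, and a single normalization is applied at the end.

```python
import numpy as np

def phi(x):
    # Positive feature map phi(x) = elu(x) + 1, a common choice in linear attention.
    return np.where(x > 0.0, x + 1.0, np.exp(x))

def prefill_mid_summary(K_mid, V_mid):
    # Fixed-size summary of the mid-region KV pairs, built once at prefill:
    # S accumulates phi(k_j) v_j^T (d x d), z accumulates phi(k_j) (d,).
    s = K_mid.shape[-1] ** 0.25          # split the 1/sqrt(d) scale across q and k
    F = phi(K_mid / s)                   # (m, d)
    return F.T @ V_mid, F.sum(axis=0)    # S: (d, d), z: (d,)

def completed_attention(q, K_exact, V_exact, S, z):
    # Exact softmax terms over sink/tail anchors + retrieved Top-K, plus the
    # linear-attention estimate of the mid-region, combined in the unnormalized
    # domain with one final normalization.
    d = q.shape[-1]
    w = np.exp(K_exact @ q / np.sqrt(d))  # exact unnormalized weights
    num = w @ V_exact                     # exact numerator
    den = w.sum()                         # exact denominator
    fq = phi(q / d ** 0.25)
    return (num + fq @ S) / (den + fq @ z)
```

As a sanity check on the design, when the mid-region is empty the summary is all zeros and `completed_attention` reduces exactly to softmax attention over the retained tokens; with a non-empty mid-region, the feature-map term recovers an estimate of the softmax mass that selection-only Top-K would silently drop.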

Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-context language modeling | RULER (16K context) | Accuracy (RULER 16K) | 83 | 72 |
| Long-context reasoning | BABILong 16K | Accuracy | 28.3 | 72 |
| Long-context language modeling evaluation | RULER (8K context) | Average Accuracy (RULER 8K) | 84.8 | 72 |
| Long-context reasoning | BABILong 8K | Accuracy | 33 | 65 |
| Long-context language modeling | RULER | Accuracy | 88.4 | 51 |
| Long-context reasoning | BABILong 4K | Accuracy (BABILong 4K) | 34.3 | 51 |
