
Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

About

Self-attention dominates the computational and memory cost of long-context LLM inference in both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity using lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct token-to-token interactions. The accumulated walk scores are used to select the top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both prefill and decode, backed by custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density, can slightly outperform dense attention in some settings, and achieves up to a 6x inference speedup.
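The abstract's pipeline can be illustrated with a small NumPy sketch. This is a hedged toy illustration, not the paper's implementation: all function names, the damping parameter, and the rollout-style layer chaining are assumptions made for demonstration; the paper's actual sketching transform, walk formulation, and kernels may differ.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of the n x n Hadamard matrix (n a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def sketch_scores(Q, K, sketch_dim, rng):
    # Randomized Hadamard sketch: random sign flips, Hadamard transform,
    # then a random subset of coordinates (JL-style), giving a cheap
    # low-dimensional approximation of Q K^T / sqrt(d).
    d = Q.shape[-1]
    H = hadamard(d) / np.sqrt(d)
    signs = rng.choice([-1.0, 1.0], size=d)
    idx = rng.choice(d, size=sketch_dim, replace=False)
    P = (H * signs)[:, idx] * np.sqrt(d / sketch_dim)   # d x m projection
    return (Q @ P) @ (K @ P).T / np.sqrt(d)

def block_scores(scores, block):
    # Softmax over keys, then mean attention mass per (query-block, key-block).
    n = scores.shape[0]
    nb = n // block
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p.reshape(nb, block, nb, block).mean(axis=(1, 3))

def walk_aggregate(per_layer_blocks, damping=0.5):
    # "Walk" across layers: chain per-layer block matrices (rollout-style)
    # so indirect, multi-hop influence is captured, not just direct scores.
    agg = per_layer_blocks[0]
    for B in per_layer_blocks[1:]:
        agg = damping * B + (1 - damping) * (B @ agg)
    return agg

def topk_blocks(agg, k):
    # Boolean mask keeping the k highest-scoring attention blocks.
    flat = agg.ravel()
    mask = np.zeros_like(flat, dtype=bool)
    mask[np.argsort(flat)[-k:]] = True
    return mask.reshape(agg.shape)
```

With 128 tokens, head dimension 64, and 16-token blocks, keeping the top ~20% of the 8x8 block grid (13 blocks) yields a sparse mask that a block-sparse kernel could then consume; the sketch dimension trades approximation quality against cost.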

Hoang Anh Duy Le, Sahil Joshi, Zeyu Yang, Zhaozhuo Xu, Anshumali Shrivastava ((1) Rice University, (2) Stevens Institute of Technology) • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Language Understanding | LongBench | M-Avg | 47.67 | 219 |
| Long-context Understanding | LongBench (test) | Avg Score | 48 | 80 |
| Long-context Understanding | LongBench | 2WikiMQA | 45.54 | 25 |
| Long-context performance evaluation | RULER | Accuracy | 95 | 10 |
| Long-context Understanding | RULER | Accuracy (4K) | 96.56 | 8 |
| Long-context Understanding | RULER (test) | RULER Accuracy (4K) | 95.94 | 4 |
