HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
About
Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache that mitigate redundant computation. Various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, but they often struggle to balance efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-$k$ Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the top-$k$ attention process. Unlike existing top-$k$ attention methods, which seek an absolute estimate of each query-key (qk) score, typically at great cost, HATA maps queries and keys into binary hash codes and recovers the relative qk score order at very low cost, which is sufficient for realizing top-$k$ attention. Extensive experiments demonstrate that HATA achieves up to 7.2$\times$ speedup over vanilla full attention while maintaining model accuracy. In addition, HATA outperforms state-of-the-art top-$k$ attention methods in both accuracy and efficiency across multiple mainstream LLMs and diverse tasks. HATA is open source at https://github.com/gpzlx1/HATA.
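The core idea, ranking keys by hash-code similarity instead of computing exact qk scores for every key, can be illustrated with a minimal sketch. Note the assumptions: HATA learns its hash functions, whereas this sketch substitutes random hyperplane hashing; the function names (`hash_codes`, `hash_topk_attention`) and all shapes are illustrative, not the repository's API.

```python
import numpy as np

def hash_codes(x, planes):
    # Project onto hyperplanes and binarize: one bit per plane.
    # (HATA learns these projections; random planes are a stand-in here.)
    return (x @ planes.T > 0).astype(np.uint8)

def hash_topk_attention(q, K, V, planes, k):
    """Approximate top-k attention for a single query.

    q: (d,) query; K: (n, d) keys; V: (n, dv) values;
    planes: (b, d) hash hyperplanes; k: number of keys to keep.
    """
    qc = hash_codes(q[None, :], planes)[0]   # (b,) query bits
    kc = hash_codes(K, planes)               # (n, b) key bits
    # Hamming distance gives a cheap *relative* ordering of qk scores:
    # smaller distance ~ higher expected dot product.
    ham = (qc[None, :] != kc).sum(axis=1)    # (n,)
    idx = np.argsort(ham)[:k]                # top-k candidate keys
    # Exact softmax attention only over the selected subset.
    scores = (K[idx] @ q) / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]
```

Only the k selected keys and values participate in the exact attention, so the expensive part of the computation scales with k rather than with the full KVCache length; the ranking itself needs only bitwise XOR and popcount on short binary codes.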
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-context Language Understanding | InfiniteBench | En.Sum: 19.27 | 63 |
| Long-context Language Understanding | LongBench-e (test) | LCC: 68.42 | 16 |
| Long-context evaluation | RULER (32K context length, test) | Niah1 score: 100 | 12 |
| Retrieval | RULER (128K context) | -- | 12 |
| Long-context Language Understanding | LongBench-e | LCC: 44.86 | 9 |
| Long-context evaluation | RULER (256K) | NS1: 100 | 3 |
| Multiple-choice Question Answering | LongBench v2 (test) | Accuracy (Easy, Short): 38.98 | 3 |
| Long-context Understanding | LongBench-e | LCC: 53.9 | 2 |