
RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training

About

Softmax Attention has quadratic time complexity in sequence length, which becomes prohibitively expensive at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a single attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce Repeated Arrays-of-Count Estimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines at up to 64K sequence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU in a single forward-backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware. We release our code at https://github.com/sahiljoshi515/RACE_Attention.
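To make the linear-time idea concrete, here is a minimal sketch of attention via soft LSH bucketing with Gaussian random projections. This is an illustrative approximation of the general approach, not the paper's exact formulation: the function name, the softmax-based soft bucket assignment, and the parameters `num_tables`, `num_buckets`, and `tau` are all assumptions for exposition. Each hash table aggregates value vectors into a fixed number of buckets in O(n), so no n×n attention matrix is ever formed.

```python
import numpy as np

def race_like_attention(Q, K, V, num_tables=4, num_buckets=16, tau=4.0, seed=0):
    """Illustrative linear-time attention via soft LSH bucketing.

    Hypothetical sketch: the soft assignment (softmax over Gaussian random
    projections) and all hyperparameter names are assumptions, not the
    paper's exact method.
    """
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    out = np.zeros((n, d))
    for _ in range(num_tables):
        # Gaussian random projections shared by queries and keys
        R = rng.standard_normal((d, num_buckets))

        def soft_assign(X):
            # Soft bucket membership; each row is a probability distribution
            logits = tau * (X @ R)
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            A = np.exp(logits)
            return A / A.sum(axis=1, keepdims=True)

        Aq, Ak = soft_assign(Q), soft_assign(K)
        bucket_sums = Ak.T @ V        # (num_buckets, d): value mass per bucket, O(n)
        bucket_mass = Ak.sum(axis=0)  # (num_buckets,): key mass per bucket
        # Each query reads a normalized average from the buckets it lands in
        out += (Aq @ bucket_sums) / (Aq @ bucket_mass)[:, None]
    return out / num_tables
```

Because every step touches each token once, the cost is O(n · num_buckets · d) rather than O(n² · d), and each output row is a convex combination of value rows.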

Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 | Accuracy | 65.9 | 101 |
| Classification | LRA ListOps N=2000 (test) | Accuracy | 41.9 | 39 |
| Text Classification | QNLI | Accuracy (%) | 61.1 | 15 |
| Language Modeling | Tiny Stories | Perplexity (PPL) | 2.6 | 9 |
| Sentiment Analysis | IMDB @512 (test) | Accuracy | 81.3 | 8 |
| Sentiment Analysis | SST-2 @1024 (test) | Accuracy | 79.4 | 8 |
| Text Classification | Yahoo @256 (test) | Accuracy | 67.2 | 8 |
| Text Retrieval | LRA Text Retrieval @8000 | Accuracy | 80.9 | 8 |
| Image Classification | Food-101 16K sequence length (test) | Train runtime (s) | 891 | 7 |
| Long Document Classification | Arxiv Long-Document 16K | Accuracy | 71.3 | 7 |

(10 of 13 rows shown)
