
Star Attention: Efficient LLM Inference over Long Sequences

About

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 97-100% of accuracy.
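The two phases described above can be illustrated with a small single-head, single-layer NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (`softmax_attention`, `phase1_blockwise_local`, `phase2_global`) are hypothetical, each "host" is simulated as one block in a Python loop, and the log-sum-exp merge used in phase 2 is one standard way to combine per-host partial attention results, assumed here for illustration. It leaves out any details of the full method that the abstract does not describe.

```python
import numpy as np

def softmax_attention(q, k, v, causal=False):
    """Single-head attention. Returns (output, log-sum-exp of the scores)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Valid only when q and k cover the same positions (square scores).
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    denom = w.sum(axis=-1, keepdims=True)
    out = (w / denom) @ v
    lse = (m + np.log(denom)).squeeze(-1)
    return out, lse

def phase1_blockwise_local(q_ctx, k_ctx, v_ctx, block_size):
    """Phase 1 (sketch): each block attends only within itself.

    Returns the per-token outputs and one (K, V) cache per simulated host.
    """
    outputs, caches = [], []
    for start in range(0, len(q_ctx), block_size):
        sl = slice(start, start + block_size)
        out, _ = softmax_attention(q_ctx[sl], k_ctx[sl], v_ctx[sl], causal=True)
        outputs.append(out)
        caches.append((k_ctx[sl], v_ctx[sl]))
    return np.concatenate(outputs), caches

def phase2_global(q, caches):
    """Phase 2 (sketch): query tokens attend to all cached tokens.

    Each host computes attention over its own cache; the partial results
    are merged with weights derived from the per-host log-sum-exp values.
    """
    outs, lses = zip(*(softmax_attention(q, k, v) for k, v in caches))
    outs = np.stack(outs)   # (hosts, n_query, d)
    lses = np.stack(lses)   # (hosts, n_query)
    weights = np.exp(lses - lses.max(axis=0))
    weights /= weights.sum(axis=0)
    return (weights[..., None] * outs).sum(axis=0)

# Toy usage: 12 context tokens split into 3 blocks of 4, then 2 query tokens.
rng = np.random.default_rng(0)
d, n_ctx, block = 8, 12, 4
q_ctx, k_ctx, v_ctx = (rng.normal(size=(n_ctx, d)) for _ in range(3))
ctx_out, caches = phase1_blockwise_local(q_ctx, k_ctx, v_ctx, block)

q = rng.normal(size=(2, d))
merged = phase2_global(q, caches)
exact, _ = softmax_attention(q, k_ctx, v_ctx)
print(np.allclose(merged, exact))
```

In this single-layer toy, the merged phase 2 result matches ordinary attention over all cached keys, so only one vector and one scalar per query token need to leave each host. In a multi-layer model the approximation comes from phase 1, where each block's deeper-layer representations are computed without seeing the other blocks.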

Shantanu Acharya, Fei Jia, Boris Ginsburg • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-context Language Understanding | InfiniteBench | En.Sum: 30.55 | 63 |
| Video Understanding | LongVideoBench (test) | Accuracy (8-15s): 74.07 | 21 |
| Long Video Understanding | VNBench | Retrieval E Accuracy: 90.67 | 21 |
| Long-context language modeling and retrieval | RULER | VT Score: 83.96 | 14 |
