ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

About

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64× speedup over single-device inference and up to 15.25× over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
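
To make the communication saving concrete, here is a minimal sketch of generic vector quantization: each token embedding is replaced by the index of its nearest codebook entry, so only a few bits per token cross the network. This is not ASTRA's actual quantizer. It assumes a single fixed codebook and 32-bit floats, and it does not model Noise-Augmented Quantization or Distributed Class Tokens. All names (`nearest_code`, `encode`, `decode`, `payload_bits`) and the toy codebook are hypothetical.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def encode(tokens, codebook):
    """Sender side: map each token embedding to one codebook index."""
    return [nearest_code(t, codebook) for t in tokens]


def decode(codes, codebook):
    """Receiver side: look up the (lossy) embedding for each index."""
    return [list(codebook[c]) for c in codes]


def payload_bits(num_tokens, dim, codebook_size=None, float_bits=32):
    """Bits needed to send num_tokens embeddings.

    With codebook_size=None the raw floats are sent; otherwise one
    index per token is sent (assumed layout, one codebook only).
    """
    if codebook_size is None:
        return num_tokens * dim * float_bits
    index_bits = max(1, (codebook_size - 1).bit_length())
    return num_tokens * index_bits


if __name__ == "__main__":
    # Toy codebook: 16 entries of dimension 4 (hypothetical values).
    codebook = [[float(i)] * 4 for i in range(16)]
    tokens = [[3.1, 2.9, 3.0, 3.2], [7.4, 7.0, 6.8, 7.1]]

    codes = encode(tokens, codebook)
    print(codes)                      # [3, 7]
    print(decode(codes, codebook))    # approximate embeddings on the receiver
    print(payload_bits(2, 4))         # 256 bits as raw 32-bit floats
    print(payload_bits(2, 4, 16))     # 8 bits as 4-bit codes
```

In this toy setting the payload shrinks from 256 bits to 8 bits. The saving depends on the embedding dimension and codebook size, and the receiver only sees an approximation of the original embeddings, which is why the abstract describes keeping local attention at full precision.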

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Classification | SST2 | Accuracy: 83.14 | 102 |
| Sentence-pair classification | QQP | Accuracy: 78.03 | 44 |
| Language Modeling | Wikipedia | Perplexity: 13.83 | 43 |
| Language Modeling | WikiText-103 | Perplexity (PPL): 16.84 | 28 |
| Transformer Inference | 12-layer Transformer 1024 tokens (inference) | Speedup: 342.7 | 24 |
| News Classification | AG-News | Accuracy: 83.25 | 9 |
| Prefill Latency | Llama-3-8B 10 Mbps | Prefill Latency (s): 1.563 | 8 |
| Prefill Latency | Llama-3-8B 20 Mbps | Prefill Latency (s): 1.549 | 8 |
| Prefill Latency | Llama-3-8B 50 Mbps | Prefill Latency (s): 1.547 | 8 |
| Prefill Latency | Llama-3-8B 100 Mbps | Prefill Latency (s): 1.545 | 8 |
Showing 10 of 16 rows
