ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

About

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64× speedup over single-device inference and up to 15.25× over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan • 2025
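
To make the communication argument in the abstract concrete, the sketch below compares the payload of full-precision token embeddings with that of vector-quantized codes, and shows a toy nearest-codeword quantizer. It is an illustration under stated assumptions, not ASTRA's implementation: the function names, the codebook, the 768-dimensional embedding, the 1024-entry codebook, and the single code per token are all hypothetical choices made for the example. Only the 10 Mbps bandwidth and the 1024-token sequence length appear in the listing above.

```python
def nearest_code(vec, codebook):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def encode(tokens, codebook):
    """Sender side: replace each token embedding with a codebook index."""
    return [nearest_code(t, codebook) for t in tokens]


def decode(indices, codebook):
    """Receiver side: look up the (lossy) embedding for each received index."""
    return [codebook[i] for i in indices]


def full_precision_bits(num_tokens, dim, bits_per_value=32):
    """Bits needed to send raw embeddings."""
    return num_tokens * dim * bits_per_value


def vq_bits(num_tokens, codebook_size, codes_per_token=1):
    """Bits needed to send only codebook indices."""
    bits_per_index = (codebook_size - 1).bit_length()
    return num_tokens * codes_per_token * bits_per_index


def transfer_seconds(bits, bandwidth_mbps):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return bits / (bandwidth_mbps * 1_000_000)


# Toy quantizer: a hypothetical 4-entry codebook over 2-dimensional embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
tokens = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2]]
codes = encode(tokens, codebook)
restored = decode(codes, codebook)

# Payload comparison with assumed sizes: 1024 tokens, 768-dim float32
# embeddings, and a 1024-entry codebook with one code per token.
raw = full_precision_bits(num_tokens=1024, dim=768)
compressed = vq_bits(num_tokens=1024, codebook_size=1024)

print("codes:", codes)
print("raw bits:", raw, "-> seconds at 10 Mbps:", transfer_seconds(raw, 10))
print("VQ bits:", compressed, "-> seconds at 10 Mbps:", transfer_seconds(compressed, 10))
```

Under these assumed sizes, one exchange of raw embeddings would take seconds at 10 Mbps, while the index-only payload would take on the order of a millisecond. That gap is why compressing the non-local tokens matters at low bandwidth. The accuracy cost of such lossy reconstruction is what the abstract's Noise-Augmented Quantization and Distributed Class Tokens are meant to address, and this sketch does not model it.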

Related benchmarks

Task	Dataset	Metric	Result
Classification	SST2	Accuracy	83.14	108
Sentence-pair classification	QQP	Accuracy	78.03	44
Language Modeling	Wikipedia	Perplexity	13.83	43
Language Modeling	WikiText-103	Perplexity (PPL)	16.84	28
Transformer Inference	12-layer Transformer 1024 tokens (inference)	Speedup	342.7	24
News Classification	AG-News	Accuracy	83.25	9
Prefill Latency	Llama-3-8B 10 Mbps	Prefill Latency (s)	1.563	8
Prefill Latency	Llama-3-8B 20 Mbps	Prefill Latency (s)	1.549	8
Prefill Latency	Llama-3-8B 50 Mbps	Prefill Latency (s)	1.547	8
Prefill Latency	Llama-3-8B 100 Mbps	Prefill Latency (s)	1.545	8

