ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference
About
Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64$\times$ speedup over single-device inference and up to 15.25$\times$ over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
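To make the mixed-precision idea above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a single shared codebook with nearest-neighbour lookup and single-head attention. The function names (`quantize`, `dequantize`, `mixed_precision_attention`, `payload_bits`) and all sizes are hypothetical, chosen only for illustration. Noise-Augmented Quantization and Distributed Class Tokens are not modelled.

```python
import numpy as np

def quantize(x, codebook):
    """Return, for each row of x, the index of the nearest codebook row (L2)."""
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, shape (tokens, codes).
    d = (x ** 2).sum(axis=1, keepdims=True) - 2.0 * x @ codebook.T + (codebook ** 2).sum(axis=1)
    return d.argmin(axis=1)

def dequantize(codes, codebook):
    """Reconstruct approximate embeddings from integer codes."""
    return codebook[codes]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixed_precision_attention(x_local, x_remote_hat, wq, wk, wv):
    """Local queries attend over exact local tokens plus reconstructed remote tokens."""
    context = np.concatenate([x_local, x_remote_hat], axis=0)
    q, k, v = x_local @ wq, context @ wk, context @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def payload_bits(num_tokens, dim, codebook_size=None, float_bits=32):
    """Bits sent per exchange: raw floats, or one code index per token."""
    if codebook_size is None:
        return num_tokens * dim * float_bits
    return num_tokens * max(1, (codebook_size - 1).bit_length())

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
dim, n_codes, n_tokens = 16, 256, 8
codebook = rng.standard_normal((n_codes, dim))
x_local = rng.standard_normal((n_tokens, dim))
x_remote = rng.standard_normal((n_tokens, dim))   # lives on another device
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))

codes = quantize(x_remote, codebook)               # only these indices cross the network
x_remote_hat = dequantize(codes, codebook)         # receiver rebuilds approximate tokens
out = mixed_precision_attention(x_local, x_remote_hat, wq, wk, wv)

print(out.shape)
print(payload_bits(n_tokens, dim), payload_bits(n_tokens, dim, n_codes))
```

In this toy setup each remote token costs 8 bits on the wire instead of 16 × 32 = 512 bits. That reduction is what lets a sequence-parallel scheme tolerate low-bandwidth links. The accuracy cost of the lossy reconstruction is what the techniques named in the abstract are meant to address.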
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | SST2 | Accuracy | 83.14 | 102 |
| Sentence-pair classification | QQP | Accuracy | 78.03 | 44 |
| Language Modeling | Wikipedia | Perplexity | 13.83 | 43 |
| Language Modeling | WikiText-103 | Perplexity (PPL) | 16.84 | 28 |
| Transformer Inference | 12-layer Transformer 1024 tokens (inference) | Speedup | 342.7 | 24 |
| News Classification | AG-News | Accuracy | 83.25 | 9 |
| Prefill Latency | Llama-3-8B 10 Mbps | Prefill Latency (s) | 1.563 | 8 |
| Prefill Latency | Llama-3-8B 20 Mbps | Prefill Latency (s) | 1.549 | 8 |
| Prefill Latency | Llama-3-8B 50 Mbps | Prefill Latency (s) | 1.547 | 8 |
| Prefill Latency | Llama-3-8B 100 Mbps | Prefill Latency (s) | 1.545 | 8 |