Communication Efficient Distributed Training with Distributed Lion

About

The Lion optimizer has been a promising competitor to AdamW for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an adaptation of Lion for distributed training environments. Leveraging the sign operator in Lion, Distributed Lion only requires communicating binary or lower-precision vectors between the workers and the central server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains performance comparable to standard Lion or AdamW optimizers applied on aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. We also demonstrate that Distributed Lion presents a more favorable performance-bandwidth balance than existing efficient distributed methods such as deep gradient compression and ternary gradients.
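The communication pattern described above can be illustrated with a minimal sketch. It assumes the standard Lion update (the sign of an interpolation between momentum and gradient) and a majority-vote aggregation at the server. The function names, default hyperparameters, and the choice of majority vote are illustrative assumptions, not taken from this page or from the authors' code.

```python
# Minimal sketch of sign-based distributed updates (pure Python, no dependencies).
# All names and default values here are hypothetical, chosen for illustration.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def worker_update(momentum, grad, beta1=0.9, beta2=0.99):
    """One worker's local step: returns (binary update, new momentum).

    The binary update is the only thing sent to the server, so each
    coordinate costs about one bit instead of a full-precision float.
    """
    delta = [sign(beta1 * m + (1 - beta1) * g) for m, g in zip(momentum, grad)]
    new_momentum = [beta2 * m + (1 - beta2) * g for m, g in zip(momentum, grad)]
    return delta, new_momentum

def majority_vote(deltas):
    """Server side (assumed aggregation rule): coordinate-wise sign of the summed worker updates."""
    return [sign(sum(column)) for column in zip(*deltas)]

def apply_update(params, aggregated, lr=1e-3, weight_decay=0.0):
    """Every worker applies the same aggregated update with decoupled weight decay."""
    return [p - lr * (a + weight_decay * p) for p, a in zip(params, aggregated)]

# Example: three workers, two parameters, momenta starting at zero.
params = [1.0, -1.0]
grads = [[0.5, -2.0], [0.1, 0.3], [-0.4, -0.7]]
deltas = [worker_update([0.0, 0.0], g)[0] for g in grads]
params = apply_update(params, majority_vote(deltas), lr=0.1)
```

In this sketch each worker keeps its own momentum and transmits only a vector of signs, which is where the bandwidth saving relative to sending 32-bit gradients would come from.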

Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, Qiang Liu • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1k (val)	--	1498
Question Answering	ARC Challenge	--	906
Question Answering	OpenBookQA	Accuracy 35.71	465
Physical Interaction Question Answering	PIQA	Accuracy 78.92	415
Sentence Completion	HellaSwag	Accuracy 59.06	364
Boolean Question Answering	BoolQ	Accuracy 77.14	350
Science Question Answering	ARC-E	Accuracy 76.86	240
Social Interaction Question Answering	SIQA	Accuracy 49.75	157
Language Modeling	OpenWebText 1 (val)	Validation Perplexity 14.66	8

