
Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

About

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still train the entire model on each node and thus remain constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy in which experts exchange knowledge early in training to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM on 16 standalone 48GB GPUs over internet connections, achieving performance competitive with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
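The core mechanism described above can be sketched in a few lines: each node materializes only its assigned experts (reducing per-node memory), runs local updates, and periodically averages the experts it shares with other nodes, so only expert parameters, never the full model, are exchanged. The sketch below is illustrative; the class and function names (`Node`, `sync_shared_experts`, the toy sizes and placeholder gradient) are assumptions for this example, not the paper's actual implementation.

```python
import numpy as np

EXPERT_DIM = 4    # toy expert parameter size
NUM_EXPERTS = 8   # total experts in the MoE layer

class Node:
    """A training node that holds only a subset of the experts."""
    def __init__(self, expert_ids, seed):
        rng = np.random.default_rng(seed)
        # Parameters exist only for assigned experts: this is what
        # lowers the per-node memory footprint.
        self.experts = {i: rng.standard_normal(EXPERT_DIM) for i in expert_ids}

    def local_step(self, lr=0.01):
        # Stand-in for a local gradient update on the node's own data.
        for i in self.experts:
            fake_grad = np.ones(EXPERT_DIM)  # placeholder gradient
            self.experts[i] -= lr * fake_grad

def sync_shared_experts(nodes):
    """Periodic sync: average each expert across the nodes hosting it.

    Only expert parameters travel over the network, never the full model."""
    for eid in range(NUM_EXPERTS):
        holders = [n for n in nodes if eid in n.experts]
        if len(holders) < 2:
            continue
        avg = np.mean([n.experts[eid] for n in holders], axis=0)
        for n in holders:
            n.experts[eid] = avg.copy()

# Two nodes with overlapping expert assignments (experts 2 and 3 are shared).
node_a = Node(expert_ids=[0, 1, 2, 3], seed=0)
node_b = Node(expert_ids=[2, 3, 4, 5], seed=1)

for step in range(10):
    node_a.local_step()
    node_b.local_step()
    if (step + 1) % 5 == 0:  # sync every 5 local steps
        sync_shared_experts([node_a, node_b])

# After syncing, the shared experts agree across nodes,
# while each node still holds only its own subset.
assert np.allclose(node_a.experts[2], node_b.experts[2])
```

In a real deployment the averaging step would be a collective communication over internet links rather than an in-process loop, but the memory argument is visible even in this toy: each node stores 4 of the 8 experts, half the full model.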

Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang • 2026

Related benchmarks

| Task                                    | Dataset       | Metric         | Result | Rank |
|-----------------------------------------|---------------|----------------|--------|------|
| Multi-task Language Understanding       | MMLU          | Accuracy       | 63.7   | 842  |
| Question Answering                      | ARC Challenge | Accuracy       | 57.3   | 749  |
| Commonsense Reasoning                   | PIQA          | Accuracy       | 78.9   | 647  |
| Question Answering                      | ARC Easy      | Normalized Acc | 81.5   | 385  |
| Boolean Question Answering              | BoolQ         | Accuracy       | 77.3   | 307  |
| Question Answering                      | OBQA          | Accuracy       | 42.2   | 276  |
| Question Answering                      | SciQ          | Accuracy       | 95.3   | 226  |
| Commonsense Reasoning                   | SIQA          | Accuracy       | 47.5   | 96   |
| Logical Reasoning                       | LogiQA        | Accuracy       | 30.4   | 84   |
| Multi-level Multi-discipline Evaluation | C-Eval        | Accuracy       | 44.7   | 28   |

Other info

GitHub
