AXLearn: Modular, Hardware-Agnostic Large Model Training
About
AXLearn is a production system that facilitates scalable, high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and hardware-agnostic training. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to support rapid model development and experimentation across different hardware infrastructure. AXLearn maintains constant integration complexity as components are added to the system, compared to the linear or quadratic complexity of other state-of-the-art training systems. For example, integrating a feature such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules takes roughly 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn matches the performance of state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn at Apple.
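The constant-complexity claim above rests on strict encapsulation: layers are assembled from composable configs, so a new feature is enabled by overriding one sub-config instead of editing every module that uses it. The following is a minimal illustrative sketch of that pattern using plain Python dataclasses; the names (`RoPEConfig`, `enable_rope`, etc.) are hypothetical and do not reflect AXLearn's actual API.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

# Hypothetical config-based module system (illustration only, not AXLearn's API).
# Each component exposes a config; models are assembled by composing configs.

@dataclass(frozen=True)
class RoPEConfig:
    rotary_dim: int = 64

@dataclass(frozen=True)
class AttentionConfig:
    num_heads: int = 8
    # Positional embedding is an injectable sub-config: None keeps the
    # default behavior; setting it turns on RoPE wherever attention is used.
    rope: Optional[RoPEConfig] = None

@dataclass(frozen=True)
class TransformerLayerConfig:
    attention: AttentionConfig = field(default_factory=AttentionConfig)

@dataclass(frozen=True)
class ModelConfig:
    num_layers: int = 32
    layer: TransformerLayerConfig = field(default_factory=TransformerLayerConfig)

def enable_rope(cfg: ModelConfig, rotary_dim: int) -> ModelConfig:
    # One localized override flips the feature on for every layer, because
    # all layers are instantiated from the same shared sub-config.
    return replace(
        cfg,
        layer=replace(
            cfg.layer,
            attention=replace(cfg.layer.attention, rope=RoPEConfig(rotary_dim)),
        ),
    )

base = ModelConfig()
with_rope = enable_rope(base, rotary_dim=128)
print(with_rope.layer.attention.rope)  # RoPEConfig(rotary_dim=128)
```

Because the override is a single localized change to a shared sub-config, the edit cost stays constant no matter how many modules consume the attention config; a system that hard-codes positional embeddings per layer would instead need a change in every one of those modules.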
Related benchmarks
| Task | Benchmark (Model / Hardware) | Metric | Result | Rank |
|---|---|---|---|---|
| Online Inference | ShareGPT | -- | -- | 32 |
| LLM Training | Llama2-7B | Iteration Time (s) | 1.4 | 4 |
| LLM Training | Llama2-70B (64 x H100-8) | Iteration Time (s) | 9.2 | 4 |
| LLM Training | Llama2-7B (tpu-v5p-512) | Iteration Time (s) | 2.5 | 3 |
| LLM Training | Llama2-70B (tpu-v5p-1024) | Iteration Time (s) | 11.6 | 2 |
| LLM Training | Qwen-3 30B-A3B (tpu-v5p-1024) | Iteration Time (s) | 12.86 | 2 |
| LLM Training | Qwen-3 30B-A3B (64 x B200-8) | Iteration Time (s) | 4.31 | 2 |
| LLM Training | Llama2-7B (64 x Trainium2-16) | Iteration Time (s) | 1.2 | 1 |
| LLM Training | Llama2-70B (64 x Trainium2-16) | Iteration Time (s) | 11.2 | 1 |