
PithTrain: A Compact and Agent-Native MoE Training System

About

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort, yet evolving these stacks for new architectures and system optimizations remains expensive. AI coding agents could automate parts of training-framework development and accelerate this evolution, but applying them to existing frameworks carries hidden costs that today's throughput-only evaluations do not capture. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, a benchmark covering real-world training-framework tasks. Our evaluation shows that PithTrain matches the throughput of production frameworks and, on ATE-Bench, achieves higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.
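The abstract does not spell out how figures such as "62% fewer Agent Turns" are computed. A minimal sketch, assuming they are ordinary relative reductions against a baseline framework, is shown here; the function name `percent_reduction` and the input numbers are hypothetical and are not taken from the paper.

```python
def percent_reduction(baseline: float, candidate: float) -> float:
    """Relative reduction of `candidate` versus `baseline`, in percent.

    Assumption: a "X% fewer" claim means (baseline - candidate) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical values for illustration only; not results from the paper.
baseline_turns = 50   # agent turns on some baseline framework (made up)
candidate_turns = 19  # agent turns on the compared framework (made up)

reduction = percent_reduction(baseline_turns, candidate_turns)
print(f"{reduction:.0f}% fewer Agent Turns")
```

Under this reading, the same helper would apply unchanged to Active GPU Time, since both metrics are costs where lower is better.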

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | ATE-Bench Q&A suite (Q1–Q12) | Agent Turns: 21 | 36 |
| Training Throughput | Qwen3-30B-A3B workload | Throughput (tokens/s): 2.80e+5 | 7 |
| Dynamic Mixture of Experts (DynMoE) implementation | New-feature tasks | Session Duration: 60.4 | 6 |
| Agentic Task Automation | Getting Started Operate-and-Profile | Session Duration (s): 6.6 | 3 |
| Agentic Task Automation | Operate-and-Profile (train) | Session Duration: 38.5 | 3 |
| Agentic Task Automation | Collect Routing Trace Operate-and-Profile | Session Duration: 16.3 | 3 |
| Agentic Task Automation | Report Heavy Kernels Operate-and-Profile | Session Duration: 11.8 | 3 |
| Differential Transformer (Diff) implementation | New-feature tasks | Session Duration: 38.2 | 3 |
| Mixture of Block Attention (MoBA) implementation | New-feature tasks | Session Duration: 38.7 | 3 |
| Training Throughput | DeepSeek-V2-Lite workload | Training Throughput (tokens/s): 1.15e+5 | 3 |
Showing 10 of 11 rows
