
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

About

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning, and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After the individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which lacks the MoE-finetuning stage for learning routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
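The merge step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the parameter names (`ffn_keys`, the dict layout) and the helper functions are hypothetical, and a real MoE layer would also introduce a trained router network rather than the bare top-k selection shown here.

```python
import numpy as np


def btx_merge(expert_params, ffn_keys):
    """Combine N asynchronously trained expert models into one MoE model (sketch).

    expert_params: list of dicts mapping parameter name -> np.ndarray,
                   one dict per expert, all branched from the same seed model.
    ffn_keys:      names of feedforward parameters kept as separate MoE experts.
    """
    merged = {}
    for name in expert_params[0]:
        tensors = [p[name] for p in expert_params]
        if name in ffn_keys:
            # Feedforward weights become per-expert parameters of an MoE layer.
            merged[name] = np.stack(tensors)  # shape: (num_experts, ...)
        else:
            # Remaining parameters (attention, embeddings, ...) are averaged.
            merged[name] = np.mean(tensors, axis=0)
    return merged


def route_top_k(router_logits, top_k=2):
    """Token-level top-k expert selection, as learned in MoE finetuning (sketch)."""
    # Indices of the top_k largest logits along the expert dimension.
    return np.argsort(router_logits, axis=-1)[..., -top_k:]
```

After this merge, only the router and the stacked expert weights need further training; everything shared was produced by simple averaging, which is why the branch-train phase can run with no communication between experts.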

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU 5-shot | -- | -- | 132 |
| Paraphrase Identification | PAWS-X | Accuracy | 53.5 | 57 |
| Natural Language Inference | XNLI 1.0 (test) | Accuracy | 39.1 | 38 |
| Causal Reasoning | XCOPA | Accuracy | 55.6 | 33 |
| Story Reasoning | XStoryCloze | Accuracy | 56.4 | 27 |
| Pronoun Resolution | XWinograd | Accuracy | 55.8 | 16 |
| General Performance | Aggregated LLM Evaluation Suite | Average Score | 47.9 | 10 |
| Question Answering | Natural Questions and TriviaQA 5-shot | Knowledge | 41 | 10 |
| Commonsense Reasoning | ARC-Easy, ARC-Challenge, SIQA, PIQA, and WinoGrande 0-shot | Reasoning Accuracy | 63.7 | 10 |
| Mathematical Reasoning | Math Evaluation Suite | Math Score | 27.4 | 10 |

Showing 10 of 15 rows.
