Retraining-Free Merging of Sparse MoE via Hierarchical Clustering

About

Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.

I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee • 2024
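
The abstract describes clustering experts by their outputs and then merging each cluster without retraining. A minimal sketch of that idea follows, under loud assumptions: the experts are toy linear maps, the similarity feature is each expert's mean output on a small calibration set, the linkage is average linkage over Euclidean distance, and merging is a uniform average of weights. The abstract fixes none of these choices, and every function name here is hypothetical rather than taken from the authors' code.

```python
from itertools import combinations
from math import dist


def expert_output(weights, x):
    """Toy linear expert (assumption): multiply a weight matrix by input x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mean_output(weights, calibration_inputs):
    """Average one expert's output over a calibration set."""
    outputs = [expert_output(weights, x) for x in calibration_inputs]
    return [sum(col) / len(outputs) for col in zip(*outputs)]


def average_linkage(cluster_a, cluster_b, features):
    """Mean pairwise Euclidean distance between members of two clusters."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(dist(features[i], features[j]) for i, j in pairs) / len(pairs)


def cluster_experts(features, num_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > num_clusters:
        # combinations yields a < b, so deleting index b never shifts index a.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], features),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def merge_cluster(expert_weights, members):
    """Merge one cluster by uniformly averaging its members' weights (assumption)."""
    rows = len(expert_weights[0])
    cols = len(expert_weights[0][0])
    return [
        [sum(expert_weights[m][r][c] for m in members) / len(members)
         for c in range(cols)]
        for r in range(rows)
    ]


# Four toy experts: two behave alike, and two behave like their negation.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.1, 0.0], [0.0, 0.9]],
    [[-1.0, 0.0], [0.0, -1.0]],
    [[-0.9, 0.0], [0.0, -1.1]],
]
calibration = [[1.0, 2.0], [0.5, -1.0], [2.0, 1.0]]

features = [mean_output(w, calibration) for w in experts]
clusters = cluster_experts(features, num_clusters=2)
merged_experts = [merge_cluster(experts, members) for members in clusters]
print(clusters)
print(merged_experts)
```

Because the grouping depends only on what the experts compute on the calibration inputs, it does not use router statistics, which is the property the abstract emphasizes. A real SMoE layer would also need its router remapped so that tokens sent to any merged expert reach the cluster's single replacement; that step is omitted here.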

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText-2 | Perplexity (PPL) 11.62 | 2320 |
| Commonsense Reasoning | HellaSwag | Accuracy 57.81 | 1896 |
| Commonsense Reasoning | WinoGrande | Accuracy 72.06 | 1442 |
| Mathematical Reasoning | GSM8K | Accuracy 86.4 | 1398 |
| Question Answering | ARC Challenge | Accuracy 35 | 906 |
| Language Understanding | MMLU | Accuracy 48.95 | 844 |
| Physical Commonsense Reasoning | PIQA | Accuracy 76 | 696 |
| Question Answering | ARC Easy | Accuracy 66 | 597 |
| Mathematical Reasoning | MathQA | Accuracy 41 | 354 |
| Multiple-choice Question Answering | ARC Easy | Accuracy 72.9 | 257 |
Showing 10 of 35 rows
