
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

About

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of routed experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 2.93$\times$ speedup on Jetson AGX Orin compared with dense inference. Code and checkpoints are available at https://github.com/thunlp/DECO.
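The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of RMS normalization inside NormSiLU, and the way the per-expert scale multiplies the gate are all assumptions made for illustration, since the abstract only states that inputs are normalized before SiLU and that routing is ReLU-based with learnable expert-wise scaling.

```python
import math

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def norm_silu(xs, eps=1e-6):
    # Assumption: RMS normalization. The abstract only says inputs are
    # normalized before SiLU, not which normalization is used.
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [silu(x / rms) for x in xs]

def relu_route(logits, scales):
    # ReLU routing: an expert is active only when its router logit is positive.
    # Assumption: the learnable expert-wise scale multiplies the gate value.
    return [s * max(0.0, z) for z, s in zip(logits, scales)]

def activation_ratio(weights):
    # Fraction of routed experts with a nonzero gate.
    return sum(1 for w in weights if w > 0.0) / len(weights)

def moe_output(shared_output, routed_outputs, weights):
    # Shared-expert output plus the gate-weighted sum of active routed experts.
    out = list(shared_output)
    for w, y in zip(weights, routed_outputs):
        if w > 0.0:  # inactive experts are skipped entirely
            for i in range(len(out)):
                out[i] += w * y[i]
    return out

# Hypothetical router logits for five routed experts; one is positive.
weights = relu_route([1.5, -0.2, -1.0, -0.7, -3.0], [2.0, 1.0, 1.0, 1.0, 1.0])
ratio = activation_ratio(weights)
combined = moe_output([1.0, 1.0], [[1.0, 0.0]] * 5, weights)
```

Because the gate is a ReLU instead of a top-k selection, the number of active experts is not fixed per token; it depends on how many logits are positive, which is why the activation ratio is a quantity that can drift during training.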

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Inference Acceleration | Spec-Bench | Speedup: 3 | 53
Commonsense Reasoning | Commonsense Reasoning Suite (PIQA, SIQA, HellaSwag, ARC-E, ARC-C, WinoGrande, LAMBADA) zero-shot | PIQA Accuracy: 66.14 | 18
Commonsense Reasoning | Commonsense Reasoning Suite Zero-shot | PIQA Accuracy: 64.36 | 9
Zero-shot Language Understanding | Commonsense Reasoning and Language Modeling Suite (PIQA, SIQA, HellaSwag, ARC-E, ARC-C, WinoGrande, LAMBADA) zero-shot | PIQA Accuracy: 70.24 | 6