
PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks

Abstract

Biological neural systems employ diverse neurotransmitters -- glutamate, GABA, dopamine, acetylcholine -- to implement distinct signal-processing modalities within shared neural circuits. In contrast, modern transformers apply a single fixed activation function across all feed-forward neurons. We introduce PolyGLU (Polychromatic Gated Linear Unit), a drop-in replacement for SwiGLU that enables each FFN neuron to dynamically route among K=4 activation functions via a differentiable mechanism combining learned static preferences with input-conditioned gating, trained end-to-end with Gumbel-Softmax. We train PolychromaticLM, a 597M-parameter transformer, on ~10B tokens using a single NVIDIA A100 GPU. Our key finding is emergent routing behavior: without any explicit sparsity loss or entropy regularization, the routing mechanism converges to near-deterministic activation selections (mean dynamic entropy = 0.030% of maximum), with a striking depth-dependent specialization pattern -- early layers prefer GELU while deep layers strongly favor Tanh. Three layers maintain elevated routing entropy, suggesting computational flexibility points. The routing architecture adds only 0.23% parameter overhead (~1.4M parameters) and proves fully robust to supervised fine-tuning: routing entropy remains constant at ln(4) throughout 13,067 SFT steps. On standard benchmarks, PolychromaticLM achieves 62-89% of Qwen3-0.6B-Base performance despite training on 3,600x fewer tokens. All code, weights, and training infrastructure are released under Apache 2.0.
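The routing mechanism described above (per-neuron selection among K=4 activations, combining a learned static preference with input-conditioned gating, trained via Gumbel-Softmax) can be sketched as follows. This is a hypothetical reconstruction from the abstract alone, not the released implementation: the candidate activation set (GELU, Tanh, SiLU, ReLU), the linear router, and all layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyGLU(nn.Module):
    """Sketch of a state-conditional activation router (assumed design).

    Each FFN neuron mixes K candidate activations; routing logits are
    the sum of a learned static per-neuron preference and an
    input-conditioned term, relaxed with Gumbel-Softmax so the whole
    module trains end-to-end.
    """

    def __init__(self, d_model: int, d_ff: int, tau: float = 1.0):
        super().__init__()
        # K=4 candidate activations; GELU and Tanh appear in the abstract,
        # the other two are placeholders.
        self.acts = [F.gelu, torch.tanh, F.silu, F.relu]
        self.w_gate = nn.Linear(d_model, d_ff)   # SwiGLU-style gate branch
        self.w_up = nn.Linear(d_model, d_ff)     # value branch
        self.w_down = nn.Linear(d_ff, d_model)
        # learned static preference: one logit vector per neuron
        self.static_logits = nn.Parameter(torch.zeros(d_ff, len(self.acts)))
        # input-conditioned routing logits (cheap linear router, assumed)
        self.router = nn.Linear(d_model, d_ff * len(self.acts))
        self.tau = tau  # Gumbel-Softmax temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_dims = x.shape[:-1]
        K = len(self.acts)
        g = self.w_gate(x)                                        # (..., d_ff)
        logits = self.router(x).view(*batch_dims, -1, K)          # dynamic term
        logits = logits + self.static_logits                      # + static preference
        probs = F.gumbel_softmax(logits, tau=self.tau, dim=-1)    # differentiable routing
        # apply every candidate activation, then mix per neuron
        acted = torch.stack([a(g) for a in self.acts], dim=-1)    # (..., d_ff, K)
        mixed = (acted * probs).sum(dim=-1)                       # (..., d_ff)
        return self.w_down(mixed * self.w_up(x))
```

Under this reading, the near-zero dynamic entropy reported in the abstract would correspond to the Gumbel-Softmax distribution `probs` collapsing onto a single activation per neuron without any explicit sparsity penalty.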

Daniel Nobrega Medeiros • 2026

Related benchmarks

Task                                Dataset         Metric                Result   Rank
Commonsense Reasoning               HellaSwag       --                    --       1891
Commonsense Reasoning               WinoGrande      --                    --       1085
Physical Commonsense Reasoning      PIQA            Accuracy              58.87     572
Science Question Answering          ARC Challenge   Accuracy              24.15     342
Science Question Answering          ARC Easy        Accuracy              41.04     155
Word Prediction                     LAMBADA         Accuracy              15.35     148
Science Question Answering          SciQ            Normalized Accuracy   61.2      137
Question Answering                  OpenBookQA      Normalized Accuracy   29        102
Reading Comprehension               BoolQ           Score                 61.13      10
Multi-task Language Understanding   MMLU STEM       Accuracy              28.42       3
