EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
About
What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest form of recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent.

EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU's accuracy on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches a C4 perplexity of 260 (8× GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
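As a minimal sketch of the probe (not the paper's implementation; the function name, decay values, and one-hot encoding here are illustrative assumptions), the following shows fixed-coefficient EMA accumulation over token vectors. A single slow trace nearly collapses two sequences that contain the same tokens in different orders, while adding a faster timescale recovers some recency structure:

```python
import numpy as np

def ema_traces(tokens, vocab_size, decays):
    """Accumulate EMA traces of one-hot token vectors, one trace per
    decay rate (a multi-timescale context). The update
        h_t = d * h_{t-1} + (1 - d) * x_t
    uses fixed coefficients: no gating, no content-based retrieval.
    """
    traces = np.zeros((len(decays), vocab_size))
    for tok in tokens:
        x = np.zeros(vocab_size)
        x[tok] = 1.0  # one-hot token identity
        for i, d in enumerate(decays):
            traces[i] = d * traces[i] + (1.0 - d) * x
    return traces

# Same bag of tokens, opposite orders.
a = ema_traces([0, 1, 2, 3], vocab_size=4, decays=[0.99, 0.5])
b = ema_traces([3, 2, 1, 0], vocab_size=4, decays=[0.99, 0.5])

slow_gap = np.abs(a[0] - b[0]).max()  # slow trace: orders nearly indistinguishable
fast_gap = np.abs(a[1] - b[1]).max()  # fast trace: recency separates the orders
print(slow_gap < 0.01, fast_gap > 0.1)
```

Because the decay coefficients never depend on the input, the traces are a fixed linear map of the token history; whatever they blur together (here, ordering under a slow decay) is gone for every downstream predictor, which is the data-processing-inequality argument in miniature.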
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | C4 | Perplexity | 260 | 1071 |
| Language Modeling | Wiki | Perplexity (PPL) | 729 | 281 |
| Grammatical Role Assignment | Grammar Within A (held-out) | Accuracy | 96 | 2 |
| Grammatical Role Assignment | Grammar Transfer B (held-out) | Accuracy | 61.8 | 2 |
| Grammatical Role Assignment | Grammar Deep-embedded roles (held-out) | Accuracy | 84 | 2 |