EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
About
What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest form of recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent.

EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU's accuracy on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches a C4 perplexity of 260 (8× GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
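As a minimal sketch of the probe (not the paper's implementation; the function name, decay values, and one-hot encoding here are illustrative assumptions), the following shows fixed-coefficient EMA accumulation over token vectors. A single slow trace nearly collapses two sequences that contain the same tokens in different orders, while adding a faster timescale recovers some recency structure:

```python
import numpy as np

def ema_traces(tokens, vocab_size, decays):
    """Accumulate EMA traces of one-hot token vectors, one trace per
    decay rate (a multi-timescale context). The update
        h_t = d * h_{t-1} + (1 - d) * x_t
    uses fixed coefficients: no gating, no content-based retrieval.
    """
    traces = np.zeros((len(decays), vocab_size))
    for tok in tokens:
        x = np.zeros(vocab_size)
        x[tok] = 1.0  # one-hot token identity
        for i, d in enumerate(decays):
            traces[i] = d * traces[i] + (1.0 - d) * x
    return traces

# Same bag of tokens, opposite orders.
a = ema_traces([0, 1, 2, 3], vocab_size=4, decays=[0.99, 0.5])
b = ema_traces([3, 2, 1, 0], vocab_size=4, decays=[0.99, 0.5])

slow_gap = np.abs(a[0] - b[0]).max()  # slow trace: orders nearly indistinguishable
fast_gap = np.abs(a[1] - b[1]).max()  # fast trace: recency separates the orders
print(slow_gap < 0.01, fast_gap > 0.1)
```

Because the decay coefficients never depend on the input, the traces are a fixed linear map of the token history; whatever they blur together (here, ordering under a slow decay) is gone for every downstream predictor, which is the data-processing-inequality argument in miniature.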
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | C4 | Perplexity | 260 | 1071 |
| Language Modeling | Wiki | Perplexity (PPL) | 729 | 281 |
| Grammatical Role Assignment | Grammar Within A (held-out) | Accuracy | 96 | 2 |
| Grammatical Role Assignment | Grammar Transfer B (held-out) | Accuracy | 61.8 | 2 |
| Grammatical Role Assignment | Grammar Deep-embedded roles (held-out) | Accuracy | 84 | 2 |