xLSTM: Extended Long Short-Term Memory

About

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories; in particular, they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, and (ii) mLSTM, which is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
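The abstract names exponential gating with normalization and stabilization but does not spell out the update rule. As a minimal sketch, assume a scalar cell state `c`, a normalizer state `n`, and a running log-scale stabilizer `m`, roughly in the spirit of the scalar-memory (sLSTM) variant. All function and variable names below are illustrative, the forget gate is assumed to be exponential as well, and the output gate is omitted. Under these assumptions, the stabilized recurrence can be compared against the naive one like this:

```python
import math


def naive_step(c, n, z, i_pre, f_pre):
    """Unstabilized update: gates are raw exponentials of their pre-activations."""
    i_gate = math.exp(i_pre)  # raises OverflowError for large pre-activations
    f_gate = math.exp(f_pre)
    c_new = f_gate * c + i_gate * z  # cell state accumulates the gated input z
    n_new = f_gate * n + i_gate      # normalizer accumulates the gate weights
    return c_new, n_new


def stabilized_step(c, n, m, z, i_pre, f_pre):
    """Same update, with both gates rescaled by a running maximum in log space."""
    m_new = max(f_pre + m, i_pre)         # assumed stabilizer state
    i_gate = math.exp(i_pre - m_new)      # exponent is <= 0, so no overflow
    f_gate = math.exp(f_pre + m - m_new)  # exponent is <= 0, so no overflow
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    return c_new, n_new, m_new


def run(inputs, stabilized=True):
    """Run a sequence of (z, i_pre, f_pre) triples and return c / n per step."""
    c, n, m = 0.0, 0.0, 0.0
    outputs = []
    for z, i_pre, f_pre in inputs:
        if stabilized:
            c, n, m = stabilized_step(c, n, m, z, i_pre, f_pre)
        else:
            c, n = naive_step(c, n, z, i_pre, f_pre)
        outputs.append(c / n)  # normalized hidden value (output gate omitted)
    return outputs


if __name__ == "__main__":
    small = [(0.5, 0.1, -0.2), (-1.0, 0.3, 0.0), (2.0, -0.5, 0.4)]
    print(run(small, stabilized=False))
    print(run(small, stabilized=True))
```

In this sketch the stabilizer rescales `c` and `n` by the same factor `exp(-m)`, so the ratio `c / n` is unchanged in exact arithmetic, while the stabilized exponents never exceed zero. With pre-activations in the hundreds, `naive_step` fails with an `OverflowError` from `math.exp`, whereas `stabilized_step` still returns finite values.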

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter • 2024

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | WikiText-103 | PPL 21.47 | 146
Fluid Dynamics Prediction | Dam (test) | MAE 0.0293 | 20
Fluid Dynamics Prediction | Channel (test) | MAE 0.0489 | 20
Fluid Dynamics Prediction | Cavity (test) | MAE 0.0169 | 20
Fluid Dynamics Prediction | High-Re (test) | MAE 0.0768 | 20
Fluid Dynamics Prediction | Low-Re (test) | MAE 0.0585 | 20
Echocardiography Video Segmentation and Ejection Fraction Estimation | CAMUS | Pearson Correlation 80.6 | 18
Echocardiography Video Segmentation | CAMUS | mDice 92.14 | 9
Echocardiography Video Segmentation | EchoNet-Dynamic | Dice Coefficient 90.24 | 9
Fluid Dynamics Prediction | Dam Zero-shot from Channel | MAE 0.0693 | 9
Showing 10 of 14 rows
