Vision-LSTM: xLSTM as Generic Vision Backbone

About

Transformers are widely used as generic backbones in computer vision, despite initially being introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
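The alternating traversal described in the abstract can be illustrated with a minimal sketch. It assumes that patch tokens are ordered row-major (top-left first) and that "bottom to top" is realized by reversing the token sequence before a block and restoring the order afterwards. The names `patchify`, `toy_block`, and `alternating_stack` are hypothetical, and `toy_block` is only a causal running mean standing in for a real xLSTM block, not the authors' implementation.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into a row-major sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh * gw, patch * patch * c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def toy_block(tokens):
    """Hypothetical stand-in for an xLSTM block: a causal running mean.

    The output at position t depends only on tokens 0..t, which mimics the
    one-directional nature of a recurrent block.
    """
    counts = np.arange(1, len(tokens) + 1)[:, None]
    return np.cumsum(tokens, axis=0) / counts


def alternating_stack(tokens, num_blocks, block=toy_block):
    """Apply blocks 1..num_blocks, flipping the scan direction on even blocks."""
    for i in range(1, num_blocks + 1):
        if i % 2 == 1:
            # Odd block: scan from the first (top) token to the last (bottom).
            tokens = block(tokens)
        else:
            # Even block: reverse, scan, then restore the original order,
            # so the block effectively runs from bottom to top.
            tokens = block(tokens[::-1])[::-1]
    return tokens


if __name__ == "__main__":
    image = np.arange(16, dtype=float).reshape(4, 4, 1)
    tokens = patchify(image, patch=2)  # 4 patch tokens, 4 values each
    out = alternating_stack(tokens, num_blocks=2)
    print(out.shape)
```

With a single direction, the last token would see every other token while the first would see only itself; alternating the direction lets information flow both ways across the stack without making each block bidirectional.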

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Semantic segmentation	ADE20K (val)	mIoU	48.8	3069
Image Classification	ImageNet-1k (val)	Top-1 Accuracy	82.4	708
Image Classification	ImageNet-1k 1.0 (test)	Top-1 Accuracy	82.4	251
Image Quality Assessment	AGIQA-3K	SRCC	0.875	137
Quality Assessment	AIGCIQA 2023	SRCC	0.8436	36
Consistency Assessment	AIGCIQA 2023	SRCC	0.7174	16
Consistency Assessment	AGIQA-3K	SRCC	0.757	15
Left Ventricle Segmentation	CAMUS	mDice	92.71	9
Left Ventricle Segmentation	EchoNet-Dynamic	mDice	91.71	9
Left ventricular ejection fraction (LVEF) estimation	CAMUS	Correlation Coefficient	0.745	9

