
Vision-LSTM: xLSTM as Generic Vision Backbone

About

Transformers are widely used as generic backbones in computer vision, despite having been initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended into a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks in which odd blocks process the sequence of patch tokens from top to bottom, while even blocks go from bottom to top. Experiments show that ViL holds promise for further deployment as a new generic backbone for computer vision architectures.
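The alternating scan directions described above can be sketched as a simple forward pass: odd blocks see the patch tokens in their original order, even blocks see the sequence reversed (and their output is flipped back). The sketch below is a minimal illustration, assuming generic per-block callables as stand-ins for the actual xLSTM blocks; the function name and block interface are hypothetical, not the authors' API.

```python
import numpy as np

def vil_forward(tokens, blocks):
    """Sketch of ViL's alternating scan order.

    tokens: array of shape (seq_len, dim), the patch-token sequence.
    blocks: list of callables standing in for xLSTM blocks (assumption).
    Odd blocks (1st, 3rd, ...) process tokens top-to-bottom; even blocks
    (2nd, 4th, ...) process them bottom-to-top.
    """
    x = tokens
    for i, block in enumerate(blocks):
        if i % 2 == 1:
            # Even block: reverse the sequence, apply the block,
            # then flip the output back to the original token order.
            x = block(x[::-1])[::-1]
        else:
            # Odd block: plain top-to-bottom order.
            x = block(x)
    return x
```

With a directional stand-in block such as a cumulative sum, the two passes visibly accumulate context from opposite ends of the sequence, which is the point of the alternating design.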

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter • 2024

Related benchmarks

Task | Dataset | Result | Rank
Semantic Segmentation | ADE20K (val) | mIoU 48.8 | 2888
Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy 82.4 | 229
Image Quality Assessment | AGIQA-3K | SRCC 0.875 | 131
Quality Assessment | AIGCIQA 2023 | SRCC 0.8436 | 36
Consistency Assessment | AIGCIQA 2023 | SRCC 0.7174 | 16
Consistency Assessment | AGIQA-3K | SRCC 0.757 | 15
Left Ventricle Segmentation | CAMUS | mDice 92.71 | 9
Left Ventricle Segmentation | EchoNet-Dynamic | mDice 91.71 | 9
Left Ventricular Ejection Fraction (LVEF) Estimation | CAMUS | Correlation Coefficient 0.745 | 9
Left Ventricular Ejection Fraction (LVEF) Estimation | EchoNet-Dynamic | Correlation Coefficient 0.691 | 9
