InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
About
Balancing fine-grained local modeling with long-range dependency capture under computational constraints remains a central challenge in sequence modeling. While Transformers provide strong token mixing, they suffer from quadratic complexity, whereas Mamba-style selective state-space models (SSMs) scale linearly but often struggle to capture high-rank and synchronous global interactions. We present a consistency boundary analysis that characterizes when diagonal short-memory SSMs can approximate causal attention and identifies structural gaps that remain. Motivated by this analysis, we propose InfoMamba, an attention-free hybrid architecture. InfoMamba replaces token-level self-attention with a concept bottleneck linear filtering layer that serves as a minimal-bandwidth global interface and integrates it with a selective recurrent stream through information-maximizing fusion (IMF). IMF dynamically injects global context into the SSM dynamics and encourages complementary information usage through a mutual-information-inspired objective. Extensive experiments on classification, dense prediction, and non-vision tasks show that InfoMamba consistently outperforms strong Transformer and SSM baselines, achieving competitive accuracy-efficiency trade-offs while maintaining near-linear scaling.
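The paper does not include code here, but the core idea can be sketched conceptually: a low-rank "concept bottleneck" linear layer summarizes the whole sequence into a small global vector, and a gate fuses that summary into a selective diagonal recurrence at each step. The sketch below is a minimal illustration under assumed shapes and random weights; the function name `infomamba_block`, the bottleneck width `k`, and the gating form are all hypothetical stand-ins, not the paper's actual parameterization, and the mutual-information objective is omitted.

```python
import numpy as np

def infomamba_block(x, k=4, seed=0):
    """Conceptual sketch of an attention-free hybrid block (hypothetical).

    x: (T, d) token sequence. Returns (T, d).
    """
    T, d = x.shape
    rng = np.random.default_rng(seed)

    # Concept bottleneck: compress all T tokens through k global "concepts"
    # (a minimal-bandwidth global interface, in place of self-attention).
    W_down = rng.standard_normal((d, k)) / np.sqrt(d)   # tokens -> concepts
    W_up = rng.standard_normal((k, d)) / np.sqrt(k)     # concepts -> features
    g = (x @ W_down).mean(axis=0) @ W_up                # (d,) global summary

    # Selective diagonal SSM stream: input-dependent decay a_t in (0, 1).
    w_sel = rng.standard_normal(d) / np.sqrt(d)
    a = 1.0 / (1.0 + np.exp(-(x @ w_sel)))              # (T,)

    h = np.zeros(d)
    out = np.empty_like(x)
    for t in range(T):
        # Linear recurrence with selective forgetting.
        h = a[t] * h + (1.0 - a[t]) * x[t]
        # IMF-style fusion (assumed form): a sigmoid gate mixes the local
        # recurrent state with the broadcast global summary.
        gate = 1.0 / (1.0 + np.exp(-(h * g)))
        out[t] = gate * h + (1.0 - gate) * g
    return out
```

Because the global summary is a fixed-size vector computed once per sequence, the extra cost over the plain recurrence is O(T·d·k), preserving the near-linear scaling claimed above.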
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 1.1 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 4.1 | 1151 |
| Image Classification | Food-101 | -- | -- | 542 |
| Object Detection | MS-COCO | AP | 55.3 | 120 |
| Instance Segmentation | MS-COCO | Mask mAP | 47.9 | 60 |
| Semantic Segmentation | ADE20K | mIoU | 53 | 59 |
| Semantic Segmentation | Cityscapes | mIoU | 84.3 | 37 |
| Sentiment Analysis | IMDB | Accuracy | 85.1 | 13 |
| Natural Language Understanding | AGNews | Accuracy | 89.1 | 9 |
| Image Classification | Food-11 | Top-1 Accuracy | 91 | 5 |