Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
About
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, which degrades their performance on fine-grained reasoning tasks and limits their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and applies the resulting modulation through the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
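The abstract does not spell out how the Token-Grid Correlation Module is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes mean-pooled text tokens, scaled dot-product scores against each patch, a softmax over patches, and FiLM in the common form `gamma * h + beta` with `gamma = 1 + W_gamma c`. The function names, the pooling choice, and the projection matrices `w_gamma` and `w_beta` are all hypothetical, and the state-space decoder itself is left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_grid_correlation(text_tokens, patches):
    """Score each image patch against the mean-pooled text vector (assumed design)."""
    d = len(patches[0])
    n_t = len(text_tokens)
    pooled = [sum(tok[j] for tok in text_tokens) / n_t for j in range(d)]
    scores = [dot(p, pooled) / math.sqrt(d) for p in patches]
    weights = softmax(scores)
    # Text-conditioned visual summary: weighted sum of the patch features.
    context = [sum(w * p[j] for w, p in zip(weights, patches)) for j in range(d)]
    return weights, context

def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def film(hidden, context, w_gamma, w_beta):
    """FiLM: per-channel scale and shift predicted from the context vector."""
    gamma = [1.0 + g for g in matvec(w_gamma, context)]  # identity when w_gamma is zero
    beta = matvec(w_beta, context)
    return [[g * h + b for g, h, b in zip(gamma, row, beta)] for row in hidden]

# Toy example: the prompt points along the first feature axis.
text_tokens = [[1.0, 0.0], [1.0, 0.0]]
patches = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, context = token_grid_correlation(text_tokens, patches)

hidden = [[0.5, -0.5], [1.0, 2.0]]
w_gamma = [[0.1, 0.0], [0.0, 0.1]]  # hypothetical learned projections
w_beta = [[0.0, 0.0], [0.0, 0.0]]
modulated = film(hidden, context, w_gamma, w_beta)
```

Under these assumptions the cost is one pass over the text tokens plus one pass over the patches, so it grows linearly with the number of patches instead of forming a full token-by-patch attention map.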
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 26.4 | 204 |
| Diagram Understanding | AI2D (test) | Accuracy | 46.2 | 131 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 69.4 | 79 |
| Science Question Answering | ScienceQA IMG (test) | Accuracy | 56.7 | 74 |
| Visual Question Answering | VQA v2 (test val) | Accuracy | 76.6 | 26 |
| Multimodal Evaluation | MME (test) | Perception Score | 1.38e+3 | 13 |
| Multimodal Benchmarking | MMBench (dev) | Accuracy | 64.6 | 6 |