
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

About

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to capture fine-grained, task-relevant visual regions precisely, which degrades performance on fine-grained reasoning tasks and limits their real-world effectiveness. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and uses them to modulate the state-space model via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
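
The idea of correlating text tokens with image patches and then applying FiLM conditioning can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-pooling of text tokens into a single query vector, the use of cosine similarity, and the weights `w_gamma` and `w_beta` are all assumptions made for illustration. Only the generic FiLM form (a per-feature scale and shift driven by a conditioning signal) is shown; the LFM decoder and the state-space model are not sketched here.

```python
import math

def mean_vector(rows):
    # Pool text tokens into one query vector (assumption: simple mean pooling).
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def token_grid_correlation(text_tokens, patches):
    # One score per image patch. With a pooled query, the cost grows
    # linearly with the number of tokens plus the number of patches.
    query = mean_vector(text_tokens)
    return [cosine(query, p) for p in patches]

def film_modulate(patches, scores, w_gamma, w_beta):
    # FiLM-style modulation: out = gamma * x + beta, per feature.
    # Hypothetical parameterisation: gamma = 1 + s * w_gamma, beta = s * w_beta,
    # so a patch with zero correlation passes through unchanged.
    out = []
    for patch, s in zip(patches, scores):
        out.append([(1.0 + s * g) * x + s * b
                    for x, g, b in zip(patch, w_gamma, w_beta)])
    return out

# Toy data: two text tokens and two image patches, feature dimension 2.
text_tokens = [[1.0, 0.0], [1.0, 0.0]]
patches = [[2.0, 0.0], [0.0, 3.0]]

scores = token_grid_correlation(text_tokens, patches)
modulated = film_modulate(patches, scores, w_gamma=[0.5, 0.5], w_beta=[0.0, 0.0])
```

In this toy example the first patch is aligned with the pooled text query and is scaled up, while the second patch is orthogonal to it and is left unchanged, which is the selective-emphasis behavior the abstract describes.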

Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 26.4 | 204 |
| Diagram Understanding | AI2D (test) | Accuracy | 46.2 | 131 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 69.4 | 79 |
| Science Question Answering | ScienceQA IMG (test) | Accuracy | 56.7 | 74 |
| Visual Question Answering | VQA v2 (test val) | Accuracy | 76.6 | 26 |
| Multimodal Evaluation | MME (test) | Perception Score | 1.38e+3 | 13 |
| Multimodal Benchmarking | MMBench (dev) | Accuracy | 64.6 | 6 |