Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

About

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, which degrades their performance on fine-grained reasoning tasks and limits their effectiveness in real-world use. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and applies the resulting modulation via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
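The abstract describes the Token-Grid Correlation Module only at a high level. As a rough illustration of the general idea (text-to-patch correlations driving FiLM-style scaling and shifting of visual features), here is a minimal NumPy sketch. It is not the authors' implementation: the function name `token_grid_film`, the max-over-tokens pooling, the softmax relevance, and the `w_gamma` / `w_beta` parameters are all assumptions made for illustration, and the state-space model itself is not modeled.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def token_grid_film(text, patches, w_gamma, w_beta):
    """Hypothetical sketch: modulate patch features by text relevance.

    text:    (n_text, d) text token embeddings
    patches: (n_patch, d) image patch embeddings
    w_gamma, w_beta: (d,) per-channel FiLM parameters (assumed, not from the paper)
    """
    # Cosine similarity between every text token and every patch: (n_text, n_patch).
    corr = l2_normalize(text) @ l2_normalize(patches).T
    # Assumed pooling: a patch is as relevant as its best-matching token.
    score = corr.max(axis=0)
    # Softmax over patches gives a relevance distribution that sums to 1.
    score = np.exp(score - score.max())
    relevance = score / score.sum()
    # FiLM: feature-wise scale (gamma) and shift (beta), conditioned on relevance.
    gamma = 1.0 + relevance[:, None] * w_gamma[None, :]
    beta = relevance[:, None] * w_beta[None, :]
    return gamma * patches + beta, relevance

# Toy usage with random embeddings; patch 3 is made identical to text token 1.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
patches = rng.normal(size=(16, 8))
patches[3] = text[1]
modulated, relevance = token_grid_film(text, patches, np.ones(8), np.zeros(8))
```

In this sketch the correlation step costs on the order of n_text × n_patch × d operations, so it grows linearly with the number of patches for a fixed prompt length, which is in the spirit of the linear-time claim above.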

Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha • 2026

Related benchmarks

Task	Dataset	Result
Multi-discipline Multimodal Understanding	MMMU (val)	Accuracy 26.4	212
Diagram Understanding	AI2D (test)	Accuracy 46.2	154
Object Hallucination Evaluation	POPE (test)	Accuracy 69.4	107
Science Question Answering	ScienceQA IMG (test)	Accuracy 56.7	74
Visual Question Answering	VQA v2 (test val)	Accuracy 76.6	26
Multimodal Evaluation	MME (test)	Perception Score 1.38e+3	13
Multimodal Benchmarking	MMBench (dev)	Accuracy 64.6	6
