SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
About
Multimodal Large Language Models (MLLMs) have achieved remarkable progress but incur substantial computational overhead and energy consumption during inference, limiting deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, yet extending them to MLLMs faces two key challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead. We propose SpikeMLLM, the first spike-based framework for MLLMs, which unifies existing ANN quantization methods in the spiking representation space. It incorporates two components: Modality-Specific Temporal Scales (MSTS), guided by Modality Evolution Discrepancy (MED), and Temporally Compressed LIF (TC-LIF), which compresses the timestep count from T=L-1 to T=log2(L)-1. Experiments on four representative MLLMs across diverse multimodal benchmarks show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B, respectively. We further develop a dedicated RTL accelerator tailored to the spike-driven datapath, observing 9.06x higher throughput and 25.8x better power efficiency relative to an FP16 GPU baseline under a deployment-oriented co-design setting, suggesting the promise of algorithm-hardware co-design for efficient multimodal intelligence.
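The timestep compression quoted in the abstract, from T=L-1 to T=log2(L)-1, can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the authors' implementation. It assumes L is the number of discrete levels in the spiking representation and that L is a power of two. The function names are hypothetical.

```python
def timesteps_uncompressed(levels: int) -> int:
    """Timesteps before compression: T = L - 1 (formula from the abstract)."""
    if levels < 2:
        raise ValueError("levels must be >= 2")
    return levels - 1


def timesteps_compressed(levels: int) -> int:
    """Timesteps after compression: T = log2(L) - 1 (formula from the abstract).

    Assumption: L is a power of two, so log2(L) is an exact integer.
    """
    if levels < 2 or levels & (levels - 1):
        raise ValueError("levels must be a power of two >= 2")
    # For a power of two, bit_length() equals log2(L) + 1.
    return levels.bit_length() - 2


if __name__ == "__main__":
    for levels in (16, 32):
        before = timesteps_uncompressed(levels)
        after = timesteps_compressed(levels)
        print(f"L={levels}: T={before} -> T={after}")
```

Under these assumptions, L=16 and L=32 give compressed timestep counts of 3 and 4. These happen to match the Tv/Tt=3/4 setting reported above, although the abstract does not state which values of L were used.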
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 83.46 | 1453 |
| Science Question Answering | ScienceQA | Accuracy | 89.84 | 791 |
| Optical Character Recognition | OCRBench | Score | 829 | 433 |
| Multimodal Perception and Cognition | MME | Overall Score | 2.24e+3 | 270 |
| Document Visual Question Answering | DocVQA | Accuracy | 95.54 | 203 |
| Optical Character Recognition Evaluation | OCRBench | Score | 817 | 91 |
| Multimodal Model Evaluation | MME | MME Score | 2.45e+3 | 77 |
| Multimodal Scientific Reasoning | ScienceQA | Accuracy | 96.83 | 28 |
| Multimodal Large Language Model Inference | Qwen2VL-7B FP16 (inference) | Power Consumption (W) | 7.13 | 2 |