HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

About

Current multimodal large language models possess strong perceptual and reasoning capabilities, but their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
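
As a rough illustration of why tiling caps peak memory, the sketch below splits an image into a grid of fixed-size boxes so that an encoder would only ever see one bounded tile at a time. This is not HyperVL's actual implementation: the function names `tile_boxes` and `peak_tile_pixels` and the 448-pixel tile size are assumptions made for the example, and the abstract does not specify the tiling scheme.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixel coordinates; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile: int = 448) -> List[Box]:
    """Split a width x height image into boxes no larger than tile x tile.

    The tile size of 448 is an assumed value for illustration only.
    Boxes are returned in row-major order; edge tiles may be smaller.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


def peak_tile_pixels(boxes: List[Box]) -> int:
    """Largest number of pixels in any single tile.

    If tiles are encoded one at a time, this (not the full image area)
    bounds the input size the encoder has to hold at once.
    """
    return max((right - left) * (bottom - top) for left, top, right, bottom in boxes)


if __name__ == "__main__":
    boxes = tile_boxes(1000, 600)
    print(len(boxes), "tiles, peak tile pixels:", peak_tile_pixels(boxes))
```

Under these assumptions, the per-tile input is bounded by `tile * tile` pixels regardless of the original image resolution, while the number of tiles grows with the image size.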

HyperAI Team: Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88.9	2056
Multimodal Evaluation	MME	Score 2.11e+3	902
Mathematical Reasoning	MathVista	Score 66.2	566
Multimodal Capability Evaluation	MM-Vet	Score 59	429
OCR Evaluation	OCRBench	Score 859	350
Text-based Visual Question Answering	TextVQA (val)	Accuracy 78.8	276
Multimodal Reasoning	MMMU	Accuracy 44.6	220
Document Visual Question Answering	DocVQA	Accuracy 92.2	203
Multimodal Evaluation	MMStar	Accuracy 61.1	177
Multimodal Reasoning	MMMU-Pro	Accuracy 23.9	171

Showing 10 of 27 rows
