Aligned Vector Quantization for Edge-Cloud Collaborative Vision-Language Models
About
Vision-Language Models (VLMs) are central to Visual Question Answering (VQA) systems and are typically deployed in the cloud due to their high computational demands. However, this cloud-only approach underutilizes edge computing resources and requires significant bandwidth for transmitting raw images. In this paper, we introduce an edge-cloud collaborative VQA system, called LLaVA-AlignedVQ, which features a novel Aligned Vector Quantization algorithm (AlignedVQ) that efficiently compresses intermediate features without compromising accuracy to support partitioned execution. Our experiments demonstrate that LLaVA-AlignedVQ compresses intermediate features by approximately 1365x, reducing data transmission overhead by 96.8% compared to transmitting JPEG90-compressed images to the cloud. Compared to the cloud-only solution, LLaVA-AlignedVQ achieves an inference speedup of 2-15x while maintaining high accuracy, remaining within -2.23% to +1.6% of the original model's performance across eight VQA datasets.
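The core idea of vector quantization is to replace each feature vector with the index of its nearest codeword in a shared codebook, so only small integer indices cross the network. The sketch below is illustrative only (it is not the authors' AlignedVQ implementation, and the codebook size and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, codebook):
    # Map each feature vector to the index of its nearest codeword.
    # features: (N, D), codebook: (K, D) -> indices: (N,)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def decode(indices, codebook):
    # Reconstruct an approximation of the features by codebook lookup.
    return codebook[indices]

codebook = rng.standard_normal((256, 8))   # hypothetical: K=256 codewords of dim 8
features = rng.standard_normal((16, 8))    # 16 edge-side feature vectors
idx = encode(features, codebook)           # only these indices are transmitted
recon = decode(idx, codebook)              # cloud side reconstructs the features
```

With K=256 each index fits in one byte, which is what makes the transmitted payload so much smaller than the raw floating-point features.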
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.98 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 58.06 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 47.25 | 1043 |
| Visual Question Answering | GQA | Accuracy | 63.7 | 963 |
| Object Hallucination Evaluation | POPE | -- | -- | 935 |
| Multimodal Evaluation | MM-Vet | Accuracy | 30.7 | 122 |
| Visual Question Answering | LLaVA-Bench In-the-Wild | Score | 62.7 | 38 |
| Multimodal Benchmarking | MMBench | MMBench Score (en) | 65.37 | 7 |
| Data Compression | Intermediate Features (1, 577, 1024) | Size (KB) | 0.845 | 3 |
| Data Compression | Images 336 x 336 | -- | -- | 2 |
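The reported compression rate can be sanity-checked from the numbers above: intermediate features of shape (1, 577, 1024), assuming float16 storage (2 bytes per value, an assumption not stated on this page), against the 0.845 KB compressed size from the table:

```python
# Back-of-the-envelope check of the ~1365x compression rate.
# Assumes float16 (2 bytes/value) for the raw features; this is an assumption.
raw_kb = 1 * 577 * 1024 * 2 / 1024   # raw feature size in KB = 1154.0
compressed_kb = 0.845                # compressed size from the table above
ratio = raw_kb / compressed_kb
print(f"raw: {raw_kb:.1f} KB, compressed: {compressed_kb} KB, ratio: {ratio:.0f}x")
```

The result (~1366x) is consistent with the approximately 1365x figure cited in the About section.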