Quantized-TinyLLaVA: a new multimodal foundation model enables efficient split learning

About

Multimodal foundation models are increasingly trained on sensitive data across domains such as finance, biomedicine, and personal identifiers. When this data is distributed across partitions, training raises serious privacy concerns because it requires cross-partition data sharing. Split learning addresses these concerns by enabling collaborative model training without raw data exchange between partitions, yet it introduces a significant challenge: transmitting high-dimensional intermediate feature representations between partitions leads to substantial communication costs. To address this challenge, we propose Quantized-TinyLLaVA, a multimodal foundation model with an integrated communication-efficient split learning framework. Our approach adopts a compression module that quantizes intermediate features into discrete representations before transmission, substantially reducing communication overhead. In addition, we derive a principled quantization strategy grounded in entropy coding theory to determine the optimal number of discrete representation levels. We deploy our framework in a two-partition setting, with one partition operating as the client and the other as the server, to realistically simulate distributed training. Under this setup, Quantized-TinyLLaVA achieves an approximate 87.5% reduction in communication overhead with 2-bit quantization, while maintaining the performance of the original 16-bit model across five benchmark datasets. Furthermore, our compressed representations exhibit enhanced resilience against feature inversion attacks, supporting the privacy of the transmitted features. The code is available at https://github.com/anonymous-1742/Quantized-TinyLLaVA.
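The 87.5% figure follows directly from the bit widths: sending 2 bits instead of 16 bits per feature value gives 1 - 2/16 = 0.875. The sketch below illustrates that arithmetic with a plain uniform min-max quantizer and a simple bit-packing step. This is an assumption made for illustration only, not the paper's compression module or its entropy-coding-based level selection; the function names (`quantize`, `dequantize`, `pack`, `reduction`) are hypothetical, and the small side information (offset and scale) that would also need to be transmitted is ignored.

```python
def quantize(features, bits):
    """Uniform min-max quantization to 2**bits levels (illustrative assumption)."""
    levels = (1 << bits) - 1
    lo, hi = min(features), max(features)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest integer code, clamped to [0, levels].
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in features]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Server-side reconstruction of approximate feature values."""
    return [lo + c * scale for c in codes]


def pack(codes, bits):
    """Pack integer codes into bytes, using `bits` bits per code."""
    buf = 0
    for c in codes:
        buf = (buf << bits) | c
    n_bytes = (len(codes) * bits + 7) // 8
    return buf.to_bytes(n_bytes, "big")


def reduction(orig_bits, quant_bits):
    """Fractional reduction in payload size per feature value."""
    return 1 - quant_bits / orig_bits


if __name__ == "__main__":
    feats = [-1.0, -0.2, 0.3, 1.0]           # toy intermediate features
    codes, lo, scale = quantize(feats, bits=2)
    payload = pack(codes, bits=2)             # what the client would send
    approx = dequantize(codes, lo, scale)     # what the server would recover
    print(codes, len(payload), approx)
    print(f"reduction vs 16-bit: {reduction(16, 2):.1%}")
```

In this toy case four 16-bit values (8 bytes) shrink to a single byte of packed 2-bit codes, at the cost of a reconstruction error of at most half a quantization step per value.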

Jiajun Guo, Xin Luo, Jiayin Zheng, Yiqun Wang, Kai-Wei Chang, Wei Wang, Jie Liu • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	VQA v2	--	1429
Multimodal Evaluation	MME	--	902
Multimodal Capability Evaluation	MM-Vet	Score: 20.4	429
Aggregated Performance Benchmarking	Combined Multimodal Evaluation Summary	Overall Score: 53.3	17
Communication Cost Analysis	LLaVa 1.5	Total Latency (s): 97.268	7
