
Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation

About

Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) is hampered by a fragmented ecosystem of specialized cloud models, one per modality. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches neither adapt them into a single end-to-end backbone nor provide effective knowledge-distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified cloud-edge framework. First, we adapt a foundational MLLM into a state-of-the-art cloud model: instruction-based prompting guides the MLLM to produce a unified embedding space across RGB, infrared, sketch, and text modalities, and the model is trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy under a holistic cross-modal alignment objective. Second, to transfer this knowledge to an edge-native student, we introduce a novel distillation strategy motivated by the low-rank structure of the teacher's feature space: a Principal Component Mapping loss prioritizes essential information, while a Feature Relation loss preserves relational structure. Our lightweight edge model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, and its cloud counterpart excels across all CM-ReID benchmarks. MLLMEmbed-ReID thus offers a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.
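The two distillation losses described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the rank `k`, the mean-centering step, the use of cosine similarity for the relation term, and the assumption that teacher and student share the same feature dimension are all assumptions made here for clarity.

```python
import numpy as np

def pcm_loss(teacher_feats, student_feats, k=4):
    """Principal Component Mapping loss (sketch).

    SVD of the (centered) teacher features exposes the low-rank
    structure of the teacher's feature space; both teacher and student
    features are projected onto the top-k principal directions and
    matched there, so essential information is prioritized.
    """
    _, _, vt = np.linalg.svd(teacher_feats - teacher_feats.mean(axis=0),
                             full_matrices=False)
    basis = vt[:k].T                      # (dim, k) principal directions
    t_proj = teacher_feats @ basis
    s_proj = student_feats @ basis
    return np.mean((t_proj - s_proj) ** 2)

def feature_relation_loss(teacher_feats, student_feats):
    """Feature Relation loss (sketch): match pairwise cosine-similarity
    matrices so the student preserves the teacher's relational structure."""
    def sim(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    return np.mean((sim(teacher_feats) - sim(student_feats)) ** 2)

# Toy usage: a student whose features are a noisy copy of the teacher's.
rng = np.random.default_rng(0)
t = rng.standard_normal((8, 16))          # 8 samples, 16-dim teacher feats
s = t + 0.1 * rng.standard_normal((8, 16))
total = pcm_loss(t, s) + feature_relation_loss(t, s)
```

In practice the two terms would be weighted and added to the student's task loss; the weights are hyperparameters not specified here.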

Hongbo Jiang, Jie Li, Xinqi Cai, Tianyu Xie, Yunhang Shen, Pingyang Dai, Liujuan Cao • 2026

Related benchmarks

Task | Dataset | Rank-1 | Leaderboard rank
Text-to-image Person Re-identification | CUHK-PEDES (test) | 65.29 | 150
Text-to-image Person Re-identification | ICFG-PEDES (test) | 0.5949 | 81
Text-based Person Re-identification | RSTPReid (test) | 59.28 | 52
Cross-modal Person Re-identification | CUHK-PEDES (test) | 92.87 | 24
Sketch-to-Real Person Re-identification | ICFG-PEDES (test) | 88.1 | 7
Sketch-to-Real Person Re-identification | RSTPReid (test) | 87.01 | 7
Infrared-to-Real Person Re-identification | CUHK-PEDES (test) | 92.87 | 6
Infrared-to-Real Person Re-identification | ICFG-PEDES (test) | 88.95 | 6
Infrared-to-Real Person Re-identification | RSTPReid (test) | 85.89 | 6
