CEMG: Collaborative-Enhanced Multimodal Generative Recommendation
About
Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.
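The tokenization stage above can be sketched as a residual quantization loop: at each level, the current residual is matched to its nearest codebook entry, the index is recorded as one digit of the item's semantic ID, and the entry is subtracted before the next level. This is a minimal illustrative sketch, not CEMG's actual implementation; the function name `residual_quantize`, the codebook sizes, and the embedding dimension are all assumptions.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize a fused item embedding into discrete semantic codes.

    At each level, pick the codebook entry nearest to the current
    residual, record its index, and subtract the entry before moving
    to the next level (the core idea behind RQ-VAE tokenization).
    """
    codes = []
    residual = np.asarray(x, dtype=np.float64).copy()
    for codebook in codebooks:                      # codebook shape: (K, d)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest entry at this level
        codes.append(idx)
        residual = residual - codebook[idx]         # quantize the remainder next
    return codes

# Toy setup (dimensions are illustrative assumptions)
rng = np.random.default_rng(0)
d, K, levels = 8, 16, 3
codebooks = [rng.normal(size=(K, d)) for _ in range(levels)]
fused = rng.normal(size=d)          # stand-in for the fused multimodal embedding
codes = residual_quantize(fused, codebooks)
print(codes)                        # a 3-level semantic ID, e.g. three indices in [0, 16)
```

In the full framework, these per-level indices would serve as the discrete item tokens that the fine-tuned language model learns to generate autoregressively.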
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Generative Recommendation | Beauty | HR@10 | 6.65 | 10 |
| Multimodal Generative Recommendation | Sports | HR@10 | 3.63 | 10 |
| Multimodal Generative Recommendation | Yelp | HR@10 | 4.58 | 10 |
| Cold-start recommendation | Beauty (test) | HR@10 | 0.0305 | 4 |
| Cold-start recommendation | Sports (test) | HR@10 | 1.83 | 4 |
| Cold-start recommendation | Yelp (test) | HR@10 | 2.31 | 4 |