Taming the Long Tail: Denoising Collaborative Information for Robust Semantic ID Generation

About

Item IDs form the backbone of industrial recommender systems, but suffer from representation instability and poor long-tail generalization in large, dynamic item corpora. Semantic IDs (SIDs) mitigate these issues by enabling knowledge sharing through quantization of item content features. Existing methods attempt to enhance SID expressiveness by combining collaborative information with content features; however, they often overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed, resulting in a significant quality gap in collaborative information between popular and long-tail items. This mismatch leads to two critical limitations: (1) Collaborative Noise Corrupts Behavior-Content Alignment: Behavior-content alignment is a prevailing approach for modeling shared information. However, indiscriminate alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Collaborative Noise Obscures Critical Behavioral SIDs: When modeling modality-specific information, prior works typically generate multiple behavioral SIDs with equal weights for each item. This equal-weight scheme fails to reflect the varying importance of different behavioral SIDs, making it difficult for downstream tasks to distinguish informative SIDs from noisy ones. To address these challenges, we propose ADC-SID, a framework that Adaptively Denoises Collaborative information for SID quantization. It comprises two key components: (i) Adaptive Behavior-Content Alignment, which adjusts alignment strength to mitigate corruption caused by collaborative noise; and (ii) Dynamic Behavioral Weighting Mechanism, which learns importance scores for behavioral SIDs to enable downstream models to suppress noise. Extensive experiments have demonstrated ADC-SID's superiority...
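
The abstract gives no formulas for the two components, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes, purely for illustration, that alignment strength is a sigmoid gate on an item's interaction count, that alignment is measured as squared distance between content and behavior vectors, and that importance scores for behavioral SIDs are a softmax over per-SID scores. All function names, the `midpoint` and `steepness` parameters, and the sample values are hypothetical.

```python
import math


def alignment_weight(interaction_count, midpoint=50.0, steepness=0.1):
    """Hypothetical gate in (0, 1] that grows with an item's interaction count.

    Long-tail items (few interactions) get a small weight, so their noisy
    collaborative signal pulls less on the content representation.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (interaction_count - midpoint)))


def weighted_alignment_loss(content_vec, behavior_vec, interaction_count):
    """Squared-distance alignment term scaled by the hypothetical gate."""
    dist = sum((c - b) ** 2 for c, b in zip(content_vec, behavior_vec))
    return alignment_weight(interaction_count) * dist


def behavioral_sid_weights(scores):
    """Softmax over per-SID scores, standing in for learned importance scores."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Same content/behavior mismatch, different popularity (illustrative values).
popular = weighted_alignment_loss([1.0, 0.0], [0.0, 1.0], interaction_count=500)
long_tail = weighted_alignment_loss([1.0, 0.0], [0.0, 1.0], interaction_count=3)

# Three behavioral SIDs for one item, with made-up raw scores.
weights = behavioral_sid_weights([2.0, 0.5, -1.0])
```

Under these assumptions the popular item contributes a larger alignment penalty than the long-tail item for the same mismatch, and a downstream model could use `weights` to down-weight the lower-scored behavioral SIDs instead of treating all of them equally.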

Yi Xu, Moyu Zhang, Chaofan Fan, Jinxin Hu, Xiaochen Li, Yu Zhang, Xiaoyi Zeng, Jing Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
| --- | --- | --- | --- | --- |
| Discriminative Ranking | Amazon Beauty | AUC | 64.8 | 15 |
| Discriminative Ranking | Industrial Dataset | AUC | 0.7101 | 15 |
| Generative Retrieval | Industrial Dataset | Reconstruction Loss | 0.0031 | 14 |
| Discriminative Ranking | Industrial Dataset (test) | Reconstruction Loss | 0.0031 | 7 |
| Discriminative Ranking | Amazon Beauty (test) | Reconstruction Loss (L_recon) | 0.447 | 7 |
| Discriminative Ranking | Large-scale e-commerce platform Online Traffic A/B (test) | Advertising Revenue Lift | 1.56 | 1 |
| Generative Retrieval | Large-scale e-commerce platform Online Traffic A/B (test) | Advertising Revenue Uplift | 3.5 | 1 |
