Taming the Long Tail: Denoising Collaborative Information for Robust Semantic ID Generation
About
Item IDs form the backbone of industrial recommender systems, but they suffer from representation instability and poor long-tail generalization in large, dynamic item corpora. Semantic IDs (SIDs) mitigate these issues by enabling knowledge sharing through quantization of item content features. Existing methods attempt to enhance SID expressiveness by combining collaborative information with content features; however, they often overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed, resulting in a significant quality gap in collaborative information between popular and long-tail items. This mismatch leads to two critical limitations: (1) Collaborative Noise Corrupts Behavior-Content Alignment: Behavior-content alignment is a prevailing approach for modeling shared information. However, indiscriminate alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Collaborative Noise Obscures Critical Behavioral SIDs: When modeling modality-specific information, prior works typically generate multiple behavioral SIDs with equal weights for each item. This equal-weight scheme fails to reflect the varying importance of different behavioral SIDs, making it difficult for downstream tasks to distinguish informative SIDs from noisy ones. To address these challenges, we propose ADC-SID, a framework that Adaptively Denoises Collaborative information for SID quantization. It comprises two key components: (i) Adaptive Behavior-Content Alignment, which adjusts alignment strength to mitigate corruption caused by collaborative noise; and (ii) Dynamic Behavioral Weighting Mechanism, which learns importance scores for behavioral SIDs to enable downstream models to suppress noise. Extensive experiments have demonstrated ADC-SID's superiority...
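To make the two components more concrete, here is a minimal NumPy sketch of the general idea. It is not the authors' implementation: the abstract does not say how alignment strength or importance scores are computed. The popularity-based sigmoid weight, the cosine alignment loss, the softmax importance scores, and all function and parameter names (`alignment_weights`, `weighted_alignment_loss`, `behavioral_sid_importance`, `tau`) are assumptions chosen for illustration only.

```python
import numpy as np

def alignment_weights(interaction_counts, tau=1.0):
    """Hypothetical per-item alignment strength in (0, 1).

    Assumption: items with more interactions carry cleaner collaborative
    signal, so they receive a stronger alignment weight.
    """
    counts = np.asarray(interaction_counts, dtype=float)
    log_c = np.log1p(counts)
    # Sigmoid centered on the mean log-count; tau is an assumed temperature.
    return 1.0 / (1.0 + np.exp(-(log_c - log_c.mean()) / tau))

def weighted_alignment_loss(content_emb, behavior_emb, weights):
    """Weighted mean of (1 - cosine similarity) between the two views."""
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    b = behavior_emb / np.linalg.norm(behavior_emb, axis=1, keepdims=True)
    per_item = 1.0 - np.sum(c * b, axis=1)
    # Long-tail items (small weight) contribute less to the alignment term.
    return float(np.sum(weights * per_item) / np.sum(weights))

def behavioral_sid_importance(logits):
    """Softmax over per-SID logits: importance scores that sum to 1 per item."""
    s = np.asarray(logits, dtype=float)
    s = s - s.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

# Toy data: four items, from very popular to never interacted with.
rng = np.random.default_rng(0)
counts = np.array([10000, 500, 3, 0])
w = alignment_weights(counts)

content = rng.normal(size=(4, 8))    # stand-in content embeddings
behavior = rng.normal(size=(4, 8))   # stand-in behavioral embeddings
loss = weighted_alignment_loss(content, behavior, w)

# Three behavioral SIDs per item, each with a stand-in learned logit.
imp = behavioral_sid_importance(rng.normal(size=(4, 3)))
```

In this sketch the long-tail items get a small alignment weight, so their noisy behavioral embeddings pull less on the content representation, and a downstream model could multiply each behavioral SID embedding by its importance score instead of treating all SIDs equally.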
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Discriminative Ranking | Amazon Beauty | AUC | 64.8 | 15 |
| Discriminative Ranking | Industrial Dataset | AUC | 0.7101 | 15 |
| Generative Retrieval | Industrial Dataset | Reconstruction Loss | 0.0031 | 14 |
| Discriminative Ranking | Industrial Dataset (test) | Reconstruction Loss | 0.0031 | 7 |
| Discriminative Ranking | Amazon Beauty (test) | Reconstruction Loss (L_recon) | 0.447 | 7 |
| Discriminative Ranking | Large-scale e-commerce platform Online Traffic A/B (test) | Advertising Revenue Lift | 1.56 | 1 |
| Generative Retrieval | Large-scale e-commerce platform Online Traffic A/B (test) | Advertising Revenue Uplift | 3.5 | 1 |