Taming the Long Tail: Denoising Collaborative Information for Robust Semantic ID Generation

About

Item IDs form the backbone of industrial recommender systems, but suffer from representation instability and poor long-tail generalization in large, dynamic item corpora. Semantic IDs (SIDs) mitigate these issues by enabling knowledge sharing through quantization of item content features. Existing methods attempt to enhance SID expressiveness by incorporating collaborative information with content features; however, they often overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed, resulting in a significant quality gap in collaborative information between popular and long-tail items. This mismatch leads to two critical limitations: (1) Collaborative Noise Corrupts Behavior-Content Alignment: Behavior-content alignment is a prevailing approach for modeling shared information. However, indiscriminate alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Collaborative Noise Obscures Critical Behavioral SIDs: When modeling modality-specific information, prior works typically generate multiple behavioral SIDs with equal weights for each item. This equal-weight scheme fails to reflect the varying importance of different behavioral SIDs, making it difficult for downstream tasks to distinguish informative SIDs from noisy ones. To address these challenges, we propose ADC-SID, a framework that Adaptively Denoises Collaborative information for SID quantization. It comprises two key components: (i) Adaptive Behavior-Content Alignment, which adjusts alignment strength to mitigate corruption caused by collaborative noise; and (ii) Dynamic Behavioral Weighting Mechanism, which learns importance scores for behavioral SIDs to enable downstream models to suppress noise. Extensive experiments have demonstrated ADC-SID's superiority...
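
The two components named in the abstract can be illustrated with a small sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the function names, the sigmoid gate on interaction count (with hypothetical `midpoint` and `steepness` parameters), the cosine-based alignment penalty, and the softmax pooling are stand-ins, not the ADC-SID implementation.

```python
import math


def alignment_weight(interaction_count, midpoint=20.0, steepness=0.25):
    """Hypothetical gate in (0, 1): items with few interactions get weak alignment."""
    return 1.0 / (1.0 + math.exp(-steepness * (interaction_count - midpoint)))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def adaptive_alignment_loss(content_vecs, behavior_vecs, interaction_counts):
    """Mean over items of gate * (1 - cosine(content, behavior)).

    Assumed form: a noisy behavioral embedding of a long-tail item is
    down-weighted, so it pulls less on that item's content representation.
    """
    weights = [alignment_weight(c) for c in interaction_counts]
    total = sum(
        w * (1.0 - cosine(x, y))
        for w, x, y in zip(weights, content_vecs, behavior_vecs)
    )
    return total / len(weights)


def weighted_sid_pooling(sid_embeddings, importance_logits):
    """Softmax the importance scores, then take the weighted sum of SID embeddings.

    In a trained model the logits would be learned; here they are plain inputs.
    """
    m = max(importance_logits)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in importance_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(sid_embeddings[0])
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, sid_embeddings))
        for d in range(dim)
    ]
    return weights, pooled
```

Under these assumptions, a popular item and a long-tail item with the same content-behavior mismatch contribute very differently to the alignment loss, and a downstream model receives behavioral SID embeddings scaled by importance rather than with equal weights.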

Yi Xu, Moyu Zhang, Chaofan Fan, Jinxin Hu, Xiaochen Li, Yu Zhang, Xiaoyi Zeng, Jing Zhang • 2025

Related benchmarks

Task	Dataset	Result
Sequential Recommendation	Amazon Beauty (test)	NDCG@10 4.22	194
Sequential Recommendation	Amazon Toys (test)	NDCG@5 4.01	44
Sequential Recommendation	Amazon Sports (test)	NDCG@5 0.0206	42
Discriminative Ranking	Amazon Beauty	AUC 64.8	15
Discriminative Ranking	Industrial Dataset	AUC 0.7101	15
Generative Retrieval	Industrial Dataset	Reconstruction Loss 0.0031	14
Discriminative Ranking	Industrial Dataset (test)	Reconstruction Loss 0.0031	7
Discriminative Ranking	Amazon Beauty (test)	Reconstruction Loss (L_recon) 0.447	7
Recommendation	Industrial 35M-item catalog (test)	R@50 27.72	7
Discriminative Ranking	Large-scale e-commerce platform Online Traffic A/B (test)	Advertising Revenue Lift 1.56	1

Showing 10 of 11 rows
