Decoupled Residual Quantization for Robust Semantic IDs in Recommendation

About

Semantic IDs represent items as shared discrete token sequences and have become a practical tool for recommendation and retrieval. Yet it remains difficult to tell why a tokenizer fails: poor quality may come from codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. This paper develops a quantitative framework for diagnosing these failures through expected codeword overlap and effective codebook capacity. The former measures expected codeword confusion under retrieval-time perturbation, while the latter converts that confusion into an effective number of usable, well-separated codes. The framework links semantic boundary confusion to both code usage imbalance and Euclidean geometric constraints. As a proof of concept, we present Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a large-scale industrial dataset show that Semantic ID quality is multi-objective: symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different aspects of a tokenizer. These downstream observations are based on one proprietary industrial dataset, so they should be read as a case study rather than a universal benchmark claim.

Xuesi Wang, Junjie Wang, Ziliang Wang, Weijie Bian, Guanxing Zhang • 2026
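
As a rough illustration of the quantities the abstract discusses, the sketch below builds Semantic IDs with plain residual quantization and computes two simple stand-in diagnostics. It is not the paper's DRQ method, and neither diagnostic is the paper's definition of expected codeword overlap or effective codebook capacity: `effective_codes` is the perplexity of empirical code usage, and `flip_rate` is a Monte Carlo estimate of how often Gaussian noise changes the assigned code. All function names, the noise model, and the random codebooks are assumptions made for illustration.

```python
import numpy as np


def nearest_code(x, codebook):
    # Squared Euclidean distance from every row of x to every codeword.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def residual_quantize(x, codebooks):
    """Plain residual quantization (not DRQ): each level encodes
    whatever the previous levels left unexplained."""
    residual = x.copy()
    ids = []
    for cb in codebooks:
        idx = nearest_code(residual, cb)
        ids.append(idx)
        residual = residual - cb[idx]
    # ids has shape (num_items, num_levels): one token per level.
    return np.stack(ids, axis=1), residual


def effective_codes(idx, num_codes):
    """Perplexity exp(H) of empirical code usage. A usage-only proxy
    (assumption): it ignores how well separated the codes are."""
    counts = np.bincount(idx, minlength=num_codes).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))


def flip_rate(x, codebook, sigma, rng, trials=20):
    """Fraction of items whose assigned code changes under isotropic
    Gaussian noise of scale sigma (hypothetical perturbation model)."""
    base = nearest_code(x, codebook)
    flips = 0.0
    for _ in range(trials):
        noisy = x + sigma * rng.standard_normal(x.shape)
        flips += (nearest_code(noisy, codebook) != base).mean()
    return flips / trials


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((512, 8))                 # toy item embeddings
    codebooks = [rng.standard_normal((16, 8)) for _ in range(2)]
    ids, residual = residual_quantize(x, codebooks)
    print(ids.shape)                                  # one 2-token ID per item
    print(effective_codes(ids[:, 0], 16))             # at most 16
    print(flip_rate(x, codebooks[0], 0.1, rng))       # between 0 and 1
```

A usage-only perplexity can look healthy even when many codewords sit close together, which is one reason the abstract treats boundary confusion and usage imbalance as linked but distinct failure modes.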

Related benchmarks

Task	Dataset	Metric	Result	Rank
Item-to-item retrieval	Proprietary industrial short-video dataset	SID Embedding AUC	91.21	5
Item-to-item retrieval	Proprietary industrial short-video dataset (Evaluation Pool)	Hit Rate Retention @20	0.9999	5