Decoupled Residual Quantization for Robust Semantic IDs in Recommendation

About

Semantic IDs represent items as shared discrete token sequences and have become a practical tool for recommendation and retrieval. Yet it remains difficult to tell why a tokenizer fails: poor quality may come from codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. This paper develops a quantitative framework for diagnosing these failures through expected codeword overlap and effective codebook capacity. The former measures expected codeword confusion under retrieval-time perturbation, while the latter converts that confusion into an effective number of usable, well-separated codes. The framework links semantic boundary confusion to both code usage imbalance and Euclidean geometric constraints. As a proof of concept, we present Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a large-scale industrial dataset show that Semantic ID quality is multi-objective: symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different aspects of a tokenizer. These downstream observations are based on one proprietary industrial dataset, so they should be read as a case study rather than a universal benchmark claim.
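The abstract names two diagnostics but does not give their formulas here, so the following is only a minimal sketch of the general idea using common stand-ins, not the paper's definitions. It assumes a single flat codebook with nearest-neighbour assignment, isotropic Gaussian noise as the "retrieval-time perturbation", the fraction of changed assignments as a proxy for codeword confusion, and usage perplexity as a proxy for the number of effectively used codes. The function names, the toy codebook, and the noise level are all hypothetical.

```python
import math
import random


def nearest_code(x, codebook):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])))


def flip_rate(items, codebook, sigma, trials=200, seed=0):
    """Fraction of noisy copies whose nearest codeword differs from the clean one.

    Assumed proxy for codeword confusion under perturbation; not the paper's metric.
    """
    rng = random.Random(seed)
    flips = 0
    for x in items:
        clean = nearest_code(x, codebook)
        for _ in range(trials):
            noisy = [a + rng.gauss(0.0, sigma) for a in x]
            flips += nearest_code(noisy, codebook) != clean
    return flips / (len(items) * trials)


def usage_perplexity(items, codebook):
    """exp(entropy) of the code-usage histogram.

    Equals the codebook size when usage is uniform and 1 when every item
    collapses onto a single code. Assumed proxy for effective capacity.
    """
    counts = [0] * len(codebook)
    for x in items:
        counts[nearest_code(x, codebook)] += 1
    n = len(items)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c)
    return math.exp(entropy)


# Toy data (hypothetical): the fourth codeword is never used, and the last
# item sits close to the decision boundary between codewords 0 and 1.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
items = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.9], [0.45, 0.0]]

print("usage perplexity:", usage_perplexity(items, codebook))
print("flip rate at sigma=0.05:", flip_rate(items, codebook, sigma=0.05))
```

In this toy setup the two failure modes show up separately: the unused fourth codeword lowers the usage perplexity below the codebook size, while the item near a decision boundary is what drives the flip rate above zero.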

Xuesi Wang, Junjie Wang, Ziliang Wang, Weijie Bian, Guanxing Zhang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Item-to-item retrieval | Proprietary industrial short-video dataset | SID Embedding AUC | 91.21 | 5
Item-to-item retrieval | Proprietary industrial short-video dataset (Evaluation Pool) | Hit Rate Retention @20 | 0.9999 | 5
