Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

RAD-DPO: Robust Adaptive Denoising Direct Preference Optimization for Generative Retrieval in E-commerce

About

Generative Retrieval (GR) is rapidly transforming e-commerce search by replacing traditional multi-stage pipelines with the autoregressive decoding of structured Semantic IDs (SIDs). Despite this architectural efficiency, aligning GR models with nuanced, real-world user preferences remains a critical challenge. While Direct Preference Optimization (DPO) offers an efficient alignment solution, its direct application to structured SIDs suffers from three limitations: (i) it penalizes shared hierarchical prefixes, causing gradient conflicts; (ii) it is vulnerable to noisy pseudo-negatives from implicit feedback; and (iii) in multi-label queries with multiple relevant items, it exacerbates a probability "squeezing effect" among valid candidates. To address these issues, we propose RAD-DPO, which introduces token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to mitigate label noise, and a multi-label global contrastive objective integrated with global SFT loss to explicitly expand positive coverage. Extensive offline evaluations and large-scale online A/B testing on JD.com's core search engine demonstrate that RAD-DPO achieves significant improvements in both retrieval precision and training efficiency, proving its robustness for massive industrial deployments.

Zhiguo Chen, Guohao Sun, Yiming Qiu, Xingzhi Yao, Mingming Li, Huimu Wang, Yangqi Zhang, Songlin Wang, Sulong Xu• 2026

Related benchmarks

TaskDatasetResultRank
Generative RetrievalJD.com industrial dataset (test)
Hallucination Rate6.47
8
Showing 1 of 1 rows

Other info

Follow for update