
Improving Training Stability for Multitask Ranking Models in Recommender Systems

About

Recommender systems play an important role in many content platforms. While most recommendation research is dedicated to designing better models to improve user experience, we found that research on stabilizing the training of such models is severely under-explored. As recommendation models become larger and more sophisticated, they grow more susceptible to training instability, i.e., loss divergence, which can render a model unusable, waste significant resources, and block model development. In this paper, we share the findings and best practices we learned while improving the training stability of a real-world multitask ranking model for YouTube recommendations. We show some properties of the model that lead to unstable training and conjecture on the causes. Furthermore, based on our observations of training dynamics near the point of training instability, we hypothesize why existing solutions would fail, and propose a new algorithm to mitigate their limitations. Our experiments on a YouTube production dataset show that the proposed algorithm can significantly improve training stability without compromising convergence, compared with several commonly used baseline methods.
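The abstract refers to commonly used baseline methods for containing loss divergence; gradient clipping by global norm is one such standard baseline (the paper's own algorithm is not specified here). A minimal sketch, using NumPy and a hypothetical `clip_by_global_norm` helper, not the authors' method:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm -- a common training-stability baseline.
    Returns the (possibly rescaled) gradients and the original norm."""
    global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # Scale down only when the norm exceeds the budget; the small
    # epsilon guards against division by zero on all-zero gradients.
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

# Example: a gradient of norm 5 clipped to norm 1.
clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

Clipping bounds the size of any single update, so a transient gradient spike cannot immediately blow up the weights, though (as the abstract suggests) it may not suffice for the instabilities seen in large multitask ranking models.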

Jiaxi Tang, Yoel Drori, Daryl Chang, Maheswaran Sathiamoorthy, Justin Gilmer, Li Wei, Xinyang Yi, Lichan Hong, Ed H. Chi · 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy 54.85 | 776 |
| Boolean Question Answering | BoolQ | Accuracy 57.46 | 307 |
| Question Answering | ARC-E | Accuracy 47.69 | 242 |
| Multitask Language Understanding | MMLU | Accuracy 25.36 | 206 |
| Science Question Answering | SciQ | Normalized Accuracy 73.4 | 44 |
| Physical Commonsense Reasoning | PIQA | Accuracy 72.74 | 41 |
| Question Answering | OpenBookQA | Normalized Accuracy 32.4 | 35 |
| Commonsense Reasoning | HellaSwag | HellaSwag Score 53.34 | 27 |
| Question Answering | ARC-C | Accuracy (Normalized) 27.73 | 11 |
| Natural Language Understanding and Reasoning | Standard Downstream Benchmarks Two-Shot (val) | ARC-E Accuracy (Normalized) 52.86 | 11 |

(Showing 10 of 11 rows.)
