A Holistic Approach to Undesired Content Detection in the Real World
About
We present a holistic approach to building a robust and useful natural language classification system for real-world content moderation. The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions, data quality control, an active learning pipeline to capture rare events, and a variety of methods to make the model robust and to avoid overfitting. Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment. This approach generalizes to a wide range of different content taxonomies and can be used to create high-quality content classifiers that outperform off-the-shelf models.
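As an illustration of the active learning step mentioned above, the following is a minimal sketch of uncertainty sampling, one common way to pick which unlabeled samples to send to annotators. It is not the authors' pipeline, which this summary does not specify. The function name `select_for_labeling`, the `budget` and `threshold` parameters, and the example scores are all hypothetical.

```python
from typing import List, Tuple


def select_for_labeling(
    scored: List[Tuple[str, float]],
    budget: int,
    threshold: float = 0.5,
) -> List[str]:
    """Return up to `budget` texts whose model score is closest to `threshold`.

    `scored` pairs each unlabeled text with a classifier probability for one
    undesired-content category. Scores near the threshold are the ones the
    model is least certain about, so they are prioritized for labeling.
    """
    # Sort by distance from the decision threshold (most uncertain first).
    ranked = sorted(scored, key=lambda item: abs(item[1] - threshold))
    return [text for text, _ in ranked[:budget]]


# Hypothetical pool of unlabeled samples with made-up model scores.
pool = [
    ("sample a", 0.02),
    ("sample b", 0.48),
    ("sample c", 0.97),
    ("sample d", 0.61),
]
picked = select_for_labeling(pool, budget=2)
print(picked)
```

A production pipeline would likely combine several selection signals rather than a single threshold distance. The sketch only shows the basic idea of steering labeling effort toward samples the current model finds ambiguous.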
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Response Harmfulness Detection | HarmBench | F1 Score | 20.6 | 100 |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 46.6 | 76 |
| Response Harmfulness Detection | Beavertails | F1 Score | 15.7 | 59 |
| Harmfulness Detection | WildGuard | Macro F1 Score | 16.9 | 47 |
| Harmfulness Detection | OpenAI Moderation | Macro F1 Score | 79 | 45 |
| Toxicity Detection | ToxicChat | F1 Score | 0.254 | 45 |
| Prompt Harmfulness Detection | AegisSafety (test) | F1 Score | 31.9 | 41 |
| Response Harmfulness Detection | SafeRLHF | F1 Score | 10.1 | 41 |
| Response Classification | EXPGUARD (test) | Financial Score | 0.00e+0 | 40 |
| Prompt Classification | EXPGUARD (test) | Financial Performance Score | 0.00e+0 | 28 |