
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

About

As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy comprising 13 critical risk categories and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to support further research and to help benchmark LLMs for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models, named AEGISSAFETYEXPERTS, not only surpass or perform competitively with state-of-the-art LLM-based safety models and general-purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT-Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.

Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien• 2024
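
The abstract's "no-regret online adaptation over an ensemble of experts" is the family of multiplicative-weights (Hedge-style) algorithms, whose regret against the best single expert grows only sublinearly. The sketch below is a minimal illustration of that idea, not the paper's actual implementation: it assumes each expert returns a probability that an input is unsafe, the ensemble flags content when the weighted average crosses a threshold, and weights are updated from per-item feedback. The function names, toy experts, squared-error loss, and learning rate eta are all illustrative choices.

```python
import math

def hedge_update(weights, losses, eta=0.5):
    """One multiplicative-weights (Hedge) step: down-weight each expert
    exponentially in the loss it just incurred, then renormalize."""
    new_w = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(new_w)
    return [w / total for w in new_w]

def moderate(item, experts, weights, threshold=0.5):
    """Ensemble verdict: weighted average of each expert's P(unsafe);
    flag the item if the average crosses the threshold."""
    scores = [expert(item) for expert in experts]
    p_unsafe = sum(w * s for w, s in zip(weights, scores))
    return p_unsafe >= threshold, scores

# Toy usage with two hypothetical experts standing in for LLM safety models.
experts = [
    lambda text: 0.9 if "attack" in text else 0.1,  # strict expert
    lambda text: 0.6 if "attack" in text else 0.4,  # lenient expert
]
weights = [0.5, 0.5]

stream = [("how to attack a server", 1), ("recipe for soup", 0)]
for text, label in stream:
    flagged, scores = moderate(text, experts, weights)
    # Per-expert loss: squared error against the ground-truth label.
    losses = [(s - label) ** 2 for s in scores]
    weights = hedge_update(weights, losses)
    print(f"{text!r}: flagged={flagged}, weights={weights}")
```

Because the update is multiplicative, experts that repeatedly disagree with feedback lose influence quickly, while the ensemble's cumulative loss provably tracks that of the best expert in hindsight; this is the property that lets the deployed moderator adapt online without retraining any individual safety model.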

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Safety Classification | SafeRLHF | F1 Score: 0.593 | 48 |
| Response Classification | EXPGUARD (test) | Financial Score: 91.8 | 40 |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1: 60.4 | 34 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total): 56.4 | 30 |
| Prompt Classification | EXPGUARD (test) | Financial Performance Score: 1.8 | 28 |
| Safety Classification | WildGuardMix (test) | -- | 27 |
| Response Harmfulness Detection | HarmBench | F1 Score: 62.2 | 23 |
| Prompt Harmfulness Classification | Public Prompt Harmfulness Benchmarks (ToxicChat, OpenAI Moderation, AegisSafetyTest, SimpleSafetyTests, HarmBenchPrompt) | Toxic Score: 73 | 19 |
| Prompt Harmfulness Detection | Text & Image Benchmarks Average | F1 Score: 73.83 | 19 |
| Safety Moderation | WILDJAILBREAK (val) | ASR: 0.9 | 18 |
Showing 10 of 16 rows
