SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

About

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs, attack methods, and defense methods. SALAD-Bench goes beyond conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. It comprises a carefully constructed array of questions, from standard queries to complex ones enriched with attack modifications, defense modifications, and multiple-choice formats. To manage the inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, giving it joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released at https://github.com/OpenSafetyLab/SALAD-BENCH.

Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao • 2024

Related benchmarks

Task                                           Dataset                            Metric                   Result  Rank
Response Harmfulness Detection                 XSTEST-RESP                        Response Harmfulness F1  90.4    34
Safety Classification                          SafeRLHF                           F1 Score                 0.647   32
Response Harmfulness Classification            WildGuard (test)                   F1 (Total)               76.8    30
Response Classification                        BeaverTails V Text-Image Response  F1 Score                 81.13   23
Response Harmfulness Detection                 HarmBench                          F1 Score                 81.6    23
Adversarial and Jailbreaking Attack Detection  JailbreakBench                     AUROC                    0.7302  20
Adversarial and Jailbreaking Attack Detection  Beavertails                        AUROC                    0.7779  20
Adversarial and Jailbreaking Attack Detection  MaliciousInstruct                  AUROC                    0.7957  20
Adversarial and Jailbreaking Attack Detection  XSTest                             AUROC                    0.7906  20
Adversarial and Jailbreaking Attack Detection  HarmBench                          AUROC                    0.798   20
Showing 10 of 28 rows
