Adaptive Instruction Composition for Automated LLM Red-Teaming

About

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness and diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions, guiding the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on HarmBench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
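
To make the exploration/exploitation idea concrete, here is a minimal sketch of a contextual bandit choosing which query and tactic to combine. It is not the authors' implementation: it swaps the neural contextual bandit for a linear epsilon-greedy one, replaces contrastive embeddings with hand-written toy vectors, and uses a simulated judge in place of the attacker, target, and evaluator. All names (`compose_instruction`, `LinearEpsilonGreedyBandit`, `simulated_judge`, the placeholder queries and tactics) are hypothetical, and the diversity objective is not modeled.

```python
import random


def compose_instruction(query, tactic):
    """Combine one query and one tactic into an attacker instruction (placeholder template)."""
    return f"Apply tactic [{tactic}] to request [{query}]."


class LinearEpsilonGreedyBandit:
    """Linear stand-in for a neural contextual bandit over (query, tactic) contexts."""

    def __init__(self, dim, epsilon=0.2, lr=0.1, seed=0):
        self.w = [0.0] * dim
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def score(self, x):
        # Predicted reward for a context vector.
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def select(self, contexts):
        # Explore with probability epsilon, otherwise pick the highest-scoring arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(contexts))
        scores = [self.score(x) for x in contexts]
        return max(range(len(contexts)), key=scores.__getitem__)

    def update(self, x, reward):
        # One SGD step on squared error between predicted and observed reward.
        err = reward - self.score(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]


# Hypothetical toy embeddings; a real system would use learned text embeddings.
QUERIES = {"query_A": [1.0, 0.0], "query_B": [0.0, 1.0]}
TACTICS = {"tactic_X": [1.0, 0.0], "tactic_Y": [0.0, 1.0]}
ARMS = [(q, t) for q in QUERIES for t in TACTICS]
CONTEXTS = [QUERIES[q] + TACTICS[t] for q, t in ARMS]  # concatenated embeddings


def simulated_judge(query, tactic):
    """Assumed reward signal: in this toy world only tactic_Y 'succeeds'."""
    return 1.0 if tactic == "tactic_Y" else 0.0


def run(rounds=2000):
    bandit = LinearEpsilonGreedyBandit(dim=4)
    for _ in range(rounds):
        i = bandit.select(CONTEXTS)
        query, tactic = ARMS[i]
        _instruction = compose_instruction(query, tactic)  # would go to the attacker LLM
        bandit.update(CONTEXTS[i], simulated_judge(query, tactic))
    return bandit


bandit = run()
bandit.epsilon = 0.0  # act greedily to inspect what was learned
best_query, best_tactic = ARMS[bandit.select(CONTEXTS)]
```

Random combination corresponds to `epsilon=1.0` in this sketch; lowering epsilon lets the learned scores steer which combinations are tried.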

Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa, Emily Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 100 | 557 |
| Adversarial Attack | Mistral-7B (successful attacks) | Unique Queries | 3.02e+3 | 3 |
| Adversarial Attack | Llama-3-70B successful attacks | Unique Queries Count | 1.32e+3 | 3 |
| Adversarial Attack | Llama-3.3-70B successful attacks | Unique Queries | 2.24e+3 | 3 |
| Adversarial Attack Diversity Analysis | Mistral-7B | Average Attack Similarity | 0.336 | 3 |
| Adversarial Attack Diversity Analysis | Llama-3 70B | Average Attack Similarity | 35.2 | 3 |
| Red Teaming | Mistral-7B | Attack Success Rate (ASR) | 56.7 | 3 |
| Red Teaming | Llama-3 70B | Attack Success Rate (ASR) | 45 | 3 |
| Red Teaming | Llama-3.3-70B | Attack Success Rate (ASR) | 0.558 | 3 |
| Adversarial Attack Diversity Analysis | Llama 70B 3.3 | Average Attack Similarity | 0.269 | 3 |
Showing 10 of 14 rows
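
For readers unfamiliar with the metric names in the listing, the sketch below shows one plausible way such quantities could be computed. The definitions are assumptions, not taken from the paper: ASR as the percentage of attempts judged successful, unique queries as a count of distinct strings among successful attacks, and average attack similarity as the mean pairwise cosine similarity between attack embeddings. All function names are hypothetical.

```python
import math
from itertools import combinations


def attack_success_rate(outcomes):
    """Percentage of attempts judged successful (assumed definition of ASR)."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(1 for o in outcomes if o) / len(outcomes)


def unique_query_count(successful_queries):
    """Number of distinct query strings among successful attacks."""
    return len(set(successful_queries))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_pairwise_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs (lower suggests more diverse attacks)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy inputs for illustration only; these are not results from the paper.
asr = attack_success_rate([True, False, True, True])
n_unique = unique_query_count(["q1", "q2", "q1"])
avg_sim = average_pairwise_similarity([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

This sketch reports ASR as a percentage. Some values in the listing (for example 0.558 alongside 56.7) look as if they may be on a fraction scale instead, so the scale of each row should be checked before comparing them.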
