
AgenticRed: Evolving Agentic Systems for Red-Teaming

About

While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem, and it autonomously evolves automated red-teaming systems using evolutionary selection and generational knowledge. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B, 98% on Llama-3-8B and 100% on Qwen3-8B on HarmBench. Our approach generates robust, query-agnostic red-teaming systems that transfer strongly to the latest proprietary models, achieving an impressive 100% ASR on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2. This work highlights evolutionary algorithms as a powerful approach to AI safety that can keep pace with rapidly evolving models.
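The evolutionary selection with generational knowledge described above can be illustrated with a generic, toy loop. This is a minimal sketch and not the authors' pipeline: `Candidate`, `evaluate`, `mutate`, and `evolve` are hypothetical names, each candidate is an opaque parameter vector rather than a real red-teaming system, and the fitness is a placeholder function where the paper's setting would use a measured attack success rate. Random perturbation also stands in for the LLM-driven design step, which is an assumption made only so the sketch runs on its own.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one system design (an opaque parameter vector)."""
    params: list[float]
    fitness: float = 0.0


def evaluate(candidate: Candidate) -> float:
    # Placeholder fitness (assumption): peaks when every parameter equals 0.5.
    # In the setting described above, this would be a measured success rate.
    return -sum((p - 0.5) ** 2 for p in candidate.params)


def mutate(parent: Candidate, rng: random.Random, scale: float = 0.1) -> Candidate:
    # Random perturbation stands in for an LLM proposing a refined design.
    return Candidate([p + rng.gauss(0, scale) for p in parent.params])


def evolve(generations=20, population_size=8, survivors=3, seed=0):
    rng = random.Random(seed)
    population = [
        Candidate([rng.random() for _ in range(4)]) for _ in range(population_size)
    ]
    archive = []  # record of past elites, a simple form of generational knowledge
    best_per_generation = []
    for _ in range(generations):
        for candidate in population:
            candidate.fitness = evaluate(candidate)
        # Selection: keep the top designs and carry them into the next generation.
        population.sort(key=lambda c: c.fitness, reverse=True)
        elite = population[:survivors]
        archive.extend(elite)
        best_per_generation.append(elite[0].fitness)
        children = [
            mutate(rng.choice(elite), rng) for _ in range(population_size - survivors)
        ]
        population = elite + children
    return best_per_generation, archive


history, archive = evolve()
```

Because the elite designs are carried over unchanged and the placeholder fitness is deterministic, the best score in `history` never decreases from one generation to the next.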

Jiayi Yuan, Jonathan Nöther, Natasha Jaques, Goran Radanović • 2026

Related benchmarks

Task         Dataset                                          ASR (%)  Rank
Red Teaming  HarmBench Llama-3-8B (test)                      98       5
Red Teaming  HarmBench Claude-Sonnet-3.5 (held-out test)      60       5
Red Teaming  HarmBench Llama-2-7B (test)                      96       5
Red Teaming  HarmBench gpt-3.5-turbo-0125 (test)              100      3
Red Teaming  HarmBench gpt-4o-2024-08-06 (test)               100      3
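ASR may be reported either as a fraction (0.98) or as a percentage (98%). As a small illustration of how the two relate, the sketch below computes ASR as the share of attempts judged successful. The helper names and the 50-attempt example are hypothetical and are not taken from the benchmark.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful (list of booleans)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def as_percent(rate):
    """Convert a fractional rate to a percentage, rounded to two decimals."""
    return round(rate * 100, 2)


# Hypothetical example: 49 successes out of 50 attempts.
outcomes = [True] * 49 + [False]
rate = attack_success_rate(outcomes)
percent = as_percent(rate)
```

Here `rate` is the fractional form and `percent` is the same quantity on a 0 to 100 scale.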
