
Abstractive Red-Teaming of Language Model Character

About

We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.
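
The search loop described above can be illustrated with a toy sketch. This is not the authors' implementation: it only mirrors the general shape of the iterative approach (score categories by the queries they produce, keep the best, propose refinements). Every name below is hypothetical. The query generator, the reward model, and the category synthesizer are replaced with deterministic or random stubs so the code runs without any LLM; in particular, the strong LLM that synthesizes categories from high-scoring queries is swapped for a step that appends one random attribute to each top category.

```python
import random
from statistics import mean

# Hypothetical attribute vocabulary; a real system would use free-form natural language.
ATTRIBUTES = ["in Chinese", "about family roles", "about the future",
              "about cooking", "about prison"]

def sample_queries(category, n, rng):
    """Stub query generator: emits n query variants that fit the category."""
    return [f"query {i} ({'; '.join(category)}) #{rng.randint(0, 999)}"
            for i in range(n)]

def violation_score(query):
    """Stub reward model (assumption): pretends two attributes elicit violations."""
    score = 0.0
    if "about the future" in query:
        score += 1.0
    if "about prison" in query:
        score += 2.0
    return score

def category_score(category, n, rng):
    """Score a category as the mean violation score over its sampled queries."""
    return mean(violation_score(q) for q in sample_queries(category, n, rng))

def propose_categories(top_categories, rng):
    """Stub synthesizer: extends each top category with one unused attribute."""
    proposals = []
    for cat in top_categories:
        unused = [a for a in ATTRIBUTES if a not in cat]
        if unused:
            proposals.append(tuple(sorted(cat + (rng.choice(unused),))))
    return proposals

def search(rounds=5, keep=2, n_queries=4, seed=0):
    """Return the (score, category) pair with the highest score seen."""
    rng = random.Random(seed)
    pool = {(a,) for a in ATTRIBUTES}  # start from single-attribute categories
    best = None
    for _ in range(rounds):
        scored = sorted(((category_score(c, n_queries, rng), c)
                         for c in sorted(pool)), reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = scored[0]
        top = [c for _, c in scored[:keep]]
        pool = set(top) | set(propose_categories(top, rng))
    return best

if __name__ == "__main__":
    score, category = search()
    print(score, category)
```

Scoring a category by the mean over sampled queries, rather than by a single query, is what makes the result an abstraction over query variants; how the paper aggregates scores is not stated in the abstract, so the mean here is an assumption.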

Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones • 2026

Related benchmarks

Task         Dataset                                        Result                              Rank
Red Teaming  AI Supremacy principle v1 (test)               Mean Best Category Score: 11.7      12
Red Teaming  Religious Discrimination principle v1 (test)   Mean Best Category Score: 5.32      12
Red Teaming  Illegal Activity principle v1 (test)           Mean Score (Best Category): 0.41    12
