IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

About

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
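
As a rough illustration of what a trust-ordered policy means, the sketch below resolves conflicting instructions by keeping, for each constrained behavior, the instruction from the most trusted role. It is not the paper's method or the dataset's format: the `Role` ordering (system > developer > user > tool) is an assumption read off the order in the abstract, and `Instruction`, `key`, `value`, and `resolve` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Role(IntEnum):
    # Assumed trust order: higher value = more trusted.
    TOOL = 0
    USER = 1
    DEVELOPER = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Instruction:
    role: Role
    key: str    # hypothetical: the behavior this instruction constrains
    value: str  # hypothetical: the requested setting for that behavior


def resolve(instructions: list[Instruction]) -> dict[str, Instruction]:
    """For each key, keep the instruction issued by the most trusted role.

    When two instructions come from the same role, the later one wins.
    """
    winners: dict[str, Instruction] = {}
    for inst in instructions:
        current = winners.get(inst.key)
        if current is None or inst.role >= current.role:
            winners[inst.key] = inst
    return winners


# A user tries to override a system rule; a tool output tries to override the user.
messages = [
    Instruction(Role.SYSTEM, "reveal_password", "never"),
    Instruction(Role.USER, "reveal_password", "always"),
    Instruction(Role.USER, "language", "French"),
    Instruction(Role.TOOL, "language", "German"),
]
policy = resolve(messages)
print(policy["reveal_password"].value)  # never
print(policy["language"].value)         # French
```

Real conflicts are rarely this clean, which is part of why the abstract describes IH as hard to train: a model must first recognize that two instructions conflict before any ordering can be applied.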

Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy | 94 | 370 |
| Science Reasoning | GPQA Diamond | Accuracy | 83 | 34 |
| Instruction Hierarchy Robustness | Human Red-teaming IH-Challenge | Number of Tasks | 268 | 3 |
| Instruction Following | System IFEval | IFEval Score | 96 | 2 |
| Instruction Hierarchy Robustness | Gandalf Password (dev-user) | Score | 100 | 2 |
| Instruction Hierarchy Robustness | TensorTrust sys-user | Score | 94 | 2 |
| Instruction Hierarchy Robustness | TensorTrust user (dev) | Score | 91 | 2 |
| Instruction Hierarchy Robustness | RealGuardrails Distractors | Score | 0.95 | 2 |
| Instruction Hierarchy Robustness | RealGuardrails Handwritten | Score | 89 | 2 |
| Instruction Hierarchy Robustness | Tutor Jailbreak sys-user | Non-violation Rate | 99 | 2 |
Showing 10 of 21 rows
