
Generalizing Verifiable Instruction Following

About

A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is an output constraint, such as "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times", that the user adds to craft a more useful answer. The skill of satisfying such constraints is called precise instruction following, and even today's strongest models struggle with it. We find that most models strongly overfit to the small set of verifiable constraints used by the benchmarks that test this skill, and do not generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
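
To make the idea of a verifiable constraint concrete, here is a minimal sketch of what verification functions for the two example constraints in the abstract might look like. It is not the released IFBench code: the function names, the punctuation handling, and the reward definition (the fraction of constraints satisfied) are all assumptions made for illustration.

```python
import re
from typing import Callable, Sequence


def verify_yes_no(response: str) -> bool:
    """Hypothetical check for "only answer with yes or no".

    Assumption: case is ignored and surrounding whitespace or
    trailing '.' / '!' is tolerated.
    """
    return response.strip().strip(".!").lower() in {"yes", "no"}


def verify_word_count(response: str, word: str, min_count: int) -> bool:
    """Hypothetical check for "mention the word X at least N times".

    Counts whole-word, case-insensitive occurrences of `word`.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def reward(response: str, checks: Sequence[Callable[[str], bool]]) -> float:
    """Assumed reward: the fraction of constraints the response satisfies.

    The actual RLVR reward used in the paper is not specified in this abstract.
    """
    if not checks:
        return 0.0
    return sum(check(response) for check in checks) / len(checks)
```

Because each check is a deterministic function of the response text, the resulting score can be computed automatically during training, which is what makes this kind of constraint usable as a verifiable reward.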

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi • 2025

Related benchmarks

Task                          Dataset                          Metric                Result  Rank
Organic chemistry             ChemBench Organic Chemistry      Spearman Correlation    0.12     8
Analytical chemistry          ChemBench Analytical Chemistry   Spearman Correlation   -0.32     8
Inorganic chemistry           ChemBench Inorganic Chemistry    Spearman Correlation   -0.34     8
Material science              ChemBench Material science       Spearman Correlation   -0.43     8
Ranking Consistency Analysis  MMLU-Pro Anatomy health          Spearman Correlation   -0.26     8
Physical chemistry            ChemBench Physical Chemistry     Spearman Correlation   -0.61     8
Ranking Consistency Analysis  MMLU-Pro health Virology         Spearman Correlation    0.04     8
Ranking Consistency Analysis  MMLU-Pro health Human aging      Spearman Correlation   -0.32     8
Ranking Consistency Analysis  MMLU-Pro Medical genetics health Spearman Correlation   -0.64     8
Ranking Consistency Analysis  MMLU-Pro Nutrition health        Spearman Correlation   -0.11     8
Showing 10 of 12 rows
