Constitutional AI: Harmlessness from AI Feedback
About
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
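The two phases described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `Model` and `Judge` callables, the prompt wording, and the `PreferencePair` container are all hypothetical stand-ins introduced here, and the finetuning, preference-model training, and RL steps are omitted.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a Model maps a prompt string to a completion string;
# a Judge returns 0 if the first response is better, 1 if the second is.
Model = Callable[[str], str]
Judge = Callable[[str, str, str, str], int]


@dataclass
class PreferencePair:
    """One AI-labeled comparison, usable as preference-model training data."""
    prompt: str
    chosen: str
    rejected: str


def critique_and_revise(model: Model, prompt: str, principle: str) -> str:
    """Supervised phase: sample a response, self-critique it, then revise it."""
    response = model(prompt)
    critique = model(
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique the response according to this principle: {principle}"
    )
    revision = model(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # The revised response is what the original model would be finetuned on.
    return revision


def label_preference(
    judge: Judge, prompt: str, first: str, second: str, principle: str
) -> PreferencePair:
    """RL phase: ask a model which of two samples better follows a principle."""
    better = judge(prompt, first, second, principle)
    if better == 0:
        return PreferencePair(prompt, chosen=first, rejected=second)
    return PreferencePair(prompt, chosen=second, rejected=first)
```

In this sketch, the revisions returned by `critique_and_revise` would form the supervised finetuning set, and the `PreferencePair` records would form the dataset of AI preferences used to train the preference model that supplies the RL reward signal.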
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Instruction Following | MT-Bench | MT-Bench Score | 6.22 | 189 |
| Mathematical Reasoning | MATH (test) | Pass@1 | 71.1 | 151 |
| Instruction Following | AlpacaEval | Win Rate | 74.9 | 125 |
| Multimodal Reasoning | MMBench | -- | -- | 50 |
| Visual Reasoning and Instruction Following | MM-Vet | Overall Score | 38 | 23 |
| Visual Instruction Following | LLaVA-Bench | -- | -- | 8 |
| Visual Multi-Choice | POPE | Accuracy | 88.3 | 6 |
| Attack Resilience Evaluation | 51,750 Adversarial Samples | Resilience Score (Log) | 42.7 | 5 |
| Security Analysis | Security Tasks 15,000 benign samples (test) | F1 (Log) | 92.7 | 5 |
| Adversarial Attack Defense | Held-out attacks (test) | ASR (Multi-turn Manip.) | 52.3 | 2 |