AI safety via debate

About

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self-play on a zero-sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial-time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
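The debate protocol described in the abstract is simple enough to sketch in code. The toy below mirrors the MNIST experiment in miniature: two debaters alternately reveal pixels of an image, one arguing for the true digit and one for a lie, and a sparse judge that sees only the revealed pixels picks the more convincing claim. Everything here is illustrative, not the paper's implementation: `sparse_judge`, `choose_pixel`, and `run_debate` are hypothetical names, the judge is a stub standing in for a classifier trained on sparse pixel masks, and the greedy pixel-selection move replaces the stronger search a real debater would use.

```python
# Toy sketch of the debate game from the abstract: two agents take turns
# revealing pixels of an MNIST image to a sparse judge, which then decides
# which agent's claimed label the revealed evidence better supports.
# This is an illustrative simplification, not the paper's method.

import numpy as np

def sparse_judge(revealed: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stand-in for a classifier trained on sparsely revealed pixels.

    Returns a probability distribution over the 10 digit classes given
    only the pixels where mask is True. A real judge would be a trained
    network; this stub just returns uniform probabilities.
    """
    del revealed, mask  # a trained model would use these
    return np.full(10, 0.1)

def choose_pixel(image, mask, judge, own_label, rng):
    """Greedy debater move: reveal the not-yet-revealed pixel that most
    increases the judge's probability of the debater's claimed label."""
    candidates = np.flatnonzero(~mask.ravel())
    sample = rng.choice(candidates, size=min(64, len(candidates)), replace=False)
    best, best_score = None, -np.inf
    for idx in sample:
        trial = mask.copy()
        trial.ravel()[idx] = True
        score = judge(image * trial, trial)[own_label]
        if score > best_score:
            best, best_score = idx, score
    return best

def run_debate(image, honest_label, liar_label, judge, n_pixels=6, seed=0):
    """Play one debate: agents alternate revealing pixels until n_pixels
    are shown, then the judge picks the more probable claimed label."""
    rng = np.random.default_rng(seed)
    mask = np.zeros_like(image, dtype=bool)
    claims = [honest_label, liar_label]
    for turn in range(n_pixels):
        idx = choose_pixel(image, mask, judge, claims[turn % 2], rng)
        mask.ravel()[idx] = True
    probs = judge(image * mask, mask)
    return honest_label if probs[honest_label] >= probs[liar_label] else liar_label

# Toy usage on a random "image". With a judge actually trained on sparse
# masks, the abstract reports debate lifting 6-pixel accuracy from 59.4%
# to 88.9%; with this uniform stub the outcome is trivially a tie.
image = np.random.default_rng(1).random((28, 28))
print(run_debate(image, honest_label=3, liar_label=8, judge=sparse_judge))
```

The zero-sum structure lives in `choose_pixel`: each debater optimizes the judge's probability of its own claim, so the liar's best strategy is to reveal misleading pixels and the honest agent's is to reveal pixels that expose the lie, which is the dynamic the paper argues favors truthful play.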

Geoffrey Irving, Paul Christiano, Dario Amodei • 2018

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Software Architecture | DCI Evaluation Suite Arch. | Quality Score | 9.19 | 5 |
| Decision Making | Deliberative decision-making tasks n=45 (overall) | Mean Tokens | 3.20e+4 | 5 |
| Disagreement Handling | DCI Evaluation Suite Disagree | Quality Score | 8.68 | 5 |
| Hidden-Profile Integration | DCI Evaluation Suite Hidden-Prof | Quality Score | 9.3 | 5 |
| Late-Evidence Analysis | DCI Evaluation Suite Late-Evid. | Quality Score | 9.26 | 5 |
| Process Artifact Analysis | Deliberative Decision-Making Evaluation Set | Decision Packet Completeness | 16 | 5 |
| Reasoning evaluation | Full task set (n=45) | Overall Score | 8.78 | 5 |
| Routine Task Management | DCI Evaluation Suite Routine | Quality Score | 8.86 | 5 |
| Policy Analysis | DCI Evaluation Suite Policy | Quality Score | 8.26 | 5 |
| Risk Assessment | DCI Evaluation Suite Risk | Quality Score | 8.03 | 5 |
