Consensus Sampling for Safer Generative AI

About

Motivated by undetectable risks in generative AI, we study a general robust aggregation problem: how to aggregate several probability distributions to boost safety. We present consensus sampling, a black-box algorithm that, given k distributions, has risk competitive with the average risk of the safest $s$ while abstaining when there is insufficient agreement. This yields an architecture-agnostic approach to generative-model safety when the distributions are induced by models that can sample and evaluate output probabilities. We formalize the guarantee through R-robustness, which also bounds information leakage and adversarial influence. Inspired by robust statistics and the provable copyright protection algorithm of Vyas et al (2023), we show that while a standard mixture is vulnerable to one unsafe constituent, a pointwise-median construction provides robust intuition, and our efficient sampler is Pareto-optimal for the tradeoff between worst-case risk and abstention. Experiments on synthetic distributions and image generation illustrate the general mechanism and its motivating safety application. The method requires overlap among safe distributions, but it provides a model-agnostic way to inherit guarantees from an unknown reliable subset.

Adam Tauman Kalai, Yael Tauman Kalai, Or Zamir• 2025

Related benchmarks

Task	Dataset	Result	Rank
Generative AI Output Safety	HarmBench and AdvBench (test)	Safe Rate20.39		8

Showing 1 of 1 rows

Other info

Follow for update

@wizwand_team Discord