Consensus Sampling for Safer Generative AI
About
Motivated by undetectable risks in generative AI, we study a general robust aggregation problem: how to aggregate several probability distributions to boost safety. We present consensus sampling, a black-box algorithm that, given $k$ distributions, has risk competitive with the average risk of the safest $s$ of them while abstaining when there is insufficient agreement. This yields an architecture-agnostic approach to generative-model safety when the distributions are induced by models that can sample and evaluate output probabilities. We formalize the guarantee through R-robustness, which also bounds information leakage and adversarial influence. Inspired by robust statistics and the provable copyright protection algorithm of Vyas et al. (2023), we show that while a standard mixture is vulnerable to one unsafe constituent, a pointwise-median construction provides robust intuition, and our efficient sampler is Pareto-optimal for the tradeoff between worst-case risk and abstention. Experiments on synthetic distributions and image generation illustrate the general mechanism and its motivating safety application. The method requires overlap among safe distributions, but it provides a model-agnostic way to inherit guarantees from an unknown reliable subset.
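The contrast between a standard mixture and a pointwise median can be illustrated on a toy example. The sketch below is not the paper's consensus sampler; it is a minimal illustration that assumes three hypothetical discrete distributions (two that agree on safe outputs and one that puts all its mass on an output labeled `harm`), and the helper names `mixture` and `pointwise_median` are invented here.

```python
from statistics import median


def mixture(dists):
    """Uniform mixture: average the probability of each outcome across models."""
    outcomes = sorted({x for d in dists for x in d})
    k = len(dists)
    return {x: sum(d.get(x, 0.0) for d in dists) / k for x in outcomes}


def pointwise_median(dists):
    """Pointwise median: take the median probability of each outcome.

    The result is generally not normalized (it may sum to less or more
    than 1), so it is a scoring function rather than a distribution.
    """
    outcomes = sorted({x for d in dists for x in d})
    return {x: median(d.get(x, 0.0) for d in dists) for x in outcomes}


# Hypothetical toy distributions (assumed for illustration only).
safe_1 = {"a": 0.5, "b": 0.5}
safe_2 = {"a": 0.6, "b": 0.4}
unsafe = {"harm": 1.0}
dists = [safe_1, safe_2, unsafe]

mix = mixture(dists)
med = pointwise_median(dists)

print("mixture mass on harm:", mix["harm"])   # one unsafe model contributes 1/k
print("median mass on harm: ", med["harm"])   # the two safe models outvote it
print("total median mass:   ", sum(med.values()))
```

In this toy case the mixture assigns the unsafe output one third of the mass, while the pointwise median assigns it none. The median's total mass here falls short of 1; reading that shortfall as room for abstention is only a loose intuition for this example, not a statement of how the paper's sampler decides when to abstain.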
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Generative AI Output Safety | HarmBench and AdvBench (test) | Safe Rate: 20.39 | 8 |