Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences
About
We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function that is evaluated only on an offline dataset. We propose a novel solution, the bootstrapped training of score-conditioned generator (BootGen) algorithm. Our algorithm repeats a two-stage process. In the first stage, our algorithm trains the biological sequence generator with rank-based weights to enhance the accuracy of sequence generation conditioned on high scores. The second stage involves bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. Our key idea is to align the score-based generation with a proxy score function, which distills the knowledge of the proxy score function into the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce a diverse design. Extensive experiments show that our method outperforms competitive baselines on biological sequential design tasks. We provide reproducible source code: https://github.com/kaist-silab/bootgen.
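The two stages described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the weight form `1 / (k * N + rank)`, the function names `rank_based_weights` and `bootstrap_step`, the parameters `k`, `num_candidates`, and `top_k`, and the toy generator and proxy are all assumptions chosen for illustration. The abstract only states that training uses rank-based weights and that the dataset is augmented with self-generated, proxy-labeled samples.

```python
import random


def rank_based_weights(scores, k=0.01):
    """Return normalized weights that favor high-scoring samples.

    Assumed form: weight proportional to 1 / (k * N + rank), where rank 0
    is the highest score. Requires k > 0 to avoid division by zero.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ranks = [0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    raw = [1.0 / (k * n + r) for r in ranks]
    total = sum(raw)
    return [w / total for w in raw]


def bootstrap_step(dataset, generate, proxy, num_candidates=8, top_k=2):
    """Augment the dataset with proxy-labeled, self-generated sequences.

    dataset:  list of (sequence, score) pairs
    generate: hypothetical callable, target_score -> sequence
    proxy:    hypothetical callable, sequence -> float
    """
    # Condition generation on the best score seen so far (an assumption).
    target = max(score for _, score in dataset)
    candidates = [generate(target) for _ in range(num_candidates)]
    # Label the generated sequences with the proxy, not the true oracle.
    labeled = [(seq, proxy(seq)) for seq in candidates]
    labeled.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only the top-k candidates according to the proxy.
    return dataset + labeled[:top_k]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [("ACGT", 1.0), ("GGGT", 3.0), ("AATT", 0.0)]

    # Toy stand-ins: a random DNA generator and a proxy that counts 'G'.
    def toy_generate(target_score):
        return "".join(rng.choice("ACGT") for _ in range(4))

    def toy_proxy(seq):
        return float(seq.count("G"))

    weights = rank_based_weights([score for _, score in data])
    print("weights:", weights)
    data = bootstrap_step(data, toy_generate, toy_proxy)
    print("dataset size after one bootstrap step:", len(data))
```

In a full training loop, the weights would scale each sample's loss when fitting the score-conditioned generator, and the two functions would be called alternately for several rounds.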
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Model-Based Optimization | UTR | 90th Percentile Oracle Score | 7.74 | 17 |
| Offline Model-Based Optimization | D'Kitty | Oracle Score (90th Pctl) | 0.62 | 17 |
| Offline Model-Based Optimization | GFP | 90th Percentile Oracle Score | 3.6 | 17 |
| Offline Model-Based Optimization | ChEMBL | 90th Percentile Oracle Score | 0.61 | 17 |
| Offline Model-Based Optimization | TF Bind 8 | 90th Percentile Oracle Score | 38.8 | 17 |
| Model-Based Optimization | Design-Bench 2022 (test) | TF-Bind-8 Score | 0.979 | 16 |
| Model-Based Optimization | Design-Bench | LogP | -13 | 16 |
| Offline Model-Based Optimization | LogP | 90th Percentile Oracle Score | -116.8 | 16 |
| Offline Model-Based Optimization | Warfarin | 90th Percentile Oracle Score | 549 | 15 |