ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods

About

Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
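The loop described above (a fixed proposal mechanism, a surrogate refit from oracle feedback, and batched oracle queries) can be illustrated with a toy sketch. This is not ProSpero itself: random point mutations stand in for the frozen pre-trained generative model, a per-site running mean stands in for the surrogate, and a hidden-target match score stands in for the oracle. The residue selection and Sequential Monte Carlo steps are omitted. All names here (`toy_oracle`, `propose`, `SiteSurrogate`, `active_learning`) are hypothetical and chosen only for illustration.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_oracle(seq, target):
    """Hypothetical oracle: fraction of positions that match a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def propose(parent, rng, n_mut=1):
    """Hypothetical stand-in for a frozen generative model: random point mutations."""
    seq = list(parent)
    for pos in rng.sample(range(len(seq)), n_mut):
        seq[pos] = rng.choice(ALPHABET)
    return "".join(seq)


class SiteSurrogate:
    """Toy surrogate: running mean of observed fitness per (position, residue)."""

    def __init__(self):
        self.total = {}
        self.count = {}

    def update(self, seq, fitness):
        for key in enumerate(seq):
            self.total[key] = self.total.get(key, 0.0) + fitness
            self.count[key] = self.count.get(key, 0) + 1

    def predict(self, seq):
        seen = [self.total[k] / self.count[k] for k in enumerate(seq) if k in self.count]
        return sum(seen) / len(seen) if seen else 0.0


def active_learning(wild_type, target, rounds=5, pool_size=200, batch_size=20, seed=0):
    rng = random.Random(seed)
    surrogate = SiteSurrogate()
    labeled = {wild_type: toy_oracle(wild_type, target)}
    surrogate.update(wild_type, labeled[wild_type])
    history = [max(labeled.values())]  # best oracle fitness seen so far

    for _ in range(rounds):
        # Propose candidates around the best sequences labeled so far.
        parents = sorted(labeled, key=labeled.get, reverse=True)[:batch_size]
        pool = {propose(rng.choice(parents), rng) for _ in range(pool_size)}
        candidates = sorted(s for s in pool if s not in labeled)

        # Rank candidates with the surrogate and query the oracle on the top batch.
        batch = sorted(candidates, key=surrogate.predict, reverse=True)[:batch_size]
        for seq in batch:
            labeled[seq] = toy_oracle(seq, target)
            surrogate.update(seq, labeled[seq])  # oracle feedback updates the surrogate
        history.append(max(labeled.values()))

    return labeled, history
```

Calling `active_learning("ACDEFGHIKL", "ACDEFGHIKW")` returns every oracle-labeled sequence together with the best fitness after each round. The proposal function is never retrained; only the surrogate changes between rounds, which mirrors the division of labor the abstract describes.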

Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek • 2025

Related benchmarks

Task            Dataset  Result                                    Rank
Protein Design  Pab1     Maximum Fitness: 1.978                    12
Protein Design  AMIE     Mean Fitness (Top 100 Sequences): 0.238   12
Protein Design  E4B      Mean Fitness (Top 100): 8.017             12
Protein Design  GFP      Mean Fitness (Top 100): 3.613             12
Protein Design  LGK      Mean Fitness (Top 100): 0.04              12
Protein Design  Pab1     Mean Fitness (Top 100): 1.836             12
Protein Design  TEM      Mean Fitness (Top 100): 1.227             12
Protein Design  UBE2I    Mean Fitness (Top 100): 2.987             12
Protein Design  AAV      Maximum Fitness: 0.72                     12
Protein Design  AMIE     Maximum Fitness: 0.248                    12
Showing 10 of 43 rows
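
The page does not define the two metric names used in the table. Assuming they have their literal meaning (the best oracle score in a set of designed sequences, and the average of its 100 best scores), they could be computed as in the sketch below. The function names `max_fitness` and `mean_top_k` are hypothetical.

```python
def max_fitness(scores):
    """Highest fitness among the evaluated sequences."""
    return max(scores)


def mean_top_k(scores, k=100):
    """Mean fitness of the k best sequences (or of all of them if fewer than k)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)
```

The two metrics answer different questions: maximum fitness rewards a single outlier, while the top-100 mean reflects whether a method finds many good sequences.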