One protein is all you need
About
Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody-antigen loop modeling and enhances 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Protein fitness prediction | ProteinGym (test) | Avg. Spearman Correlation | 0.5087 | 14 |
| Fitness Prediction | MaveDB subset of 50 proteins | Average Spearman Correlation | 0.5462 | 10 |
| Protein Structure Prediction | CAMEO 18 low-confidence targets (test) | TM-score | 0.5047 | 10 |
| Subcellular Localization Prediction | setHard (test) | Accuracy | 63.4 | 2 |
| TPS substrate classification | TPS dataset (cross-validation) | mAP | 81.1 | 2 |
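
The two fitness rows report an average Spearman correlation. As a rough illustration of how such a number can be computed, the sketch below ranks predicted and measured scores within each assay, takes the Pearson correlation of the ranks, and averages over assays. The helper names and the toy scores are hypothetical, and the benchmarks' exact aggregation scheme may differ from this plain mean.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(predicted), ranks(measured))


def average_spearman(assays):
    """Unweighted mean of per-assay Spearman correlations."""
    return sum(spearman(p, m) for p, m in assays) / len(assays)


# Toy (predicted, measured) score pairs for two made-up assays.
assays = [
    ([0.1, 0.4, 0.3, 0.9], [1.0, 2.0, 3.0, 4.0]),
    ([0.5, 0.2, 0.8], [3.0, 1.0, 2.0]),
]
avg = average_spearman(assays)
print(f"average Spearman: {avg:.2f}")
```

Because only ranks matter, the metric rewards a model for ordering variants correctly even when its raw scores are on a different scale from the measured fitness values.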