One protein is all you need

About

Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody-antigen loop modeling and enhances 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.
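The idea of customizing a model to a single target protein through a self-supervised objective can be illustrated with a toy sketch. This is not the ProteinTTT implementation. Everything here is an assumption made for illustration: the "model" is a 20×20 table of logits that predicts a residue from its left neighbour (a stand-in for a protein language model), the objective is a simple masked-residue cross-entropy, and the names `customize` and `masked_loss`, the hyperparameters, and the example sequence are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_loss(weights, seq, positions):
    """Mean cross-entropy of predicting seq[i] from its left neighbour."""
    total = 0.0
    for i in positions:
        probs = softmax(weights[IDX[seq[i - 1]]])
        total -= math.log(probs[IDX[seq[i]]])
    return total / len(positions)


def customize(weights, seq, steps=200, lr=0.5, mask_frac=0.15, seed=0):
    """Return a copy of `weights` adapted to one sequence only.

    Each step samples a random subset of positions to "mask" and takes
    gradient steps on the cross-entropy of recovering those residues.
    No labels or additional sequences are used.
    """
    rng = random.Random(seed)
    w = [row[:] for row in weights]  # leave the original model untouched
    candidates = list(range(1, len(seq)))
    n_mask = max(1, int(mask_frac * len(candidates)))
    for _ in range(steps):
        for i in rng.sample(candidates, n_mask):
            ctx, target = IDX[seq[i - 1]], IDX[seq[i]]
            probs = softmax(w[ctx])
            for k in range(len(AMINO_ACIDS)):
                # Gradient of cross-entropy w.r.t. logits: probs - one_hot
                grad = probs[k] - (1.0 if k == target else 0.0)
                w[ctx][k] -= lr * grad / n_mask
    return w


# Arbitrary illustrative sequence (hypothetical target protein).
target = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Uniform logits stand in for a general-purpose "pre-trained" model.
base = [[0.0] * len(AMINO_ACIDS) for _ in AMINO_ACIDS]

all_pos = list(range(1, len(target)))
before = masked_loss(base, target, all_pos)
adapted = customize(base, target)
after = masked_loss(adapted, target, all_pos)
print(f"masked-residue loss before: {before:.3f}, after: {after:.3f}")
```

In this sketch the adapted copy is used only for the one target and the base table is left unchanged, so each new protein would start again from the general-purpose model. Whether ProteinTTT resets weights in this way is not stated in the abstract.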

Anton Bushuiev, Roman Bushuiev, Olga Pimenova, Nikola Zadorozhny, Raman Samusevich, Elisabet Manaskova, Rachel Seongeun Kim, Hannes Stärk, Jiri Sedlar, Martin Steinegger, Tomáš Pluskal, Josef Sivic • 2024

Related benchmarks

Task	Dataset	Result
Protein fitness prediction	ProteinGym (test)	Avg. Spearman Correlation: 0.5087	14
Fitness Prediction	MaveDB subset of 50 proteins	Average Spearman Correlation: 0.5462	10
Protein Structure Prediction	CAMEO 18 low-confidence targets (test)	TM-score: 0.5047	10
Subcellular Localization Prediction	setHard (test)	Accuracy: 63.4	2
TPS substrate classification	TPS dataset (cross-validation)	mAP: 81.1	2

