
ProtSent: Protein Sentence Transformers

About

Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary, or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein-protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks, with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, training recipe, and code.

Dan Ofer, Oriel Perets, Michal Linial, Nadav Rappoport • 2026
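
The abstract names MultipleNegativesRankingLoss as the training objective. As a rough illustration of the in-batch-negatives idea behind that kind of loss, and not the authors' training code, the sketch below treats each anchor's paired embedding as the correct "class" and every other positive in the batch as a negative. The cosine similarity, the scale of 20.0, the function names, and the toy 2-D vectors (standing in for mean-pooled protein embeddings) are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over rows of the scaled similarity matrix.

    Row i holds similarities between anchor i and every positive in the
    batch; the matching positive (column i) is the target, and all other
    columns act as negatives. `scale` is an assumed temperature-like factor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two anchors with hypothetical 2-D embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits near its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive sits near the other anchor

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
```

In this toy setup the loss is close to zero when paired embeddings are already neighbors and large when they are not, which is the pressure that pulls related proteins together during fine-tuning.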

Related benchmarks

Task                      Dataset                  Metric                Result  Rank
Binary Classification     Neuropeptide (NeuroPID)  AUC                   95.6    36
Regression                Beta-lactamase PEER      Spearman Correlation  0.793   36
Structural retrieval      SCOPe-40                 Recall@1              50.7    34
Protein-level Regression  Fluorescence TAPE        Spearman Correlation  0.569   9
Binary Classification     PPI Bernett              AUC                   59.2    4
Binary Classification     Peptide-HLA Binding      AUC                   0.775   4
Binary Classification     DeepSol                  AUC                   71.9    4
Binary Classification     Signal Peptide           AUC                   97.2    4
Binary Classification     Metal Ion Binding        AUC                   0.843   4
Binary Classification     Material Production      AUC                   0.759   4
Showing 10 of 24 rows
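
For readers unfamiliar with the Recall@1 metric reported for SCOPe-40, the following is a minimal sketch of one common way to compute it from frozen embeddings: for each query, find its single nearest neighbor among all other items and count a hit when the labels match. The cosine similarity, the label names, and the toy vectors are assumptions for illustration; the benchmark's exact protocol (for example, which level of the SCOPe hierarchy counts as a match) is not stated in this listing.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest other item shares their label."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Exclude the query itself, then pick the most similar remaining item.
        candidates = [j for j in range(len(embeddings)) if j != i]
        nearest = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        if labels[nearest] == labels[i]:
            hits += 1
    return hits / len(embeddings)


# Toy data: hypothetical 2-D embeddings with made-up class labels.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["fold_a", "fold_a", "fold_b", "fold_a"]

score = recall_at_1(embeddings, labels)
```

Here the first two items retrieve each other (two hits), while the last two are nearest neighbors with different labels (two misses), so the score is 0.5.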
