ProtSent: Protein Sentence Transformers
About
Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary, or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein-protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks, with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure without any task-specific supervision. We release the models, public data, training recipe, and code.
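To make the two mechanisms named above more concrete, here is a minimal NumPy sketch of an in-batch-negatives ranking loss of the kind MultipleNegativesRankingLoss computes, together with a nearest-neighbor Recall@1 measure over frozen embeddings. It is an illustration under stated assumptions, not the released ProtSent code: the function names, the cosine similarity, the `scale=20.0` temperature, and the toy vectors are all choices made for this example, and real inputs would be mean-pooled pLM embeddings rather than hand-written arrays.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities.

    Row i of `anchors` is paired with row i of `positives`; every other row
    of `positives` in the batch serves as a negative for anchor i.
    `scale` is an assumed temperature, not a value taken from the paper.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = scale * (a @ p.T)                           # (B, B) scaled cosine
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # correct class = diagonal


def recall_at_1(embeddings, labels):
    """Fraction of items whose nearest other item (by cosine) shares their label."""
    e = l2_normalize(embeddings)
    sims = e @ e.T
    np.fill_diagonal(sims, -np.inf)                      # exclude self-matches
    nearest = sims.argmax(axis=1)
    return float(np.mean(labels[nearest] == labels))


# Toy batch: four orthogonal "protein" embeddings (hypothetical data).
anchors = np.eye(4)
aligned_loss = multiple_negatives_ranking_loss(anchors, np.eye(4))
shuffled_loss = multiple_negatives_ranking_loss(anchors, np.roll(np.eye(4), 1, axis=0))

# Toy retrieval set: two tight clusters with two members each.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = np.array([0, 0, 1, 1])
toy_recall = recall_at_1(toy_embeddings, toy_labels)
```

In this toy batch, the loss is near zero when each anchor is paired with its matching positive and large when the positives are shuffled, which is the pressure that pulls related proteins together in embedding space. The Recall@1 function reflects the same neighborhood-quality idea that a k-nearest-neighbor probe on frozen embeddings measures.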
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Binary Classification | Neuropeptide (NeuroPID) | AUC | 95.6 | 36 |
| Regression | Beta-lactamase PEER | Spearman Correlation | 0.793 | 36 |
| Structural retrieval | SCOPe-40 | Recall@1 | 50.7 | 34 |
| Protein-level Regression | Fluorescence TAPE | Spearman Correlation | 0.569 | 9 |
| Binary Classification | PPI Bernett | AUC | 59.2 | 4 |
| Binary Classification | Peptide-HLA Binding | AUC | 0.775 | 4 |
| Binary Classification | DeepSol | AUC | 71.9 | 4 |
| Binary Classification | Signal Peptide | AUC | 97.2 | 4 |
| Binary Classification | Metal Ion Binding | AUC | 0.843 | 4 |
| Binary Classification | Material Production | AUC | 0.759 | 4 |