Reverse Distillation: Consistently Scaling Protein Language Model Representations
About
Unlike the predictable scaling laws seen in natural language processing and computer vision, protein language models (PLMs) scale poorly: on many tasks, performance within a model family plateaus or even decreases with size, and mid-sized models often outperform the largest. We introduce Reverse Distillation, a principled framework that decomposes a large PLM's representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first k dimensions of a larger model's embedding are exactly the smaller model's representation. This ensures that larger reverse-distilled models consistently outperform smaller ones. The motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly shared protein features; reverse distillation isolates these shared features and orthogonally extracts the additional contribution of the larger model, preventing interference between the two. On ProteinGym benchmarks, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse-distilled 15-billion-parameter model achieving the strongest performance. The framework generalizes to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.
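The nested-embedding idea above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (see the repository for that); it simply shows, on synthetic data, how a large model's embedding can be split into the small model's representation plus an orthogonal residual, so that the first k dimensions of the combined embedding are exactly the small model's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings for a batch of proteins (synthetic stand-ins):
# S: small-model embeddings (n x k), L: large-model embeddings (n x D).
n, k, D = 128, 16, 64
S = rng.normal(size=(n, k))
L = rng.normal(size=(n, D))

# Fit a linear map predicting L from S (least squares), then keep only the
# residual of L that the small model cannot explain.
W, *_ = np.linalg.lstsq(S, L, rcond=None)
residual = L - S @ W

# Nested, Matryoshka-style embedding: the first k dimensions are exactly the
# small model's representation; the remaining dimensions carry the large
# model's extra, orthogonal signal.
Z = np.concatenate([S, residual], axis=1)

# Sanity check: the residual block is (numerically) orthogonal to the
# small-model features on this batch.
assert np.allclose(S.T @ residual, 0.0, atol=1e-6)
print(Z.shape)  # (128, 80)
```

Truncating Z to its first k columns recovers the small model's embedding exactly, which is the sense in which larger reverse-distilled embeddings can only add information on top of smaller ones.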
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mutational effect prediction | ProteinGym DMS 1- and 2-mutation set (1-mutation split) | Spearman correlation | 0.904 | 12 |
| Mutational effect prediction | ProteinGym DMS 1- and 2-mutation set (2-mutation split) | Spearman correlation | 0.72 | 12 |
| 3-class secondary structure prediction | SSP Q3 | AUPR | 86.1 | 6 |
| 8-class secondary structure prediction | SSP Q8 | AUPR | 43.1 | 6 |
| Metal ion binding prediction | MIB | AUPR | 90.1 | 6 |
| Mutational effect prediction | ProteinGym DMS >2-mutations set (3-mutation split) | Spearman correlation | 0.652 | 6 |
| Mutational effect prediction | ProteinGym DMS >2-mutations set (4-mutation split) | Spearman correlation | 0.615 | 6 |
| R2/R1 prediction | R2 R1 | AUPR | 46.8 | 6 |
| Localization prediction | LOC | AUPR | 74.3 | 6 |