
Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

About

The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.
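
As a rough intuition for the objective described above, the sketch below implements a multi-layer self-distillation loss in PyTorch. Everything here is an illustrative assumption rather than the paper's implementation: the toy encoder, the per-layer smooth-L1 regression, the EMA teacher update, and all names (TinyEncoder, multilayer_distillation_loss, ema_update) are placeholders, and the masked context-to-target prediction of JEPA-style training is omitted for brevity.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Toy stand-in for a backbone; forward returns every block's output."""
    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # collect one hidden representation per layer
        return feats

def multilayer_distillation_loss(student_feats, teacher_feats):
    # Average a per-layer regression term, so the student must match the
    # teacher at several levels of abstraction simultaneously.
    loss = sum(
        F.smooth_l1_loss(s, t.detach())
        for s, t in zip(student_feats, teacher_feats)
    )
    return loss / len(student_feats)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student,
    # a common stabiliser for self-distillation targets (assumed here).
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

student = TinyEncoder()
teacher = copy.deepcopy(student).requires_grad_(False)

x = torch.randn(8, 64)  # a toy batch of embeddings
loss = multilayer_distillation_loss(student(x), teacher(x))
loss.backward()
ema_update(teacher, student)
```

Because every tapped layer contributes a target, early layers supply lower-level, more grounded features while later layers supply more abstract ones, which is one plausible reading of how a hierarchical objective could avoid relying solely on non-stationary final-layer targets.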

Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor • 2026

Related benchmarks

Task                  | Dataset            | Metric           | Result | Rank
Image Classification  | ImageNet-1K (test) | Top-1 Accuracy   | 85.4   | 848
Image Classification  | ImageNet-1K        | Top-1 Accuracy   | 80.6   | 600
Semantic Segmentation | ADE20K             | mIoU             | 41.2   | 366
Image Classification  | VTAB               | Overall Accuracy | 68.7   | 103
Semantic Segmentation | Cityscapes         | mIoU             | 42.8   | 82
Image Classification  | iNaturalist 2021   | Top-1 Accuracy   | 77.1   | 70
Semantic Segmentation | ADE20K (train)     | mIoU             | 48.3   | 15
