Self-Distillation of Hidden Layers for Self-Supervised Representation Learning
About
The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification on ImageNet-1K and iNaturalist-21, and semantic segmentation on ADE20K and Cityscapes.
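The hierarchical objective can be sketched as a sum of per-layer distillation terms. The snippet below is a minimal NumPy illustration, not the paper's exact implementation: it assumes a simple mean-squared error between the student's predictions and (stop-gradient) teacher features at each supervised hidden layer, and the function name `bootleg_loss` is hypothetical.

```python
import numpy as np

def bootleg_loss(student_preds, teacher_feats):
    """Hypothetical multi-layer self-distillation loss (illustrative sketch).

    student_preds: list of arrays, the student's predicted latents,
                   one per supervised teacher layer.
    teacher_feats: list of arrays, the teacher's hidden-layer features
                   (treated as fixed targets, i.e. no gradient flows here).
    Returns the mean over layers of the per-layer MSE.
    """
    assert len(student_preds) == len(teacher_feats)
    per_layer = [np.mean((p - t) ** 2)
                 for p, t in zip(student_preds, teacher_feats)]
    return float(np.mean(per_layer))

# Toy example: 3 teacher layers, 4 tokens each, embedding dim 8.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(4, 8)) for _ in range(3)]
student = [t + 0.1 * rng.normal(size=t.shape) for t in teacher]
loss = bootleg_loss(student, teacher)
```

In a full training setup the teacher would typically be an EMA copy of the student (as in other self-distillation methods), but that machinery is omitted here for brevity.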
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K (test) | Top-1 Accuracy | 85.4 | 848 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 80.6 | 600 |
| Semantic Segmentation | ADE20K | mIoU | 41.2 | 366 |
| Image Classification | VTAB | Overall Accuracy | 68.7 | 103 |
| Semantic Segmentation | Cityscapes | mIoU | 42.8 | 82 |
| Image Classification | iNaturalist 2021 | Top-1 Accuracy | 77.1 | 70 |
| Semantic Segmentation | ADE20K (train) | mIoU | 48.3 | 15 |