Self-Supervised Visual Representation Learning from Hierarchical Grouping
About
We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into regions, followed by merging of those regions into a tree hierarchy. A small supervised dataset suffices for training this grouping primitive. Across a large unlabeled dataset, we apply this learned primitive to automatically predict hierarchical region structure. These predictions serve as guidance for self-supervised contrastive feature learning: we task a deep network with producing per-pixel embeddings whose pairwise distances respect the region hierarchy. Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Semantic segmentation | PASCAL VOC 2012 (test) | mIoU64.7 | 1415 | |
| Semantic segmentation | PASCAL VOC (val) | mIoU48.8 | 362 | |
| Semantic segmentation | PASCAL (val) | mIoU48.8 | 25 | |
| Semantic Segment Retrieval | PASCAL (val) | mIoU (7 classes)24.6 | 10 |