Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction
About
Contrastive learning methods in self-supervised settings have primarily focused on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre-training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder-decoder self-supervised learning (SSL) framework that supports joint contrastive pre-training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder-decoder contrastive loss with non-competing objectives to enable the joint pre-training of encoder-decoder architectures. By adapting a contrastive SSL framework for dense prediction, DeCon establishes consistent state-of-the-art performance on most of the evaluated tasks when pre-trained on ImageNet-1K, COCO, and COCO+. Notably, when pre-training a ResNet-50 encoder on the COCO dataset, DeCon improves COCO object detection and instance segmentation compared to the baseline framework by +0.37 AP and +0.32 AP, respectively, and boosts semantic segmentation by +1.42 mIoU on Pascal VOC and by +0.50 mIoU on Cityscapes. These improvements generalize across recent backbones, decoders, datasets, and dense tasks beyond segmentation and object detection, and persist in out-of-domain scenarios, including limited-data settings, demonstrating that joint pre-training significantly enhances representation quality for dense prediction. Code is available at https://github.com/sebquetin/DeCon.git.
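The abstract does not spell out the form of the weighted encoder-decoder contrastive loss. As a rough illustration only, the sketch below assumes a convex combination of an encoder-level and a decoder-level loss, with a generic InfoNCE loss standing in for whichever contrastive loss the underlying framework uses. The names `info_nce`, `joint_loss`, `alpha`, and `temperature` are hypothetical and are not taken from the DeCon code.

```python
import math


def info_nce(query, keys, positive_index, temperature=0.2):
    """Generic InfoNCE loss for one query vector against a list of key vectors.

    Stand-in for the framework's contrastive loss (assumption, not DeCon's code).
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    q = normalize(query)
    # Cosine similarity between the query and every key, scaled by temperature.
    logits = [
        sum(a * b for a, b in zip(q, normalize(k))) / temperature
        for k in keys
    ]
    # Numerically stable log-sum-exp over all keys.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy with the positive key as the target class.
    return log_sum_exp - logits[positive_index]


def joint_loss(encoder_loss, decoder_loss, alpha=0.5):
    """Weighted sum of encoder and decoder losses (assumed convex combination)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * encoder_loss + (1.0 - alpha) * decoder_loss


# Toy usage: one query, two keys, the first key is the positive.
enc = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], positive_index=0)
dec = info_nce([0.6, 0.8], [[0.6, 0.8], [0.8, -0.6]], positive_index=0)
total = joint_loss(enc, dec, alpha=0.5)
```

Under this assumed form, setting `alpha` to 1 recovers encoder-only pre-training, so the weight controls how much the decoder objective contributes to the gradient.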
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 48.02 | 2731 |
| Semantic segmentation | ADE20K | mIoU | 39.25 | 936 |
| Object Detection | COCO (val) | -- | -- | 613 |
| Instance Segmentation | COCO (val) | APmk | 40.37 | 472 |
| Object Detection | COCO | AP50 (Box) | 62.43 | 190 |
| Semantic segmentation | ISIC (test) | mIoU | 83.66 | 59 |
| Human Keypoint Detection | COCO | AP | 65.88 | 30 |
| Semantic segmentation | PASCAL VOC 2007 (test) | mIoU | 75.4 | 29 |
| Semantic segmentation | VOC (val) | mIoU | 73.81 | 25 |
| Panoptic Segmentation | COCO | PQ | 40.9 | 23 |