Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

About

The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain, which are then used to improve both models in a positive feedback loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundation models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only, using a large frozen DINOv2 backbone, and show that it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code is available at https://github.com/TRAILab/DINO_Teacher
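
The two mechanisms named in the abstract, the exponential moving average (EMA) teacher and the patch feature alignment, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the decay value, the flat-list weight format, and the choice of cosine distance as the alignment measure are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def ema_update(teacher, student, decay=0.999):
    """Return EMA teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are hypothetical dicts mapping a parameter
    name to a flat list of floats. The decay value is an assumption.
    """
    return {
        name: [decay * t + (1.0 - decay) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def cosine_alignment_loss(student_patches, reference_patches, eps=1e-12):
    """Mean (1 - cosine similarity) over corresponding patch features.

    Cosine distance is an assumed stand-in for the alignment objective;
    the abstract only says student patch features are aligned with
    those of a DINO encoder.
    """
    total = 0.0
    for s, r in zip(student_patches, reference_patches):
        dot = sum(a * b for a, b in zip(s, r))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in r))
        total += 1.0 - dot / max(norm, eps)  # eps guards against zero vectors
    return total / len(student_patches)


# Toy usage: one EMA step, then an alignment loss on two patch vectors.
teacher = ema_update({"w": [1.0, 2.0]}, {"w": [0.0, 4.0]}, decay=0.9)
loss = cosine_alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
```

In the Mean Teacher setup the abstract describes, the teacher produced by an update like `ema_update` is also the labeller, which is the coupling the authors argue against. In DINO Teacher the labeller is instead a separate model built on a frozen DINOv2 backbone, and an alignment term of this general kind pulls the student's source and target patch features toward the DINO representation.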

Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander • 2025

Related benchmarks

Task	Dataset	Result
Object Detection	Cityscapes to Foggy Cityscapes (test)	mAP 55.4	196
Object Detection	Cityscapes -> Foggy Cityscapes	mAP 55.4	73
Object Detection	BDD100K (val)	mAP 47.8	71
Object Detection	Cityscapes → BDD100k	Truck AP 44.3	18
Domain Adaptive Object Detection	Foggy Cityscapes (val)	AP (Person) 48.5	18
Object Detection	ACDC	mAP50 (Fog) 68.6	16
Object Detection	Foggy Cityscapes full (val)	AP (Person) 48.5	15
Object Detection	ACDC Fog (test)	mAP 68.6	3
Object Detection	ACDC Night (test)	mAP 36.4	3
Object Detection	ACDC Rain (test)	mAP 39	3

