Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
About
Humans develop visual intelligence by perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting requires separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. Together these create a virtuous cycle in which initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and is robust to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
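For readers unfamiliar with the CorLoc metric cited above, the following is a minimal sketch of how it is commonly computed in unsupervised object discovery: the fraction of images in which a single predicted box overlaps at least one ground-truth box with IoU of at least 0.5. The function names `iou` and `corloc`, the box format `(x_min, y_min, x_max, y_max)`, and the one-prediction-per-image convention are assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(
    predictions: Sequence[Box],
    ground_truths: Sequence[Sequence[Box]],
    threshold: float = 0.5,
) -> float:
    """Fraction of images whose predicted box matches any ground-truth box.

    predictions[i] is the single box predicted for image i (an assumed
    convention); ground_truths[i] lists all annotated boxes for image i.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("one prediction is expected per image")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gts)
    )
    return hits / len(predictions)
```

Under this definition, a score such as CorLoc 50.2 would mean that the predicted box was judged correct in 50.2% of the evaluated images.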
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 29.6 | 2643 |
| Instance Segmentation | COCO 2017 (val) | -- | -- | 1201 |
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 55 | 1193 |
| Image Classification | ImageNet-1K | -- | -- | 600 |
| Visual Object Tracking | TrackingNet (test) | Normalized Precision (Pnorm) | 83.5 | 463 |
| Object Tracking | LaSoT | AUC | 64.7 | 411 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 67 | 408 |
| Video Object Segmentation | DAVIS 2017 | Jaccard Index (J) | 55 | 82 |
| Image Classification | ImageNet-1k (val) | Accuracy | 45.3 | 59 |
| Unsupervised Object Discovery | PASCAL VOC 2012 | CorLoc | 50.2 | 42 |