Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

About

Humans develop visual intelligence by perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses the challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle in which initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and is robust to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
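
For readers unfamiliar with the CorLoc figure quoted above, here is a minimal sketch of how the metric is commonly computed in unsupervised object discovery: the percentage of images whose single predicted box overlaps at least one ground-truth box with IoU of 0.5 or more. This is not taken from the EgoViT paper. The function names `iou` and `corloc`, the `(x_min, y_min, x_max, y_max)` box layout, and the inclusive 0.5 threshold are assumptions made for illustration, and the paper's evaluation protocol may differ in detail.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(
    predictions: Sequence[Box],
    ground_truths: Sequence[Sequence[Box]],
    threshold: float = 0.5,
) -> float:
    """Percentage of images whose predicted box matches any ground-truth box.

    predictions[i] is the one box predicted for image i;
    ground_truths[i] lists all annotated boxes for image i.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("need exactly one prediction per image")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gts)
    )
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical toy data: the first image is localized, the second is not.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
    print(corloc(preds, gts))
```

Because each image contributes either a hit or a miss, CorLoc rewards finding one object per image and says nothing about how many objects were missed, which is why it is usually reported alongside detection or segmentation metrics.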

Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang • 2026

Related benchmarks

Task                           Dataset             Result                             Rank
Object Detection               COCO 2017 (val)     AP 29.6                            2643
Instance Segmentation          COCO 2017 (val)     --                                 1201
Video Object Segmentation      DAVIS 2017 (val)    J mean 55                          1193
Image Classification           ImageNet-1K         --                                 600
Visual Object Tracking         TrackingNet (test)  Normalized Precision (Pnorm) 83.5  463
Object Tracking                LaSoT               AUC 64.7                           411
Visual Object Tracking         GOT-10k (test)      Average Overlap 67                 408
Video Object Segmentation      DAVIS 2017          Jaccard Index (J) 55               82
Image Classification           ImageNet-1k (val)   Accuracy 45.3                      59
Unsupervised Object Discovery  PASCAL VOC 2012     CorLoc 50.2                        42
Showing 10 of 19 rows
