Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

About

Humans develop visual intelligence through perceiving and interacting with their environment, a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses the challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle in which initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and is robust to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
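For readers unfamiliar with the CorLoc metric cited in the abstract: it is commonly defined as the fraction of images in which a single predicted box overlaps at least one ground-truth box with IoU of at least 0.5. The sketch below illustrates that general definition only. The box format (x1, y1, x2, y2), the function names, and the toy data are assumptions for illustration and are not taken from the EgoViT paper or its evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Intersection is zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of images whose predicted box matches any ground-truth box.

    predictions: one box per image (hypothetical input layout).
    ground_truths: list of ground-truth boxes per image.
    """
    hits = sum(
        any(box_iou(pred, gt) >= iou_threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return hits / len(predictions)


# Toy data: the first prediction matches its ground truth, the second misses.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(preds, gts)
```

Under this definition, an image counts as correctly localized as soon as one ground-truth box is matched, so the metric rewards finding any object rather than all objects.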

Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang • 2026

Related benchmarks

Task	Dataset	Result
Object Detection	COCO 2017 (val)	AP 29.6	2930
Instance Segmentation	COCO 2017 (val)	--	1304
Video Object Segmentation	DAVIS 2017 (val)	J mean 55	1251
Image Classification	ImageNet-1K	--	600
Object Tracking	LaSoT	AUC 64.7	519
Visual Object Tracking	TrackingNet (test)	Normalized Precision (Pnorm) 83.5	502
Visual Object Tracking	GOT-10k (test)	Average Overlap 67	461
Semantic segmentation	ADE20K	mIoU 30.6	90
Video Object Segmentation	DAVIS 2017	Jaccard Index (J) 55	82
Image Classification	ImageNet-1k (val)	Accuracy 45.3	64

Showing 10 of 19 rows
