Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

About

The egocentric and exocentric viewpoints of a human activity look dramatically different, yet invariant representations to link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time, even when not captured simultaneously or in the same environment. To this end, we propose AE2, a self-supervised embedding approach with two key designs: (1) an object-centric encoder that explicitly focuses on regions corresponding to hands and active objects; and (2) a contrastive-based alignment objective that leverages temporally reversed frames as negative samples. For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets -- including an ego tennis forehand dataset we collected, along with dense per-frame labels we annotated for each dataset. On the four datasets, our AE2 method strongly outperforms prior work in a variety of fine-grained downstream tasks, both in regular and cross-view settings.
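The second design above, using temporally reversed frames as negatives, can be illustrated with a small sketch. The abstract does not state the exact form of the alignment objective, so this example assumes a classic dynamic time warping (DTW) cost over per-frame embeddings and a hinge margin. Both are illustrative choices, and the names `dtw_cost`, `reversed_negative_loss`, and `margin` are hypothetical rather than taken from the AE2 code.

```python
import numpy as np


def dtw_cost(a, b):
    """Classic DTW alignment cost between embedding sequences a (T_a, D) and b (T_b, D)."""
    # Pairwise squared Euclidean distance between every frame of a and every frame of b.
    dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    n, m = dist.shape
    # acc[i, j] holds the cheapest cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    return acc[n, m]


def reversed_negative_loss(ego, exo, margin=1.0):
    """Hinge loss (an assumed form): aligning ego with exo should cost less,
    by at least `margin`, than aligning ego with the time-reversed exo."""
    pos = dtw_cost(ego, exo)        # correctly ordered pair
    neg = dtw_cost(ego, exo[::-1])  # temporally reversed negative
    return max(0.0, pos - neg + margin)
```

With toy one-dimensional "embeddings" that ramp from 0 to 1 at different speeds, the loss is zero when both sequences progress in the same direction and positive when one of them is reversed, which is the ordering signal the objective is meant to reward. In a real training setup the DTW would need a differentiable relaxation; that detail is omitted here.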

Zihui Xue, Kristen Grauman • 2023

Related benchmarks

Task                         Dataset          Result                 Rank
Action phase classification  Break Eggs       F1 Score 71.72         27
Frame retrieval              Break Eggs       mAP@10 65.85           27
Action phase classification  Pour Milk        F1 Score 85.17         21
Action phase classification  Pour Liquid      F1 Score 66.56         21
Action phase classification  Tennis Forehand  F1 Score 85.87         21
Frame retrieval              Pour Milk        mAP@10 84.9            21
Frame retrieval              Pour Liquid      mAP@10 65.79           21
Frame retrieval              Tennis Forehand  mAP@10 0.8683          21
Action phase classification  Dataset A        F1 Score (10%) 63.95   10
Ego2exo Frame Retrieval      CMU-MMAC         mAP@5 65.7             10
Showing 10 of 28 rows
