ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

About

Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view occlusions. It combines a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks (temporal keystep grounding and fine-grained keystep recognition), we outperform state-of-the-art approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/ .
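
The curriculum idea described above, sorting training segments by an occlusion-related score and introducing harder ones over time, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` class, the `occlusion` field (a stand-in for the geometry-based metric, whose definition is not given here), the `curriculum_pool` function, and the linear pacing schedule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A training video segment (hypothetical container for this sketch)."""
    segment_id: str
    occlusion: float  # assumed score: higher means the view is more occluded


def curriculum_pool(segments: List[Segment], epoch: int, total_epochs: int,
                    start_fraction: float = 0.25) -> List[Segment]:
    """Return the segments available at `epoch`, ordered easiest first.

    The pool starts with the least-occluded `start_fraction` of the data and
    grows linearly (an assumed schedule) until the final epoch uses everything.
    """
    if total_epochs < 1 or not 0 <= epoch < total_epochs:
        raise ValueError("epoch must satisfy 0 <= epoch < total_epochs")
    # Sort from least to most occluded, i.e. from easy to hard.
    ordered = sorted(segments, key=lambda s: s.occlusion)
    # Training progress in [0, 1]; a single-epoch run uses the full pool.
    progress = 1.0 if total_epochs == 1 else epoch / (total_epochs - 1)
    fraction = start_fraction + (1.0 - start_fraction) * progress
    count = max(1, round(fraction * len(ordered)))
    return ordered[:count]
```

Calling `curriculum_pool(segments, epoch, total_epochs)` once per epoch yields a training pool that widens from the least-occluded segments to the full set. A real pipeline would draw view pairs from this pool and apply the distillation objective, neither of which is sketched here.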

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman • 2025

Related benchmarks

Task	Dataset	Result
Temporal Grounding	Ego-Exo4D E views	Recall@1 36	10
Temporal Grounding	Ego-Exo4D M views	Recall@1 35	10
Salient Object Detection	DUTLF Focal Stack	MAE 0.065	7
Keystep recognition	Ego-Exo4D	Top-1 Accuracy 24.07	6
Keystep recognition	EPFL	Top-1 Accuracy 19.24	6
Temporal Grounding	Ego-Exo4D D views	Recall@1 28	5
Temporal Grounding	EPFL D views	Recall@1 31	5
Keystep recognition	LEMMA	Top-1 Accuracy 27.86	4
Temporal Grounding	LEMMA D views	Recall@1 18	4
