ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes
About
Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view occlusions. ViewBridge combines a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, the input at inference time is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks, temporal keystep grounding and fine-grained keystep recognition, we outperform state-of-the-art approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/
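The abstract does not specify the form of the occlusion metric, the curriculum pacing, or the distillation loss, so the following is only a minimal sketch under stated assumptions. It assumes each training view pair already has a scalar difficulty score (a stand-in for the geometry-based occlusion metric), uses a linear pacing schedule, and substitutes the standard temperature-softened KL divergence for the paper's distillation objective. All function names and parameters (`curriculum_subset`, `distillation_loss`, `start_frac`, `temperature`) are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_subset(pairs: Sequence[T], scores: Sequence[float],
                      epoch: int, total_epochs: int,
                      start_frac: float = 0.25) -> List[T]:
    """Return the easiest view pairs available at `epoch`, easiest first.

    `scores` are assumed to be precomputed difficulty values (higher means
    more occluded); the linear pacing schedule is an assumption.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    # Sort indices from least to most occluded.
    order = sorted(range(len(pairs)), key=lambda i: scores[i])
    # Fraction of the data exposed grows linearly from start_frac to 1.0.
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    frac = start_frac + (1.0 - start_frac) * progress
    k = max(1, math.ceil(frac * len(pairs)))
    return [pairs[i] for i in order[:k]]


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions.

    This is the generic distillation loss, used here as a placeholder for
    the paper's action-centric objective.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0) when a student probability underflows
    return sum(pi * (math.log(pi) - math.log(max(qi, eps)))
               for pi, qi in zip(p, q) if pi > 0)
```

In a training loop, one would call `curriculum_subset` once per epoch to choose which view pairs to sample, then add `distillation_loss` between the teacher and student outputs for each sampled pair to the task loss.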
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Grounding | Ego-Exo4D E views | Recall@1 | 36 | 10 |
| Temporal Grounding | Ego-Exo4D M views | Recall@1 | 35 | 10 |
| Salient Object Detection | DUTLF Focal Stack | MAE | 0.065 | 7 |
| Keystep Recognition | Ego-Exo4D | Top-1 Accuracy | 24.07 | 6 |
| Keystep Recognition | EPFL | Top-1 Accuracy | 19.24 | 6 |
| Temporal Grounding | Ego-Exo4D D views | Recall@1 | 28 | 5 |
| Temporal Grounding | EPFL D views | Recall@1 | 31 | 5 |
| Keystep Recognition | LEMMA | Top-1 Accuracy | 27.86 | 4 |
| Temporal Grounding | LEMMA D views | Recall@1 | 18 | 4 |