Seeing without Pixels: Perception from Camera Trajectories

About

Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. To this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal for uncovering video content. In other words, "how you move" can indeed provide valuable cues about "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensor and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
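The abstract describes aligning trajectory embeddings with text in a joint space via contrastive learning, but does not spell out the objective. A common choice for this kind of cross-modal alignment is a symmetric InfoNCE loss, where matched trajectory/caption pairs in a batch are positives and all other combinations are negatives. The sketch below is a hypothetical, pure-Python illustration of that objective, not the paper's actual implementation; the function names and toy embeddings are our own.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(traj_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    traj_emb[i] and text_emb[i] are a positive pair; every other
    combination in the batch serves as a negative. This is an
    illustrative stand-in for the paper's unspecified contrastive loss.
    """
    traj = [l2_normalize(v) for v in traj_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(traj)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(traj[i], text[j])) / temperature
               for j in range(n)] for i in range(n)]

    def ce(rows):
        # Cross-entropy with the diagonal (the matched pair) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [[logits[j][i] for j in range(n)] for i in range(n)]
    # Average the trajectory-to-text and text-to-trajectory directions.
    return 0.5 * (ce(logits) + ce(cols))

# Toy batch of two trajectory/caption embedding pairs.
traj = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce(traj, text)
loss_shuffled = info_nce(traj, list(reversed(text)))
print(loss_aligned < loss_shuffled)  # aligned pairs yield the lower loss
```

Minimizing this loss pulls each trajectory embedding toward its own caption and pushes it away from the others, which is what enables the cross-modal retrieval results reported below.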

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han · 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Proficiency estimation | Ego-Exo4D | Bouldering Proficiency Score | 65.41 | 16
Egocentric Text Retrieval | Ego-Exo4D | Physical iv Top-1 Accuracy | 56.1 | 8
Text Retrieval | DynPose-100K 1.0 (test) | Top-1 Accuracy | 46.3 | 8
Egocentric Text Retrieval | Nymeria | Top-1 Accuracy (legs) | 30.8 | 6
Keystep recognition | Ego-Exo4D | Recall Accuracy | 32.37 | 4
Keystep Localization | Ego-Exo4D | Rank@1 Accuracy (IoU=0.3) | 34.68 | 3
