Seeing without Pixels: Perception from Camera Trajectories

About

Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. To this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal for uncovering video content. In other words, "how you move" can indeed provide valuable cues about "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensor and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
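The abstract describes aligning trajectory embeddings with text in a joint space via contrastive learning, but does not spell out the objective. A common choice for this kind of cross-modal alignment is a symmetric InfoNCE loss, where matched trajectory/caption pairs in a batch are positives and all other combinations are negatives. The sketch below is a hypothetical, pure-Python illustration of that objective, not the paper's actual implementation; the function names and toy embeddings are our own.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(traj_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    traj_emb[i] and text_emb[i] are a positive pair; every other
    combination in the batch serves as a negative. This is an
    illustrative stand-in for the paper's unspecified contrastive loss.
    """
    traj = [l2_normalize(v) for v in traj_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(traj)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(traj[i], text[j])) / temperature
               for j in range(n)] for i in range(n)]

    def ce(rows):
        # Cross-entropy with the diagonal (the matched pair) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [[logits[j][i] for j in range(n)] for i in range(n)]
    # Average the trajectory-to-text and text-to-trajectory directions.
    return 0.5 * (ce(logits) + ce(cols))

# Toy batch of two trajectory/caption embedding pairs.
traj = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce(traj, text)
loss_shuffled = info_nce(traj, list(reversed(text)))
print(loss_aligned < loss_shuffled)  # aligned pairs yield the lower loss
```

Minimizing this loss pulls each trajectory embedding toward its own caption and pushes it away from the others, which is what enables the cross-modal retrieval results reported below.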

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han · 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Proficiency estimation | Ego-Exo4D | Bouldering Proficiency Score | 65.41 | 16
Egocentric Text Retrieval | Ego-Exo4D | Physical iv Top-1 Accuracy | 56.1 | 8
Text Retrieval | DynPose-100K 1.0 (test) | Top-1 Accuracy | 46.3 | 8
Egocentric Text Retrieval | Nymeria | Top-1 Accuracy (legs) | 30.8 | 6
Keystep recognition | Ego-Exo4D | Recall Accuracy | 32.37 | 4
Keystep Localization | Ego-Exo4D | Rank@1 Accuracy (IoU=0.3) | 34.68 | 3
