Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

About

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks train either on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present LILA, a framework that learns pixel-accurate feature descriptors from videos. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps (depth and motion) estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.
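The abstract does not spell out how the linear in-context learner is built, so the following is only a minimal sketch of one plausible reading: a linear map is fitted in closed form from per-pixel features to cue values (depth and motion) on a context frame, then applied to the features of a query frame. The function names, the ridge regulariser, the feature and cue dimensions, and the synthetic data are all assumptions made for illustration, not details taken from LILA.

```python
import numpy as np

def fit_linear_readout(context_feats, context_cues, ridge=1e-3):
    """Closed-form ridge regression from per-pixel features to cue values.

    context_feats: (num_pixels, feat_dim) features of the context frame.
    context_cues:  (num_pixels, cue_dim) cue values, e.g. depth and 2D motion.
    The ridge term is an assumption that keeps the solve well conditioned.
    """
    feat_dim = context_feats.shape[1]
    gram = context_feats.T @ context_feats + ridge * np.eye(feat_dim)
    return np.linalg.solve(gram, context_feats.T @ context_cues)

def predict_cues(query_feats, weights):
    """Apply the linear map fitted on the context frame to query-frame features."""
    return query_feats @ weights

# Synthetic stand-in data (hypothetical sizes): 8-dim features, 3 cue channels.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(8, 3))
context_feats = rng.normal(size=(500, 8))
# Noisy cues, loosely mimicking imperfect off-the-shelf estimates.
context_cues = context_feats @ true_map + 0.01 * rng.normal(size=(500, 3))
query_feats = rng.normal(size=(200, 8))

weights = fit_linear_readout(context_feats, context_cues)
query_pred = predict_cues(query_feats, weights)
query_error = float(np.abs(query_pred - query_feats @ true_map).mean())
```

In this reading, features that make the cues linearly predictable across frames yield a low query error, which is the kind of signal a training objective could use. How LILA actually defines its loss is not stated in the abstract.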

Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari • 2026

Related benchmarks

Task	Dataset	Result
Video Object Segmentation	DAVIS 2017 (val)	J mean 72.5	1251
Semantic segmentation	ADE20K	mIoU 47.5	1028
Semantic segmentation	COCO Stuff (val)	mIoU 63.3	173
Surface Normal Estimation	NYUv2 (val)	--	26
Semantic segmentation	COCO-Stuff coarse set annotation (C=27) (held-out set (seen and 15 unseen categories))	mIoU (Seen) 34.6	4
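For readers unfamiliar with the mIoU metric in the table, the sketch below computes mean intersection-over-union on flat lists of per-pixel class labels. It is an illustrative toy, not any benchmark's evaluation code: the function name and the example labels are made up, and real benchmark scripts typically accumulate intersections and unions over a whole dataset and report the result as a percentage.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either the prediction or the target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy labels for six pixels and three classes (made up for illustration).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

The J mean reported for DAVIS 2017 is likewise a region-similarity (Jaccard) score, i.e. an IoU between predicted and ground-truth object masks averaged over the evaluated objects and frames.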

